## Abstract

A large body of evidence suggests that an approximate number sense allows humans to estimate numerosity in sensory scenes. This ability is widely observed in humans, including those without formal mathematical training. Despite this, many outstanding questions remain about the nature of the numerosity representation in the brain. Specifically, it is not known whether approximate numbers are represented as scalar estimates of numerosity or, alternatively, as probability distributions over numerosity. In the present study, we used a multisensory decision task to distinguish these possibilities. We trained human subjects to decide whether a test stimulus had a larger or smaller numerosity compared with a fixed reference. Depending on the trial, the numerosity was presented as either a sequence of visual flashes or a sequence of auditory tones, or both. To test for a probabilistic representation, we varied the reliability of the stimulus by adding noise to the visual stimuli. In accordance with a probabilistic representation, we observed a significant improvement in multisensory compared with unisensory trials. Furthermore, a trial-by-trial analysis revealed that although individual subjects showed strategic differences in how they leveraged auditory and visual information, all subjects exploited the reliability of unisensory cues. An alternative, nonprobabilistic model, in which subjects combined cues without regard for reliability, was not able to account for these trial-by-trial choices. These findings provide evidence that the brain relies on a probabilistic representation for numerosity decisions.

- Bayesian
- decision making
- multisensory
- numerosity
- vision

Considerable evidence suggests that animals and humans possess an approximate number sense, believed to be the precursor of complex symbolic manipulations of numbers developed by humans (Dehaene 2011). The approximate number sense allows us to perceive numerosity in sensory scenes even when other cues that might be correlated with number, such as size or surface area, are eliminated (Piazza et al. 2004). Furthermore, evidence suggests that this ability is innate: preverbal babies are capable of estimating numbers as large as 32 (Xu 2003; Xu and Spelke 2000; Xu et al. 2005), and adults in indigenous cultures without mathematical education can represent and manipulate numbers in some circumstances (Pica et al. 2004).

A number of experiments have identified cortical regions that may be necessary for the representation of number. For example, neurons in monkey posterior parietal cortex are sometimes “tuned” for a particular number: they fire maximally for that number and less so for other numbers (Nieder and Dehaene 2009; Nieder and Miller 2004; but see Roitman et al. 2007). Some areas in the human brain are likewise sensitive to numerosity: human parietal areas are activated in numerosity tasks (Chochon et al. 1999; Pinel et al. 2001).

Despite these findings, it is not clear whether the brain uses a scalar or a probabilistic strategy to represent number. If neurons were to represent numbers as scalars (Fig. 1, *A* and *B*, *right* panels in *insets*), neural circuits would represent one single-value estimate of numerosity on a given trial in response to a stimulus such as the dot display shown in Fig. 1. Naturally, this estimate might differ from trial-to-trial since either sensory noise or a subject's bias might drive the scalar estimate up or down. However, within a trial, only a single number is represented. The disadvantage of a scalar representation is that the reliability of the number estimate is unknown (*right* panels in *insets* of Fig. 1, *A* and *B*, are the same, although the stimulus is unreliable in Fig. 1*B*). This is particularly problematic when there are two sources of information, a reliable source and an unreliable source. In this case, the optimal strategy consists of summing the two estimates weighted by their respective reliability (Ernst and Banks 2002; Jacobs 1999). This is intuitive: reliable inputs should have a stronger impact on decisions than unreliable ones. Without access to reliability, subjects cannot apply this optimal strategy but can use a simpler strategy such as selecting one source of information randomly on every trial to guide the choice (we refer to this as the “coin flip” model). Interestingly, scalar representations are commonly used in neural models, particularly models based on attractor dynamics where the dimension of the attractor is equal to the dimension of the encoded variables (Cohen and Grossberg 1983; Deneve et al. 2001; Hopfield 1982; Skaggs et al. 1995; Wang 2009; Wills et al. 2005).

An alternative to a scalar representation is a probabilistic representation (Fig. 1, *A* and *B*, *left* panels in *insets*). Theories of probabilistic inference in neural circuits describe how probability distributions could be extracted, represented, and used for further inference (for review, see Pouget et al. 2013). For instance, neurons might use probabilistic population codes (Ma et al. 2006) in which the mean of the distribution is encoded by the position of the hill of activity, whereas the variance of the distribution is encoded by the gain, or height, of the hill of activity, with low gain corresponding to higher variance. These representations have the advantage that they automatically encode the reliability associated with the numerosity estimate: unreliable (e.g., low contrast) stimuli would lead to much broader distributions than reliable stimuli (compare *left* panels in *insets* of Fig. 1, *A* and *B*). Access to stimulus reliability means that subjects are better equipped to use multiple sources of information for decisions. For instance, they could combine sources of information according to their reliability (“optimal model;” Ernst and Banks 2002; Jacobs 1999; van Beers et al. 1996), or they could determine the more reliable source and ignore the other source (referred to as “max model” hereafter). Both strategies are probabilistic in that they require an estimate of stimulus reliability to define how stimuli are used to inform the decision. The two strategies differ in the possible weights that are assigned to each source of information: in an optimal model, weights can take on a range of values, whereas in a max model, weights are either 0 or 1.

The consequence of scalar versus probabilistic representations, as described above, is explicit when multiple sources of information weigh in on the choice. Because most numerosity studies present subjects with a single source of information, the default assumption has been that the brain represents numbers as scalars. Existing experiments have mainly focused on the models' predictions for how noise scales with the number of stimulus objects: a logarithmic mental number line with constant noise (Dehaene and Changeux 1993) versus a linear mental number line with linearly increasing noise (Gallistel and Gelman 1992). These potential noise sources are critical for understanding trial-to-trial fluctuations of the number estimate but leave unexamined how numbers are represented on individual trials.

The existence of a probabilistic representation for quantities other than numerosity has been explicitly tested via decisions in which multiple sources of evidence are present. These decisions have exposed probabilistic representations in the neural systems used to estimate depth (Jacobs 1999), limb position (van Beers et al. 1996), surface slant (Knill and Saunders 2003), heading direction (Fetsch et al. 2009), repetition rate (Sheppard et al. 2013), object location (Alais and Burr 2004), and object height (Ernst and Banks 2002). In all cases, experimenters first manipulated the reliability of each source of information, typically by varying the amount of noise added to a stimulus (Fetsch et al. 2009) on every trial or adjusting its intensity (Raposo et al. 2012; Sheppard et al. 2013). Next, they measured whether subjects leveraged that reliability when making multisensory decisions. In many studies, subjects' choices on multisensory trials were influenced by the reliability of each unisensory stimulus, in keeping with a probabilistic representation. Interestingly, similar probabilistic strategies are evident in multiple stimulus contexts (Knill and Saunders 2003; van Beers et al. 1996) and multiple species (Fetsch et al. 2009; Raposo et al. 2012; Sheppard et al. 2013), despite different strategies for manipulating reliability in each study.

The use of multiple sources of information that differ in their reliability is key because it is under these conditions that probabilistic and scalar representations make different predictions. Thus, although previous experiments have shown that bimodal numerosity stimuli can improve behavioral performance (Jordan and Baker 2011; Jordan et al. 2008), these results cannot be taken as conclusive evidence for probabilistic numerosity representations because they used only one level of reliability for each modality.

We used this approach to test for a probabilistic representation of numerosity. We designed a novel multisensory numerosity task and evaluated whether subjects used stimulus reliability to guide decisions. We then considered three models for multisensory integration (the optimal, max, and coin flip models), only two of which take into account reliability (the optimal and max models), and asked which one of these models best accounts for the subjects' performance via Bayesian model comparison. We found that subjects made more precise numerosity judgments when stimuli were multisensory and that they combined the two unisensory cues via a strategy that took into account reliability. These observations support a probabilistic number representation.

## METHODS

#### Behavioral task.

Subjects were trained on a multisensory numerosity task. Each trial consisted of a series of auditory or visual “events” (duration: 1 frame = 10 ms) embedded in background noise of constant amplitude. Visual events were flashes of light, and auditory events were brief tones. To prevent subjects from performing the task solely on the basis of nonnumeric stimulus attributes, such as event rate or total duration, stimuli were generated as follows (Fig. 2). After selecting the numerosity (Fig. 2*A*), we sampled the desired mean interevent interval of a given trial from a uniform distribution between 55 and 110 ms (Fig. 2*B*). The *N* individual intervals each consisted of a fixed duration of 50 ms plus a random duration sampled from an exponential distribution (Fig. 2*C*). This ensured that intervals would be longer than 50 ms, which is long enough for events to be perceived as separate (Burr et al. 2009). The falloff of the exponential distribution was selected such that its mean, when added to the fixed duration, matched the desired mean interval. The chosen values produced a rapid, arrhythmic event sequence, making verbal counting highly inefficient and inaccurate (Cordes et al. 2007).
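The interval-generation procedure described above can be sketched as follows. This is an illustrative Python re-implementation, not the original Psychophysics Toolbox/MATLAB code; the function and variable names are our own.

```python
import numpy as np

def make_event_times(n_events, rng=None):
    """Sketch of the interval-generation procedure: draw a desired mean
    interevent interval uniformly from 55-110 ms, then build each interval
    as a fixed 50-ms minimum plus an exponentially distributed random part
    whose mean makes the overall interval mean equal the desired value."""
    rng = np.random.default_rng() if rng is None else rng
    mean_interval = rng.uniform(0.055, 0.110)   # s, per-trial mean interval
    fixed = 0.050                               # s, minimum event separation
    exp_mean = mean_interval - fixed            # mean of the exponential part
    intervals = fixed + rng.exponential(exp_mean, size=n_events)
    return np.cumsum(intervals)                 # event onset times in seconds

times = make_event_times(10)  # 10 arrhythmic onsets, all spaced > 50 ms apart
```

Because the per-trial rate is drawn independently of numerosity, the same numerosity can be delivered quickly or slowly, which is what decorrelates rate from number.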

Although the above procedure completely decorrelates the independently chosen event rate from numerosity, there is still a correlation between total stimulus duration and numerosity (correlation coefficient 0.70). To ameliorate this, the background noise continued until either the subject responded or the maximum time of 2.25 s elapsed; this maximum is longer than the expected duration of a stimulus of the highest numerosity at the lowest rate (1.54 s). Furthermore, we instructed subjects that they could respond with a key press whenever they were ready and did not need to wait until the end of the stimulus.

The numerosity *n* of the stimulus was the same for auditory and visual trials and was randomly selected from the set {7, 8, 9, 11, 12, 14}. Subjects compared the stimulus with a fixed reference (*n* = 10), which they learned with experience. The selected numerosities are large enough to prevent subitizing (Kaufman et al. 1949) but small enough to allow temporal integration of the entire stimulus. In preliminary experiments, informing subjects of the reference numerosity in advance biased them toward small numerosities. The biases were hard to eliminate even after extensive training; when subjects had to learn the reference numerosity implicitly over trials, this problem did not occur. Similar biases were previously shown in experiments asking subjects to map numerosity on a verbal estimate and are likely a consequence of mapping an exact verbal number onto an approximate numerosity estimate (Izard and Dehaene 2008).

Trials were defined not only by numerosity but also by “reliability.” Manipulating reliability is a standard procedure in multisensory experiments and allows experimenters to control performance on unisensory trials, leading to different predictions for multisensory trials (Alais and Burr 2004; Ernst and Banks 2002; Fetsch et al. 2009; Sheppard et al. 2013). Reliability was defined as the amplitude of the stimulus events relative to background noise. For visual stimuli, the amplitude is the brightness of the visual flashes; for auditory stimuli, the amplitude is the loudness of the auditory tones. In the visual condition, three levels of reliability were used for most subjects. These were determined empirically during training (see below). The reliabilities were defined as the stimulus amplitudes that gave rise to ∼65%, 80%, and 95% performance for numerosities *n* = 7 and *n* = 14. The same level of background noise was used in all three levels of reliability; only the brightness of the stimulus changed. This further discouraged subjects from using total brightness (integrated over time) as a proxy for numerosity. For the auditory stimuli, a single level of reliability was used; this was the stimulus amplitude that gave rise to ∼80% performance for numerosities *n* = 7 and *n* = 14 (Fig. 2*E*). Because the predictions of the three models only depend on the relative reliability of visual and auditory modality, choosing a single auditory reliability was sufficient to test the models while allowing for more trials per condition.

We randomly interleaved trials of different modalities (auditory, visual, multisensory) and numerosities. For most subjects, we also randomly interleaved the visual reliability level (∼65%, 80%, and 95%) on visual and multisensory trials. Two subjects (S42 and S44) were presented with only one visual reliability level at a time during the study. Each specific modality and reliability was chosen with equal probability. For example, for subjects with three reliability levels, 3/7 of trials were visual, 3/7 multisensory, and 1/7 auditory.

For multisensory trials, visual and auditory streams with equal numerosity were generated at the same rate and presented simultaneously, but individual events (except the last one) were not synchronized (Fig. 2*D*). The advantage of presenting asynchronous event streams in the multisensory condition is that it rules out the possibility that multisensory improvement only stems from improved detection of individual events rather than integration over modalities and time (Raposo et al. 2012).

#### Procedure.

We report data from 11 human volunteers (5 men, 10 right-handed, 0 self-reported synesthetes, age 18–62 yr) with normal or corrected-to-normal vision and normal hearing. Volunteers were recruited through fliers posted at Cold Spring Harbor Laboratory. Experiments were conducted in a large sound-isolating booth (Industrial Acoustics, Bronx, NY); all subjects were naive about the experiment and did not know the value of the reference numerosity (*n* = 10). Subjects provided written consent before participating in experiments and were paid $15/h. All experimental procedures were consistent with the Declaration of Helsinki and were approved by the Cold Spring Harbor Institutional Review Board. Subjects were instructed to use one of two key press responses to indicate whether the number of events was high or low. They were not informed of the category boundary but learned it with experience.

Visual stimuli were displayed on a cathode ray tube monitor (Dell M991) using a refresh rate of 100 Hz. Subjects were seated comfortably in front of the monitor; the distance from their eyes to the screen was ∼51 cm. Stimulus presentation and data acquisition were controlled by the Psychophysics Toolbox (Pelli 1997) running in MATLAB (The MathWorks, Natick, MA) on a Quad Core Macintosh computer. Subjects were instructed to fixate a central black spot (0.22 × 0.24° of visual angle); eye position was not monitored because small deviations in eye position are unlikely to impact rate judgments about a large flashing stimulus. After a delay (500 ms), the stimulus presentation began.

Auditory events were pure tones (220 Hz) that were played from a single speaker attached to the left side of the monitor. The visual stimulus, a flashing square that subtended 10° × 10° of visual angle (azimuth = 17.16°), was positioned eccentrically so that its leftmost edge touched the left side of the screen. This configuration meant that auditory and visual stimuli were close together; spatial proximity has been previously shown to encourage multisensory binding (Kording et al. 2007; Slutsky and Recanzone 2001). The timing of auditory and visual events was confirmed regularly using a photodiode and a microphone connected to an oscilloscope. Both auditory and visual events were played amid background white noise. For the visual stimulus, the white noise was restricted to the region of the screen where the stimulus was displayed. The auditory and visual equipment configurations used in the present study are similar to those used in a previous study in which subjects were trained to make decisions about rate and were shown to ignore numerosity (Raposo et al. 2012).

Subjects reported decisions by pressing one of two keys on a standard keyboard. They received auditory feedback, which consisted of a high tone (7,019 Hz) for correct and a low tone (219 Hz) for incorrect choices. After the feedback, subjects could initiate a new trial by pressing a third key.

We provided feedback to keep subjects motivated throughout the experiment, which lasted for many days. In each daily session, subjects completed ∼750 trials, separated into 3 blocks. Between blocks, subjects were encouraged to take a 5-min break. At the end of the session, subjects were shown their average score of the day for further motivation.

We trained subjects for 2–3 days to reinforce the association between numerosity and correct choice for stimuli of very high reliability. Afterward, we used 2 days of Bayesian adaptive estimation (Kontsevich and Tyler 1999) and/or a few days of manual adjustment to find the three amplitude levels for the visual condition and the single amplitude level for the auditory condition with approximately the desired performance. We then collected data for an additional 5–20 days (1,342–24,077 trials). Subjects typically ran on consecutive days (excluding weekends) and performed the task for 45–50 min (with breaks). Variability in the total number of days for each subject was driven by two factors. First, subjects differed in the total number of days they were willing to participate. Second, some subjects were willing to come for additional days but exhibited unstable behavior. For example, some subjects achieved the desired performance level during training but then showed improvement on one modality but not the other. The improved performance required us to change the signal-to-noise ratio of the stimulus to achieve the desired performance level for each modality. Because we wanted to pool data only over days with equal stimulus parameters, days preceding stimulus adjustments were excluded.

#### Data analysis.

We computed psychometric functions for trials in which either of the two valid response keys was pressed. For each numerosity, modality, and reliability, we computed the proportion of trials in which subjects reported numerosity to be higher than the reference. Standard errors for these proportions were computed using the binomial distribution. We then fit the choice data for each modality and reliability by a cumulative Gaussian function of log (numerosity) using maximum-likelihood estimation. The logarithm of numerosity is a particularly advantageous parameterization, because Weber's law, which is valid for numerosity (Shepard et al. 1975), can be modeled as constant noise on a logarithmic number line (Dehaene 2003). On the logarithmic number line we expect the average width of the posterior distribution (the Weber fraction) to be the same for the largest and smallest test numerosity and to also coincide with the width of the cumulative Gaussian fit of the psychometric curve.
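A maximum-likelihood fit of this kind can be implemented in a few lines. The following is a Python/SciPy sketch rather than the authors' MATLAB code; the parameter names *b* (bias) and σ are our own labels for the mean and width of the cumulative Gaussian on the log number line.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_psychometric(numerosities, n_larger, n_total, ref=10.0):
    """Maximum-likelihood fit of a cumulative Gaussian in log(numerosity).
    Returns (b, sigma): bias and noise width on the log number line."""
    x = np.log(numerosities) - np.log(ref)

    def neg_log_lik(theta):
        b, log_sigma = theta
        p = norm.cdf((x - b) / np.exp(log_sigma))
        p = np.clip(p, 1e-9, 1 - 1e-9)  # guard against log(0)
        # binomial log-likelihood of the observed "larger" counts
        return -np.sum(n_larger * np.log(p) + (n_total - n_larger) * np.log(1 - p))

    res = minimize(neg_log_lik, x0=[0.0, np.log(0.2)], method="Nelder-Mead")
    b, log_sigma = res.x
    return b, np.exp(log_sigma)
```

Fitting log σ rather than σ keeps the width positive without constrained optimization.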

Next, we generated error estimates on the standard deviation by a parametric bootstrap given the fitted psychometric function (Wichmann and Hill 2001). For each visual reliability level *i* and bootstrap iteration *k*, we predicted the optimal multisensory standard deviation by
(1) σ_{mult,i,k} = σ_{vis,i,k}σ_{aud,k}/(σ_{vis,i,k}^{2} + σ_{aud,k}^{2})^{1/2}
This predicted bootstrap distribution for the multisensory condition was then compared with the parametric bootstrap distribution using the fitted multisensory psychometric function to assess optimality. Using this approach provides a way of not only predicting multisensory performance but also generating a standard error on that prediction. This allows us to quantitatively evaluate the degree to which individual subjects conform to the optimal prediction. Bootstrapped distributions were examined carefully in each subject to ensure that there was no evidence of bias or skew (Efron 1987).

For the max model and the coin flip model, *Eq. 1* is replaced by
(2) σ_{mult,i,k} = min(σ_{vis,i,k}, σ_{aud,k})
and
(3) σ_{mult,i,k} = 2σ_{vis,i,k}σ_{aud,k}/(σ_{vis,i,k} + σ_{aud,k})
respectively. That *Eq. 3* is a good approximation for the coin flip model was shown by Bahrami et al. (2010).
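The three predictions of *Eqs. 1–3* can be written compactly. In the sketch below (Python, our own naming), the coin flip expression assumes the average-slope form of Bahrami et al. (2010): psychometric slopes, which scale as 1/σ, average across the two randomly chosen modalities.

```python
import numpy as np

def predict_multisensory_sigma(sigma_vis, sigma_aud, model):
    """Predicted multisensory threshold under the three models (Eqs. 1-3)."""
    if model == "optimal":
        # reliability-weighted fusion: inverse variances add
        return np.sqrt(sigma_vis**2 * sigma_aud**2 / (sigma_vis**2 + sigma_aud**2))
    if model == "max":
        # use only the more reliable (smaller-sigma) modality
        return min(sigma_vis, sigma_aud)
    if model == "coinflip":
        # pick a modality at random; slopes (1/sigma) average
        return 2 * sigma_vis * sigma_aud / (sigma_vis + sigma_aud)
    raise ValueError(model)
```

Note the ordering these formulas imply: σ_optimal ≤ σ_max ≤ σ_coinflip, with equality of the last two only when the unisensory reliabilities match.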

#### Bayesian model comparison of the optimal, max, and coin flip models.

The models are distinguished by their prediction of multisensory performance given the unisensory performances of a subject at a given reliability level. These relationships have the form
(4) σ_{mult} = *f*(σ_{aud}, σ_{vis})
where *f* is given by *Eqs. 1*, *2*, and *3* for the optimal, max, and coin flip models, respectively.
For Bayesian model comparison, we first calculate the model likelihood *p*(*data*|*model*), which is given by marginalizing the (parameter) likelihood over all parameters θ,
(5) *p*(*data*|*model*) = ∫ *p*(*data*|θ, *model*) *p*(θ|*model*) dθ
where *p*(*data*|θ, *model*) is the likelihood of a given model and *p*(θ|*model*) is the prior over parameters. We assume a flat, improper prior over parameters. Bayesian model comparison then evaluates the Bayes factor
*K* = *p*(*data*|*model 1*)/*p*(*data*|*model 2*)
to test whether *model 1* is better supported by the data than *model 2*, which is the case if the Bayes factor is substantially larger than 1. We also assume the model priors *p*(*model*_{i}) to be equal such that the Bayes factor is purely given in terms of the model likelihoods.

In our case, the integral in *Eq. 5* cannot be calculated analytically and requires an approximation. We use the Laplace approximation, which approximates the integrand by a Gaussian distribution in θ. The mean and covariance of the Gaussian are the maximum-likelihood estimate θ_{ML} and the inverse Hessian Σ_{ML} of the log-likelihood evaluated at θ_{ML}, both of which can be determined numerically. The model log-likelihood is then given by
ln *p*(*data*|*model*) ≈ ln *p*(*data*|θ_{ML}, *model*) + (*n*/2)ln(2π) + (1/2)ln|Σ_{ML}|
where *n* is the number of parameters. As we show below, *n* = 5 in our case.

The model likelihood can be evaluated individually for each data triplet of auditory, visual, and multisensory responses of a single subject at a given reliability level. It is given by
(6) *p*(*data*|θ, *model*) = ∏_{c,x} *Bin*[*num*_{larger}(*c*, *x*)|*num*_{cond}(*c*, *x*), Ψ(*x*, σ_{c}, *b*_{c})]
The index *c* runs over modalities and *x* over presented numerosities; *num*_{larger} and *num*_{cond} are the number of “larger” responses and the total number of responses at a given numerosity and modality, *Bin*(*n*|*N*, *p*) is the binomial distribution, and Ψ(*x*, σ_{c}, *b*_{c}) is the psychometric function. The psychometric function in this case is defined by
Ψ(*x*, σ_{c}, *b*_{c}) = ϵ + (1 − 2ϵ)Φ[(ln *x* − *b*_{c})/σ_{c}]
where ϵ = 0.01 is a static lapse rate and Φ(*y*) is the normal cumulative distribution function. Note that after the model-specific constraint in *Eq. 4* is imposed, which determines σ_{mult} in terms of σ_{aud} and σ_{vis}, the parameter space of each model is five-dimensional: θ = (σ_{aud}, σ_{vis}, *b*_{aud}, *b*_{vis}, *b*_{mult}).

The logarithm of the likelihood in *Eq. 6* can be evaluated explicitly, and we find the maximum-likelihood estimate θ_{ML} using the MATLAB function *fminsearch*. We find the Hessian (i.e., the 5 × 5 matrix of second derivatives) numerically using the tool *hessian* (D'Errico 2011).
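The Laplace evidence computation can be sketched end to end as follows. This Python version is a stand-in for *fminsearch* and the numerical *hessian* tool, and, as in the text, it assumes a flat improper prior, so the evidence is just the integrated likelihood.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(neg_log_lik, theta0):
    """Laplace approximation to ln p(data|model):
    ln p(data|theta_ML) + (n/2) ln(2*pi) - (1/2) ln det H,
    where H is the Hessian of the negative log-likelihood at theta_ML
    (equivalently +1/2 ln|Sigma_ML| with Sigma_ML = inverse Hessian)."""
    res = minimize(neg_log_lik, theta0, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8})
    theta_ml = res.x
    n = len(theta_ml)

    # central-difference numerical Hessian of the negative log-likelihood
    h = 1e-4
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (neg_log_lik(theta_ml + ei + ej)
                       - neg_log_lik(theta_ml + ei - ej)
                       - neg_log_lik(theta_ml - ei + ej)
                       + neg_log_lik(theta_ml - ei - ej)) / (4 * h**2)

    _, logdet = np.linalg.slogdet(H)
    return -res.fun + 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
```

A quick sanity check: for a standard Gaussian negative log-likelihood ½θᵀθ, the true integral of exp(−½|θ|²) is (2π)^{n/2}, which the approximation recovers exactly because the integrand is Gaussian.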

The overall Bayes factor is obtained by multiplying individual Bayes factors over all reliabilities and subjects. The logarithm of the overall Bayes factor for the three models is shown in Fig. 4*D* and is reported in results.

To ensure that subjects' decisions were informed by the numerosity of the stimulus and not only by its duration, we performed a probit regression with both numerosity and duration as predictors. Because duration and numerosity are not perfectly correlated, this analysis allows us to isolate their relative influence on subjects' decisions. The regression was performed separately for each subject, modality, and reliability level, for a total of 75 conditions.
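For illustration, a minimal probit regression of this kind can be hand-rolled as below (Python; the simulated data and all names are our own, not the study's code or data).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_probit(X, y):
    """Probit regression: choices y in {0,1} are modeled as
    P(y = 1) = Phi(X @ beta), fit by maximum likelihood.
    Columns of X would be numerosity and duration; an intercept is added."""
    X = np.column_stack([np.ones(len(X)), X])

    def nll(beta):
        p = np.clip(norm.cdf(X @ beta), 1e-9, 1 - 1e-9)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(nll, np.zeros(X.shape[1]), method="BFGS")
    return res.x  # [intercept, beta_numerosity, beta_duration]
```

Because the two regressors enter jointly, a large numerosity coefficient alongside a near-zero duration coefficient indicates that choices track number rather than duration, even when the two are correlated.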

## RESULTS

We measured behavior in 11 human volunteers to determine whether subjects take into account stimulus reliability when making numerosity decisions about multisensory stimuli. Assuming that the visual and auditory estimates of numerosity are corrupted by zero mean noise with variances σ_{vis}^{2} and σ_{aud}^{2}, the optimal strategy is to weight the two unisensory estimates in proportion to their normalized reliability (or inverse variance). In other words, the more reliable unisensory estimate should contribute more to the multisensory estimate. This strategy requires knowledge of the reliability of each unisensory estimate, which is available if the brain represents numerosity probabilistically.

The optimal model also predicts that multisensory decisions will be more precise than either of the unisensory estimates (Alais and Burr 2004; Ernst and Banks 2002; Fetsch et al. 2009, 2012; Jacobs 1999; Raposo et al. 2012; Sheppard et al. 2013; van Beers et al. 1996), as specified by *Eq. 1*. Numerosity judgments of a representative subject at a single reliability are shown in Fig. 3*A*. The psychometric function for multisensory decisions (Fig. 3*A*, blue curve) is steeper than that for either auditory or visual decisions (Fig. 3*A*, black and green curves), indicating greater precision. For this individual subject, the improvement was significant (*N* = 4,283 trials; σ_{mult} = 0.1657; σ_{aud} = 0.2565; σ_{vis} = 0.2172; *P* < 0.001, bootstrapping).

Next, we tested whether the measured multisensory performance matched the predictions from the optimal model in *Eq. 1* on the basis of measured unisensory performance (Fig. 3*B*, black and green solid lines). Importantly, this approach generates a distribution of parameters, obtained using bootstrapping, for both the measured and predicted multisensory precisions. The near match between the predicted and actual distributions shows that the multisensory improvement of the subject is close to optimal.

This improvement was consistent: for most subjects and conditions, the precision of multisensory decisions was improved relative to unisensory decisions (Fig. 3*C*, 21 of 30 points are below the line representing *x* = *y*; 8 were individually significant, bootstrap, *P* < 0.05). We further tested multisensory improvement at the population level by evaluating σ_{mult}/min(σ_{vis}, σ_{aud}) across subjects and reliabilities. We found this quotient to be on average significantly smaller than 1, showing that average performance on multisensory trials was significantly better than on unisensory trials (1-sided *t*-test, *P* = 0.015). On average, subjects were unbiased on multisensory trials (μ from the cumulative Gaussian fit was not significantly different from 0, *P* = 0.568). The unbiased choices on multisensory trials reflected an improvement over unisensory trials. Subjects showed idiosyncratic biases on auditory trials (although μ was not significantly different from 0, *P* = 0.09) and an overall weak bias toward low numerosities on visual trials (μ was significantly >0, *P* = 0.02). The low-numerosity bias on visual trials may indicate that the low signal-to-noise ratio caused subjects to miss individual flashes.

In addition, the performance of the population was frequently close to the optimal prediction (*Eq. 1*). To systematically compare measured versus optimal performance, we plotted the measured sensitivity on multisensory trials against the optimal sensitivity predicted by the Bayesian model (Fig. 4*A*). We found that for many subjects and reliabilities, the measured sensitivity lies in close proximity to the predicted sensitivity, supporting our hypothesis that subjects perform near-optimal multisensory integration of numerosity. The multisensory sensitivity in 22 of 30 subject-reliability combinations, when tested separately, did not differ significantly from the optimal prediction (bootstrap, *P* > 0.05). To ensure that subjects' decisions were informed by the numerosity of the stimulus and not only stimulus duration, we performed a probit regression. This measures the degree to which numerosity influenced decisions, taking into account that numerosity and duration are partially correlated (see methods). We found that in 72 of 75 conditions (each a unique combination of a single subject, modality, and reliability), the effect of numerosity on choice was significant and always in the same direction (choice regressor significantly >0, indicating that higher numerosity led to more “high-numerosity” choices; *P* < 0.05, Wald test). The effect of stimulus duration on choice, by contrast, was only inconsistently significant and had a variable influence on choice (long-duration trials led to significantly more high-numerosity choices in 42 conditions and to significantly more low-numerosity choices in 7 conditions; *P* < 0.05, Wald test). These results indicate that subjects were nearly always influenced by numerosity but were unable to entirely suppress duration information.

Because of the correlation between the numerosity and duration regressors, our finding of a significant numerosity effect might be sensitive to outliers in a subset of conditions. In each of the 75 conditions, we also performed a more robust nonparametric permutation test on the orthogonalized numerosity label (1,000 permutations) while keeping the original duration label (Potter 2005). We found that the numerosity coefficient is significant (*P* < 0.05) in the same 72 conditions as under the parametric Wald test described above. This analysis further confirms that subjects were nearly always influenced by numerosity.
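The permutation logic can be sketched as follows. This simplified Python version uses the covariance between choices and the orthogonalized numerosity regressor as its test statistic (an assumption made here for brevity; the analysis in the text refit the full probit model), but the structure of the null, shuffling only the orthogonalized regressor while leaving duration and choices intact, is the same.

```python
import numpy as np

def permutation_pvalue(num, dur, choice, n_perm=1000, rng=None):
    """Permutation test for a numerosity effect with duration held fixed.
    Numerosity is orthogonalized against duration; the null distribution
    comes from shuffling only the orthogonalized regressor."""
    rng = np.random.default_rng() if rng is None else rng
    num_c = num - num.mean()
    dur_c = dur - dur.mean()
    # remove the component of numerosity explained by duration
    num_orth = num_c - dur_c * (num_c @ dur_c) / (dur_c @ dur_c)
    choice_c = choice - choice.mean()
    stat = abs(num_orth @ choice_c)
    null = np.array([abs(rng.permutation(num_orth) @ choice_c)
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)
```

The add-one correction in the returned value keeps the p-value strictly positive, the standard convention for permutation tests.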

Although subjects consistently estimated numerosity better on multisensory trials, their improvement deviated slightly from optimality in some cases. It is therefore possible that a suboptimal strategy, and in particular one that does not take into account the reliability of the visual and auditory cue, could account for our data. To explore this possibility, we used Bayesian model comparison to determine whether two alternatives to the optimal model could better fit the trial-by-trial choices (see methods).

The max model (*Eq. 2*, Fig. 4*B*) corresponds to a probabilistic strategy in which subjects base the decision on the more reliable modality instead of weighting the individual estimates. This strategy is suboptimal but is very close to optimal if one modality is far more reliable than the other.

By contrast, the coin flip model (*Eq. 3*, Fig. 4*C*) does not take reliability into account and would provide evidence against a probabilistic number representation. According to this strategy, subjects randomly select the modality on which the decision is based. It predicts performance equal to that of the max model when the unisensory reliabilities are equal but falls behind the max model when they are unequal.

Figure 4*D* shows the evidence for the optimal and coin flip models relative to the max model. For both, the log Bayes factor is negative, indicating that the max model is the most likely one given our data. The optimal model is a close second with a log Bayes factor of −1.6, compared with −41.5 for the coin flip model. This implies that the coin flip model is extremely unlikely to account for our data.
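To put these numbers in perspective, a log Bayes factor converts directly into a likelihood ratio for the data under the competing models (natural logarithm assumed here; the excerpt does not state the base):

```python
import math

# Reported log Bayes factors, relative to the winning max model.
for name, log_bf in [("optimal", -1.6), ("coin flip", -41.5)]:
    ratio = math.exp(-log_bf)  # how much more likely the data are under max
    print(f"{name}: data ~{ratio:.3g}x more likely under the max model")
```

A factor of roughly 5 against the optimal model is modest evidence, whereas a factor on the order of 10^18 against the coin flip model is decisive.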

Finally, we restricted the model comparison to the 9 of 11 subjects for whom we varied the reliability on a trial-by-trial basis, preventing them from systematically ignoring the less reliable modality. For this restricted pool of subjects, the max model was again the most likely, followed by the optimal model with a log Bayes factor of −7.2 and the coin flip model with a log Bayes factor of −33.

## DISCUSSION

We found that human subjects can combine multiple sensory sources to refine their representation of numerosity, resulting in better performance when combining visual and auditory cues compared with using either cue alone. Moreover, subjects appear to either pick the most reliable modality on a given trial (max model) or to combine the two cues linearly with weights proportional to the reliability of each cue (optimal model). Both strategies require that human subjects encode the reliability of the numerosity estimates provided by each modality. This supports our hypothesis that subjects represent numerosity probabilistically and can exploit those representations when making decisions.
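The reliability weighting described above is simply inverse-variance weighting; a minimal sketch (function and parameter names are illustrative, not the study's notation):

```python
def reliability_weights(sigma_v: float, sigma_a: float) -> tuple[float, float]:
    """Optimal linear-combination weights: each cue is weighted in
    proportion to its reliability (inverse variance)."""
    r_v, r_a = 1.0 / sigma_v**2, 1.0 / sigma_a**2
    return r_v / (r_v + r_a), r_a / (r_v + r_a)

# Example: if vision is 4x more reliable, it receives 4x the weight.
w_v, w_a = reliability_weights(1.0, 2.0)
print(w_v, w_a)  # 0.8 0.2
```

Both the max and optimal strategies presuppose access to these reliabilities; only how the weights are used differs (winner-take-all versus linear averaging).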

Our results are broadly consistent with previous attempts to measure multisensory number estimates (Jordan and Baker 2011; Jordan et al. 2008; Vo et al. 2014). However, our study goes beyond existing work in that we collected responses from well-trained subjects and explicitly measured the effect of manipulating reliability, the strongest test of a probabilistic representation. We observed that reliability influences how stimuli are combined. This finding cannot be explained by existing quantitative behavioral models in which the approximate number system extracts and represents a scalar estimate (see Dehaene 2007 for review). Instead, our results are consistent with neural circuits that encode numerosity probabilistically, perhaps by representing an entire distribution over numerosities or, at the very least, the mean and the variance (inverse reliability) of that distribution. Indeed, our results were best captured by models that leverage the reliability of each cue and were largely incompatible with the coin flip model, which has no representation of reliability.

Subjects may have obtained an estimate of reliability from a number of sources. A commonly held assumption is that subjects acquire reliability estimates by inverting the generative model of the stimulus (Bishop 2006). A second possibility is that subjects estimate reliability using stimulus duration. Indeed, if the ease of detecting individual events makes duration estimates more reliable for brighter stimuli, leveraging duration to estimate reliability could be a sensible strategy. In our experiment and as reported by others (Ernst and Banks 2002; Knill and Saunders 2003), the source of the reliability estimate cannot be known precisely. Future experiments that independently vary multiple possible sources of reliability are needed to resolve this issue fully.

Although our data support the notion that humans use probabilistic representations for numerosity, we did find deviations from optimal integration. Across all subjects, the best model is the max model, according to which subjects pick the most reliable source on each trial. This strategy leads to near-optimal performance, but it is nonetheless suboptimal. This stands in contrast with other reports of optimal multisensory integration in other domains (Ernst and Banks 2002; Fetsch et al. 2012; Jacobs 1999; Raposo et al. 2012; van Beers et al. 1996). Several factors may have contributed to our finding. First, the suboptimality was only revealed by the Bayesian model comparison, which previous studies have rarely performed. If our conclusions had relied only on the analysis of behavioral thresholds, as most studies have done, it would have been tempting to conclude that our subjects were indeed optimal (Figs. 3 and 4*A*). This argues that future multisensory studies should include the max model as a candidate strategy. Second, there was substantial subject-to-subject variability: the performance of some subjects was better described by the max model and that of others by the optimal model. This variability in both performance and adherence to the optimal prediction is common in multisensory tasks (Rosas et al. 2005, 2007; Zalevski et al. 2007). The overall Bayesian model comparison weights each subject according to the amount of data we were able to collect from him or her. Because we were able to collect more data from some subjects than from others, the winning model depends to some degree on which strategy the subjects with more data happened to follow. Third, subjects cannot be optimal without training: they must first learn the statistics of the sensory input to which they are exposed. Our task is particularly challenging because it involves a comparison against an arbitrary reference (= 10).
It is possible that some of our subjects did not receive sufficient training to achieve optimality. Finally, some subjects may have experienced “interference” because the auditory and visual stimuli were played asynchronously. We used stimulus rates for which a majority of subjects cannot detect audiovisual synchrony (Burr et al. 2009), but there may have been a few subjects for whom the asynchrony was apparent and caused confusion.

Our findings suggest that numerosity is represented probabilistically but leave unanswered how such probabilistic representations are encoded in the brain. Candidate areas are in the frontal and parietal cortices, both of which have been implicated in supporting numerosity estimates in humans (Castelli et al. 2006; Naccache and Dehaene 2001) and monkeys (Nieder and Miller 2004; Sawamura et al. 2002). One possibility is that neural circuits in these regions use a code known as a probabilistic population code (Ma et al. 2006). Probabilistic population codes have been shown to be particularly well-suited to perform optimal multisensory integration, which can be implemented by a simple linear combination of neural activities (Fetsch et al. 2012; Ma et al. 2006). Implementing multisensory integration with probabilistic population codes requires the existence of two separate unisensory populations whose activity is combined to drive a multisensory population. There is experimental evidence for all three types of populations. Recently, Nieder (2012) tested trained monkeys in a multimodal comparison task and found groups of neurons in parietal and frontal areas that encoded the number of auditory pulses, the number of visual items, or both. Whether the multimodal neurons add their input linearly remains to be tested, although this prediction appears to hold in other domains, such as the integration of visual and vestibular information for estimating heading (Fetsch et al. 2012; Morgan et al. 2008).

Our observation that the brain leverages probabilistic representations for numerosity decisions opens the possibility that other kinds of mathematical operations might likewise rely on probabilistic computations. For example, the approximate number system can be used not only to estimate numerosity but also to sum quantities together. Recent work (Beck et al. 2011) has shown that neural circuits with divisive normalization, a commonly found nonlinearity in neural circuits (Carandini and Heeger 2012), can implement optimal addition with probabilistic population codes. However, whether summing and increasingly complex computations rely on probabilistic inference remains an open and experimentally tractable question.

## GRANTS

A. K. Churchland was supported by National Science Foundation (NSF) Grant NSF1111197 and A. Pouget by NSF1109366.

## DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the authors.

## AUTHOR CONTRIBUTIONS

I.K., A.P., and A.K.C. conception and design of research; I.K., A.B., and A.K.C. analyzed data; I.K., A.B., A.P., and A.K.C. interpreted results of experiments; I.K. and A.K.C. prepared figures; I.K. and A.K.C. drafted manuscript; I.K., A.P., and A.K.C. edited and revised manuscript; I.K., A.B., A.P., and A.K.C. approved final version of manuscript; A.B. performed experiments.

## ACKNOWLEDGMENTS

We acknowledge Michael Ryan for help in collecting human subject data.

- Copyright © 2015 the American Physiological Society