## Abstract

We investigated the representation of visual inputs by multiple simultaneously recorded single neurons in the human medial temporal lobe, using their firing rates to infer which images were shown to subjects. The selectivity of these neurons was quantified with a novel measure. About four spikes per neuron, triggered between 300 and 600 ms after image onset in a handful of units (7.8 on average), predicted the identity of images far above chance. Decoding performance increased linearly with the number of units considered, peaked between 400 and 500 ms, did not improve when considering correlations among simultaneously recorded units, and generalized to very different images. The feasibility of decoding sensory information from human extracellular recordings has implications for the development of brain–machine interfaces.

## INTRODUCTION

The information from images captured by the retina is transmitted as a stream of binary pulses to the visual cortex in the occipital lobe. Visual neurons encode basic properties of inputs, such as orientation, spatial location, spatial frequency, and wavelength of the incident light. After several further intervening stages, neurons in the inferior temporal (IT) cortex—the final purely visual processing region—respond to individual images as well as to categories of stimuli, such as faces, objects, bent paperclips, and other complex visual stimuli (Brincat and Connor 2004; Desimone et al. 1984; Gross et al. 1969, 1972; Hung et al. 2005; Kiani et al. 2005; Logothetis and Sheinberg 1996; Logothetis et al. 1994; Miyashita 1988; Perrett et al. 1982, 1992; Sato et al. 1980; Schwartz et al. 1983; Tanaka 1996; Young and Yamane 1992). Functional brain imaging of the fusiform gyrus along the ventral pathway (Haxby et al. 2001; Kanwisher et al. 1997) as well as clinical lesion data (Damasio et al. 2000; Farah 1990) support the inference that such neuronal representation is the basis of visual recognition and categorization. These structures project to the medial temporal lobe (MTL), including the hippocampus and the amygdala (Cheng et al. 1997; Saleem and Tanaka 1996; Suzuki 1996). It is from the human MTL that our group has recorded individual neurons responding to pictures of individuals, landmarks, and animals (Fried et al. 1997; Kreiman et al. 2000a,b; Quian Quiroga et al. 2005). About one third of these responsive neurons were selectively activated by completely different views of a given individual or object and in some cases even by letter strings with their names (Quian Quiroga et al. 2005).

How information is represented by a population of neurons can be quantified in an objective manner by inferring the stimulus from its associated firing pattern (Abbott 1994; Abbott et al. 1996; Brown et al. 2004; Keysers et al. 2001; Rieke et al. 1996; Salinas and Abbott 1994; Warland et al. 1997). Such decoding quantifies how much stimulus information can be read out from a neuronal population. Here we apply a linear classifier to determine, from the activity of simultaneously recorded neurons in the MTL, which picture was shown to the patient on a trial-by-trial basis. We also determine how these predictions develop in time, how they depend on the number of neurons, and how they depend on correlations among them.

## METHODS

### Subjects and recordings

The data come from 34 sessions in 11 patients with pharmacologically intractable epilepsy (all right-handed, four males, 17 to 49 yr old). Extensive noninvasive monitoring did not yield concordant data corresponding to a single resectable epileptogenic focus, so the patients were implanted with chronic depth electrodes for 7–10 days to determine the seizure focus for possible surgical resection (Fried et al. 1997). Here we report data from sites in the hippocampus, amygdala, entorhinal cortex, and parahippocampal gyrus. Fifteen of these sessions were previously analyzed for invariance of visual representation (Quian Quiroga et al. 2005); besides using them for decoding in the present study, we used them to test the invariance results further with a decoding approach. For this, we studied whether it was possible to discriminate between the different pictures showing an invariant representation and even to predict presentations of pictures that the decoding algorithm had not seen before. In other words, we tested whether, based on invariance, a decoding algorithm was able to generalize.

All studies conformed to the guidelines of the Medical Institutional Review Board at the University of California at Los Angeles and the Institutional Review Board at Caltech. The electrode locations were based exclusively on clinical criteria and were verified by magnetic resonance imaging (MRI) or by computed tomography coregistered to preoperative MRI. Each electrode probe had a total of nine microwires at its end, eight active recording channels, and one reference. The differential signal from the microwires was amplified using a 64-channel Neuralynx system (Tucson, AZ), filtered between 1 and 9,000 Hz and sampled at 28 kHz. Each recording session lasted about 30 min.

Subjects lay in bed, facing a laptop computer on which pictures were shown. The images covered about 1.5° and were displayed six times each in pseudorandom order for 1 s. Images were photos of animals, landmarks, celebrities (partially chosen according to the patients' preferences), and photos of people and places unknown to the patients. More details about the stimulus set are available from Quian Quiroga et al. (2005). The interstimulus interval (ISI) was also randomized, with the minimum ISI equal to 1.5 s. To enforce attention to the picture presentations, subjects had to indicate after stimulus offset whether the picture contained a face or something else by pressing the “Y” or “N” key, respectively. As we will see in the following sections, neuronal responses were very selective and therefore cannot be explained by the performance of this simple categorization task.

### Spike detection and sorting

From the continuous wide-band data, spike detection and sorting was accomplished using a stochastic algorithm (Quian Quiroga et al. 2004). (A Matlab implementation of the algorithm as well as exemplary data are available from www.vis.caltech.edu/∼rodri.) After band-pass filtering between 300 and 3,000 Hz, an automatic threshold was set at

*Thr* = 5σ_{n}; σ_{n} = median{|*x*|/0.6745}  (1)

where *x* is the band-pass-filtered signal and σ_{n} is an estimate of the standard deviation of the background noise. Note that taking the standard deviation of the signal (including the spikes) could lead to unreliable threshold values, especially in cases with high firing rates and large spike amplitudes. In contrast, by using the estimation based on the median, the interference of the spikes is diminished (Quian Quiroga et al. 2004). We heuristically found the criterion of 5σ_{n} to be optimal for our data. Although this value is relatively high and it increases the probability of missing low-amplitude spikes, it minimizes the number of false positives (i.e., the detection of noise crossing the threshold by chance). False positives can be easily discriminated from large-amplitude spikes after sorting, but they contaminate multiunit clusters; i.e., those clusters comprising activity from several neurons that could not be split further due to their low signal-to-noise ratio.
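The median-based noise estimate can be sketched as follows (a minimal Python illustration, not the authors' Matlab implementation; the function name is ours):

```python
import numpy as np

def detection_threshold(x, k=5.0):
    """Robust spike-detection threshold (Quian Quiroga et al. 2004).

    sigma_n = median(|x|) / 0.6745 estimates the SD of the background
    noise while discounting large spikes, since the median is far less
    sensitive to them than the raw SD; the threshold is k * sigma_n
    (k = 5 in the text).
    """
    x = np.asarray(x, dtype=float)
    sigma_n = np.median(np.abs(x)) / 0.6745
    return k * sigma_n

# Gaussian noise with SD 1: the robust estimate stays near 1 even
# when sparse large "spikes" are superimposed on the trace.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 100_000)
noise[::1000] += 50.0                 # add sparse large events
print(detection_threshold(noise))     # close to 5
```

The factor 0.6745 is the median of the absolute value of a standard Gaussian, which is what makes the estimator unbiased for Gaussian background noise.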

Once spikes are detected, the algorithm uses the wavelet transform for extracting features of the spike shapes that are used as inputs to the clustering algorithm. This gives a dimensionality reduction that outperforms results using principal-component analysis (PCA) or the whole spike shape (Quian Quiroga et al. 2004). Clustering—i.e., assigning spikes with similar shapes to the same unit—is done using superparamagnetic clustering, a stochastic method that does not assume any particular distribution of the clusters. Superparamagnetic clustering groups the data into clusters as a function of a single parameter, the temperature, which can be changed by the user if the automatic clustering is not satisfactory. Sometimes, clusters can be chosen from different temperatures and spikes not assigned to any of the clusters can be merged to the nearest cluster.

After sorting, the clusters were classified into single units or multiunits. This was done based on *1*) the spike shape and its variance, *2*) the ratio between the spike peak value and the noise level, *3*) the ISI distribution of each cluster, and *4*) the presence of a refractory period for the single units (i.e., <1% spikes within <3-ms ISI).

### Data analysis

The response to a picture was defined as the median number of spikes across trials between 300 and 1,000 ms after stimulus onset. Similarly, the baseline for each picture was the median spike count between 1,000 and 300 ms before stimulus onset. Given our relatively low number of trials per picture, we used the median instead of the mean to decrease the effect of outliers, such as a spontaneous burst of several spikes in one of the trials. A unit was considered responsive if the activity to at least one picture fulfilled two criteria (Quian Quiroga et al. 2005): *1*) the median number of spikes was larger than the average baseline (across pictures) plus 5 SDs; and *2*) the median number of spikes was at least two. Responsive pictures were those that elicited a significant response in at least one unit.
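The two-part responsiveness criterion can be sketched as follows (a Python sketch; the array layout and function name are ours, and we read "5 SDs" as the SD across the per-picture baseline medians, which is one plausible reading of the text):

```python
import numpy as np

def is_responsive(resp_counts, base_counts):
    """Responsiveness criterion of the text (sketch).

    resp_counts, base_counts : arrays of shape (n_pictures, n_trials)
    with spike counts 300-1000 ms after and 1000-300 ms before
    stimulus onset, respectively. A unit is responsive if, for at
    least one picture, the median response exceeds the average
    baseline across pictures plus 5 SDs, and that median is at
    least two spikes.
    """
    med = np.median(resp_counts, axis=1)       # per-picture response medians
    base_med = np.median(base_counts, axis=1)  # per-picture baseline medians
    thr = base_med.mean() + 5.0 * base_med.std()
    responsive_pics = (med > thr) & (med >= 2)
    return bool(responsive_pics.any()), responsive_pics
```

The second condition (median of at least two spikes) guards against units whose baseline is so low that a single stray spike would otherwise count as a "response."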

In a first stage, for each session separately we predicted which of the responsive pictures was shown in each trial using only the firing of the responsive neurons, assuming that it would be possible to predict the other pictures if more neurons responding to them had been recorded. Note that the selection of responsive units was done automatically using the above-mentioned criterion and can be seen as the first step of the decoder. In a second stage, for each session we used all simultaneously recorded neurons to predict all the pictures shown to the subject. This was done to verify that our results were not due to a particular definition of responsiveness. Out of an average of 88.4 pictures shown in each session (SD: 11.9, range: 57–114), the average number of responsive pictures per session was 15.9 (SD: 11.2, range: 4–50). Six extra sessions had fewer than three responsive pictures and were not included in the analysis. The number of responsive units was on average 7.8 (SD: 4.5, range: 2–19).

### Selectivity measure

Responses of MTL neurons were very selective in the sense that each unit fired to only very few of the pictures shown, based on our criterion for responsiveness defined earlier. To verify that this was not merely a consequence of choosing a large threshold for defining responses, we quantified selectivity with a novel index, *S*. Figure 1 illustrates the procedure with simulated responses. We simulated a neuron with 100 uniformly distributed random firings (Fig. 1*A*, *left*); a second simulated neuron was obtained by multiplying 99 of these responses by 1/3 so that only a single response retained its original value (Fig. 1*A*, *right*). If we denote by *f*_{i} the firing of a given neuron to the stimulus *i* (*i* = 1, …, *N*), we can define the normalized number of “responses” *R*(*T*) as the relative number of stimuli with firing larger than a threshold *T*

*R*(*T*) = (1/*N*) ∑_{i=1}^{N} θ(*f*_{i} − *T*)  (2)

with θ(*x*) = 1 for *x* > 0; θ(*x*) = 0 for *x* ≤ 0. Figure 1*B* shows the normalized number of “responses” *R*(*T*) for *M* = 1,000 threshold values in equal steps between the minimum and the maximum firing (*f*_{min} and *f*_{max}, respectively). The area under this curve (*A*) is given by

*A* = (1/*M*) ∑_{j=1}^{M} *R*(*T*_{j})  (3)

where *T*_{j} = *f*_{min} + *j*[(*f*_{max} − *f*_{min})/*M*] are equally distant thresholds between *f*_{min} and *f*_{max}. This area will be close to 0.5 for a uniform distribution of random firings (Fig. 1*B*, *left*), whereas it will be much smaller when only one significant response exists (Fig. 1*B*, *right*). We now define the *selectivity index* (*S*) as

*S* = 1 − 2*A*  (4)

which is close to 0 for uniformly distributed random firings and approaches 1 the more selective the neuron is. For an inhibitory neuron that responds significantly to all but one stimulus, *S* approaches a minimum value of −1 (see Supplementary Material). The solid black curve in Fig. 1*C* shows the selectivity index *S* as a function of the number of responses. More details on the selectivity index *S*, including alternative definitions and its behavior for inhibitory neurons, for neurons with very low firing, and for binary neurons, are given in the Supplementary Material.

For comparison, we also calculated selectivity using a measure proposed by Rolls and Tovee (1995). They defined selectivity (or, more precisely, breadth of tuning) as the ratio

*a* = (∑_{j=1}^{n} *r*_{j}/*n*)² / (∑_{j=1}^{n} *r*_{j}²/*n*)  (5)

where *r*_{j} is the response to the *j*th stimulus. This measure yields 1 for equal values of *r*_{j} and approaches 0 for very selective units. The dashed black trace in Fig. 1*C* shows the values of the selectivity index *a* as a function of the number of responses. Note that in this case the selectivity index *a* spans a limited range of values. Moreover, it shows a nonlinear dependence on the number of responses and gives the same value of about 0.65 for the case of one single response and for 66 responses. On the contrary, the selectivity index *S* of *Eq. 4* decreases linearly with the number of responses and can clearly distinguish these two cases.
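Both selectivity measures can be sketched as follows (a minimal Python illustration of Eqs. 2–5; function names are ours):

```python
import numpy as np

def selectivity_S(f, M=1000):
    """Selectivity index S = 1 - 2A (Eqs. 2-4): A is the area under
    R(T), the fraction of stimuli whose firing exceeds each of M
    thresholds spaced evenly between the minimum and maximum firing."""
    f = np.asarray(f, dtype=float)
    fmin, fmax = f.min(), f.max()
    T = fmin + np.arange(1, M + 1) * (fmax - fmin) / M  # thresholds T_j
    R = (f[None, :] > T[:, None]).mean(axis=1)          # R(T_j), Eq. 2
    return 1.0 - 2.0 * R.mean()                         # Eqs. 3 and 4

def breadth_a(r):
    """Rolls and Tovee (1995) breadth-of-tuning measure (Eq. 5):
    1 for equal responses, near 0 for a single strong response."""
    r = np.asarray(r, dtype=float)
    n = r.size
    return (r.sum() / n) ** 2 / (np.square(r).sum() / n)

# The two simulated neurons of Fig. 1A (our toy reconstruction):
rng = np.random.default_rng(0)
uniform = rng.uniform(0.0, 1.0, 100)      # 100 uniform random firings
selective = np.full(100, 1.0 / 3.0)       # 99 weak responses...
selective[0] = 1.0                        # ...and one strong one
```

For the uniform neuron `selectivity_S` comes out near 0, and for the selective neuron near 1, matching the behavior described in the text.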

### Decoding

For each session and in separate runs, the numbers of spikes between 300 and 600, 300 and 1,000, and 300 and 2,000 ms for each trial were used as inputs to the decoding algorithm. Trials were represented as points in an *m*-dimensional space, each coordinate showing the number of spikes for each of the (simultaneously recorded) *m* responsive units. One at a time, the picture shown in each trial was predicted based on the distribution of all the remaining trials (leave-one-out decoding). The result was averaged by considering each of the six trials left out.

Trials to be decoded were classified using a Fisher linear discriminant algorithm (Duda et al. 2001). To simplify, let us consider the case of two classes to be decoded (i.e., two different pictures) given the firing of *m* neurons. The first step is a dimensionality reduction by projecting the *m*-dimensional measurements (i.e., the firing of the *m* neurons in each trial) onto a line where the samples of each class are optimally separated. The direction of this line (*v̄*) is the one that maximizes the ratio of the between-class over the within-class distances. If we denote by *m̄*_{1} and *m̄*_{2} the centers of the clusters of points corresponding to class 1 and class 2, respectively, the within-class scatter matrix is given by

*S*_{W} = ∑_{i=1,2} ∑_{*x̄*∈class *i*} (*x̄* − *m̄*_{i})(*x̄* − *m̄*_{i})^{T}  (6)

The optimal direction that separates the points of class 1 and class 2 can be demonstrated to be (for details see Duda et al. 2001)

*v̄* = *S*_{W}^{−1}(*m̄*_{1} − *m̄*_{2})  (7)

The second and final step is to assign the trial to be predicted to one of the two classes, for example, by taking the one whose center has the minimum Euclidean distance in the direction of *v̄*. This procedure can be generalized to multiple classes (in our case corresponding to the number of images in each session, *C*): instead of looking for the optimal separating line, one looks for an optimal (*C* − 1)-dimensional hyperplane, and the trials to be predicted are assigned to the class whose center is closest in that hyperplane. Although in principle the dimensionality reduction performed with Fisher linear discriminants should improve decoding performance, in our case similar results were obtained with a Naïve Bayesian classifier and a Nearest Neighbor classifier (see Supplementary Material).

Decoding results were plotted in the form of “confusion matrices.” The values on a given row *i* and column *j* of a confusion matrix represent the (normalized) number of times a presentation of picture *i* is predicted to be picture *j*. If the decoding is perfect, *i* = *j* for all trials and the confusion matrix should have entries equal to one along the diagonal and zero everywhere else. Performance at chance levels should be reflected in a matrix in which each entry has equal probability 1/*n*, where *n* represents the number of pictures. Decoding performance was quantified as the percentage of correct predictions, which is the mean of the diagonal of the confusion matrix.
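The leave-one-out decoding and confusion matrix can be sketched as follows (a Python illustration using scikit-learn's linear discriminant classifier as a stand-in for the Fisher procedure above; note that scikit-learn assigns classes by a Gaussian Bayes rule rather than by Euclidean distance along *v̄*, so this is an approximation, and the function name is ours):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

def decode_session(X, y):
    """Leave-one-out decoding: X holds spike counts of shape
    (n_trials, n_units), y the picture shown in each trial.
    Returns the row-normalized confusion matrix."""
    labels = np.unique(y)
    conf = np.zeros((labels.size, labels.size))
    for train, test in LeaveOneOut().split(X):
        clf = LinearDiscriminantAnalysis().fit(X[train], y[train])
        pred = clf.predict(X[test])[0]
        i = int(np.searchsorted(labels, y[test][0]))
        j = int(np.searchsorted(labels, pred))
        conf[i, j] += 1                  # actual picture i decoded as j
    return conf / conf.sum(axis=1, keepdims=True)
```

Decoding performance is then the mean of the diagonal, `np.trace(conf) / conf.shape[0]`.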

### Statistical analysis

For assessing statistical significance of the decoding results, two separate tests were performed. First, we tested whether the difference between the percentage of correct predictions and chance performance (one over the number of pictures) across sessions was larger than zero using a *t*-test. Second, we assessed the significance of decoding performance for each session separately. Because the outcomes of the predictions of each picture presentation can be regarded as a sequence of Bernoulli trials, the number of successes in a sequence of trials follows the binomial distribution (Soong 2004). Given a probability *p* of getting a hit by chance (*p* = 1/*M*, where *M* is the number of responsive pictures), the probability of getting *k* hits by chance in *n* trials is given by

*P*(*k*) = *C*(*n*, *k*) *p*^{k}(1 − *p*)^{n−k}

where *C*(*n*, *k*) = *n*!/[*k*!(*n* − *k*)!] is the number of possible ways *k* hits can happen in *n* trials. From this we can calculate a *p*-value by adding up the probabilities of getting *k* or more hits by chance [*p*-value = ∑_{j=k}^{n} *P*(*j*)].
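The tail sum described above can be sketched directly (a minimal Python illustration; the function name is ours):

```python
from math import comb

def binomial_pvalue(k, n, p):
    """P-value of the Bernoulli test: probability of observing k or
    more hits by chance in n trials when each hit has probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
```

For long sessions, `scipy.stats.binom.sf(k - 1, n, p)` computes the same tail probability more stably.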

### Information analysis

Information theory offers an alternative quantification of how much information about the stimuli is contained in the firing of the neurons. This is usually done by calculating the mutual information between the stimulus and the neuronal responses

*I* = ∑_{s,r} *p*(*r*, *s*) log_{2} [*p*(*r*, *s*)/(*p*(*r*)*p*(*s*))]  (8)

Here *s* is the stimulus, *r* is the response, *p*(*r*, *s*) is their joint distribution, and *p*(*r*) and *p*(*s*) represent the marginal distributions (Cover and Thomas 1991). If the logarithm is taken in base 2, the information is measured in bits and it specifies how many stimuli *M* can be encoded by the population of neurons (*M* = 2^{I}). In our case, we estimated information using the decoding outcomes from the confusion matrix by calculating the mutual information between the actual stimuli and the decoded stimuli (i.e., between the rows and the columns of the confusion matrix). The maximum information that can be extracted is limited by the number of stimuli *M* and is given by *I*_{max} = log_{2} *M*.
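Applied to a confusion matrix, Eq. 8 can be sketched as follows (a Python illustration; the function name is ours, and we assume every stimulus was presented equally often, as in the experiment):

```python
import numpy as np

def confusion_information(conf):
    """Mutual information (Eq. 8, in bits) between actual and decoded
    stimuli, computed from a row-normalized confusion matrix under a
    uniform stimulus distribution p(s) = 1/M."""
    M = conf.shape[0]
    joint = np.asarray(conf, dtype=float) / M   # p(s, s_decoded)
    ps = joint.sum(axis=1, keepdims=True)       # p(s) = 1/M
    psd = joint.sum(axis=0, keepdims=True)      # p(s_decoded)
    outer = ps * psd
    nz = joint > 0                              # skip zero cells in the log
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))
```

Perfect decoding of *M* pictures gives log₂ *M* bits; a confusion matrix at chance gives 0 bits.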

## RESULTS

In 34 experimental sessions with 11 patients, we recorded from 1,547 MTL units (552 single units and 995 multiunits; with an average of 45.5 units per session). Of these units, 265 (17.1%; 131 single units and 134 multiunits) had a significant response (see methods) to at least one picture. All these responses were very selective: on average only 3.3% of the presented pictures (range: 0.9–22.8%) evoked a significant activation. The distribution of responsive and nonresponsive units per session is shown in Fig. 2.

### Single-cell responses

Figure 3 shows five simultaneously recorded hippocampal units that selectively fired to at least one of the images. In this session, 19 out of a total of 53 simultaneously recorded units (including the five shown in Fig. 3) were responsive, and altogether they fired to 32 out of the 114 pictures viewed by the patient. The firing to all 32 pictures that elicited responses is shown in Figure S5. A common characteristic across all cells was that responses started 300 ms after stimulus onset and were mainly of three types: *1*) they occurred between 300 and 600 ms (e.g., picture 58 for unit 3); *2*) they lasted up to 1,000 ms (picture 51 for unit 1); and *3*) they continued for up to 2,000 ms after stimulus onset, i.e., even after the stimulus was removed from view (picture 20 for unit 3). All five units had very low baseline activities and a sharp increase in their firing after image onset. Based on their spike shape characteristics and interspike-interval distributions, units 3 and 5 were classified as multiunits and units 1, 2, and 4 as putative single units (see methods). The single units were nearly silent during baseline (average <0.01 spike/s) and fired with up to 40 spikes/s to only one or two pictures. The multiunits reached similar activation levels, but had a higher baseline activity (0.12 spike/s).

Units 1 and 2 were recorded from a single microwire and their differential activations could be separated only after appropriate spike sorting. Properly classified, the first unit responded to two basketball players and the second one to two landmark buildings. This stresses the importance of optimal spike sorting because, otherwise, the two units would have been grouped together as a less-selective multiunit. Likewise, units 3 and 4 were recorded from a single microwire. Unit 3 responded to the picture of a celebrity, an unknown person, and two animal pictures, and was classified as a multiunit (the spike shapes are displayed in Figure S4). We cannot discern whether the activity of this unit is composed of several, much more selective, single units.

Figure 4 illustrates the spike sorting of the channel containing units 1 and 2 of Fig. 3 (the red and green spikes corresponding to clusters 2 and 3, respectively). The top subplot is a 60-s segment of the continuous data and the threshold (*Thr*) used for spike detection (in red). The leftmost bottom panel shows the projection of the spike shapes onto the first two wavelet coefficients chosen by the algorithm. Note the presence of four clusters, three of which had quite an elongated shape. The remaining bottom panels illustrate the corresponding spike shapes of the sorted units, including the number of events in each cluster (e.g., 592 for cluster 2). There were 13 detected events not assigned to any cluster (not shown). The first (blue) cluster corresponds to a multiunit. Clusters 2, 3, and 4 were identified as single units, although for cluster 2 there were a few spikes with a different spike shape. Cluster 4 contained a total of only 48 spikes and had no significant responses. On the other hand, clusters 1, 2, and 3 had strong responses elicited by a different set of pictures. Responses of clusters 2 and 3 are shown in Fig. 3 (units 1 and 2, respectively) and those of cluster 1 are shown in Figure S5 (unit 6).

### Sparseness

The responses of Fig. 3 were very selective in the sense that each unit fired to only one to four of the 114 pictures shown, based on our criterion for responsiveness defined earlier. The selectivity index *S* (see methods) for the five units of Fig. 3 is represented in Fig. 5*A*. On the *left*, the median number of responses (across 6 trials) for all 114 images is plotted. Clearly, all five units responded to very few pictures. Figure 5*A*, *right* displays the relative number of pictures eliciting responses as a function of a variable threshold. For the five units, *S* had a value >0.9, compatible with a sparse representation.

Figure 5*B* summarizes the distribution of *S* values for the entire population of responsive and nonresponsive neurons. For this plot, however, we used only nonresponsive neurons with at least one response with a median (across trials) of two or more spikes (and without any response crossing the threshold of 5 SDs over the mean baseline activity). This was to avoid spurious results due to very low number of spikes (see Supplementary Material). The median of the distribution for the 265 responsive units was 0.71. As expected, the 527 nonresponsive units were nonselective and their *S* values (median: 0.26) were significantly lower than those for responsive units (*P* indistinguishable from zero, *t*-test), emphasizing the selectivity of the responsive cells. Using the measure of Rolls and Tovee, the median of *a* over the whole population of responsive cells was 0.39 (Fig. 5*C*). In agreement with the results for *S*, values for the nonresponsive units (median: 0.80) were significantly higher (*P* indistinguishable from zero, *t*-test). However, in contrast to *S* the distribution for the responsive units looks bimodal. Such behavior can be attributed to the inherent nonlinearity of *Eq. 5*.

### Population decoding

For each recording session, we predicted in each trial which stimulus was seen by the patient using the other five trials to train the decoder (leave-one-out decoding; see methods). In a first stage, we predicted the presentation of pictures eliciting responses. Figure 6*A* shows the decoding performance for the 32 responsive pictures of the session corresponding to Fig. 3 in the form of a confusion matrix (see methods). The inputs to the decoding algorithm were the number of spikes between 300 and 1,000 ms of all 19 responsive units for this session. The percentage of hits (mean across the diagonal) was 35.4%, which is significantly better than chance (1 of 32 images, i.e., 3.1%) with *P* < 10^{−49} (Bernoulli test; see methods). Considering the number of spikes either between 300 and 600 or between 300 and 2,000 ms after image onset gave very similar results (32% in both cases).

The best performance was achieved when decoding images that elicited a sparse response. For example, for the four pictures shown in Fig. 6*A* (*top*), four of six presentations of each picture were correctly decoded (66% performance). The presence of the spider (picture 20) could be predicted from the activity of unit 3 in Fig. 3, which also fired to the other three pictures but not as strongly. It was possible to infer in which trials picture 33 (actress Pamela Anderson) was present due to the firing activity of unit 5 because this unit did not respond to any other picture. Photos of the tower of Pisa (picture 86) could be accurately predicted from unit 2. In general, if a unit responds to several pictures, these may be confused by the decoding algorithm. A typical case was the confusion between pictures 29 and 26, which both elicited spikes only in unit 8 (see Figure S5). This confusion could have been resolved if additional selective units had been recorded. Results were basically the same when considering all 114 presented pictures and all 53 recorded neurons for this session. In this case, the percentage of hits (10%) was significantly better than chance (1/114 = 0.9%) with *P* < 10^{−50} (Bernoulli test; see methods).

### Dependence on the number of neurons

Next we studied how decoding performance varied with the number of neurons. For this, we calculated the decoding performance using different combinations of *k* (*k* = 1, …, *n*) out of *n* neurons. If there were >30 possible combinations, we randomly picked 30 of them. As illustrated in Fig. 6*B*, the decoding performance for this session increased linearly with the number of neurons. Indeed, a linear fit (thin red line) gave an *R*^{2} value of 0.99 (*R*^{2} = 1 for a perfect linear fit). A qualitatively similar result was obtained when considering all pictures and all neurons, as shown in Fig. 6*C*. In this case a linear fit also gave an *R*^{2} value of 0.99. For the other recording sessions, the decoding performances using responsive units to predict the presentation of responsive pictures generally increased linearly with the number of units considered. Linear fits yielded *R*^{2} values >0.9 for 28 of 34 sessions. Considering all neurons and all images, linear fits yielded *R*^{2} values >0.9 for 21 sessions.

### Results for all sessions

Figure 7 summarizes the average decoding performances across all sessions. In this case the presentations of the responsive images were predicted using the firing of responsive units in the 300- to 600-, 300- to 1,000-, and 300- to 2,000-ms poststimulus time windows. The horizontal line marks the average chance level. Decoding performances were significantly better than chance with *P* < 10^{−12} (*t*-test) for each of the three time windows analyzed. Considering each session separately, in 31, 33, and 32 of 34 sessions the decoding performance was significantly better than chance for the 300- to 600-, 300- to 1,000, and 300- to 2,000-ms windows, respectively (*P* < 0.05, Bernoulli test; see methods). There was a significant improvement when taking the 300- to 1,000-ms window in comparison with the 300- to 600-ms window (*P* < 0.05; *t*-test). In general, predictions using the 300- to 600-ms window were nearly as good as those with the other two larger windows. This is remarkable, considering that in this 300-ms-long interval an average of 4.14 spikes (SD: 4.47; average number of spikes between 300 and 600 ms poststimulation over all responsive units for their corresponding responsive pictures) were sufficient to specify which image was present. Results were similar when considering all neurons and all pictures. In fact, in 33 of 34 sessions the decoding performance was significantly better than chance when considering the 300- to 1,000-ms window (*P* < 0.05; Bernoulli test; see methods). Interestingly, decoding performances were statistically the same when using all units or only the responsive ones to predict the presentations of all pictures (*P* = 0.69, *t*-test). This result stresses the fact that nonresponsive units did not carry additional information for decoding.

As described earlier, units recorded from the same microwire may show different selectivities once their firings are separated after spike sorting. In line with this observation, spike sorting should improve decoding because, if the spikes of two neurons with different responses are mixed, the decoder will tend to confuse them. To test this, for each session we compared the decoding performances with those obtained without spike sorting; i.e., we considered all units from a responsive channel as a single multiunit. As expected, we found that spike sorting significantly improved the decoding performance (*P* < 0.01, *t*-test). The mean improvement in read-out was 9.15% (SD: 17.81%), reaching up to 50% for some sessions.

We also quantified decoding performances using the mutual information between the actual and the decoded pictures in the confusion matrices (see methods). Overall, the average information per session was 1.96 bits (range: 0.76–3.51), slightly more than half the information that could have been recovered (given by the logarithm of the number of pictures). By dividing the total information by the number of responsive neurons in each session, we obtained a mean value (across all sessions) of 0.25 bits per neuron.

### Dependence on the time from stimulus onset

Next we studied the time profile of the decoding performance. Because chance levels depend on the number of pictures to be predicted, and each session had a different number of pictures eliciting responses, we used a normalized measure to average results across all sessions. For each session we defined the normalized decoding performance as

*D*_{n} = (*D* − chance)/(1 − chance)  (9)

where *D* is the average relative number of hits (i.e., the average of the values along the diagonal of the confusion matrix) and chance is 1 over the number of responsive pictures. *D*_{n} = 0 whenever the performance is at chance and *D*_{n} = 1 for perfect decoding. Figure 8 gives the time dependence of decoding averaged across all sessions. The band shows the 95% confidence intervals. Decoding was performed using a sliding window of 100-ms width and steps of 50 ms. Decoding performance was significantly different from zero (*P* < 0.05) between 300 and 900 ms after stimulus onset. In particular, it peaked between 400 and 500 ms, in agreement with the fact that the time window between 300 and 600 ms contained most of the selective spikes.

### Effect of noise correlations

To determine whether trial-to-trial correlations between our simultaneously recorded neurons (an average of 7.79 responsive units per session) carried any extra information, we implemented an approach similar to the Δ*I*_{shuffle} defined by Averbeck and colleagues (2006). For this, we used the same decoding strategy but pseudorandomly permuted the trials corresponding to the same picture presentation, independently for each neuron. That is, the response of, say, the *i*th unit to the first presentation of some image was matched to the response of the *j*th unit to the second presentation of the same image and so on. Then we tested whether the original decoding performance was larger than that of each of the 99 shuffled surrogates generated in this way, thus giving a significance level of *P* < 0.01. Figure 9 shows the original decoding performances and those for the 99 shuffled surrogates, using the number of spikes in the 300- to 1,000-ms time window as inputs to the decoding algorithm. Results using the other two time windows considered (300–600 and 300–2,000 ms) were qualitatively the same. In no case was the original decoding value larger than the values of all surrogates. This was true for all sessions in any of the three time windows considered. Consequently, any trial-to-trial correlations among disparate MTL neurons must have been minor. However, we cannot rule out the presence of correlations carrying relevant information that may be reflected in specific time patterns in the spike trains. The role of synchronization might be more evident if it were possible to record units with responses to the same pictures, which was rarely the case.
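The shuffled-surrogate construction can be sketched as follows (a minimal Python illustration; the function name and array layout are ours). Each surrogate is then decoded exactly like the original data, and the original performance is compared against the distribution of surrogate performances:

```python
import numpy as np

def shuffle_within_picture(X, y, rng):
    """Build one surrogate dataset: for each picture, independently
    permute the trial order of each unit's spike counts. This destroys
    trial-by-trial (noise) correlations between units while preserving
    each unit's response distribution to every picture."""
    Xs = X.copy()
    for pic in np.unique(y):
        idx = np.flatnonzero(y == pic)       # trials showing this picture
        for u in range(X.shape[1]):          # permute each unit separately
            Xs[idx, u] = X[rng.permutation(idx), u]
    return Xs
```

If the original decoding performance never exceeds that of all 99 surrogates, as reported above, noise correlations contribute no measurable information under this test.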

### Invariance and generalization

In 15 of the 34 sessions, patients saw between three and eight different views of specific individuals or objects (on average, 3.95 views of 13.53 individuals per session). From these data, we recently reported the presence of invariant units, in the sense that they fired selectively to different views (including line drawings and letter strings) of the same familiar individual, such as the actress Jennifer Aniston, the actress Julia Roberts, and so on (Quian Quiroga et al. 2005). In these 15 sessions, a total of 41 individuals or objects showed an invariant representation. Taking these different pictures as inputs to the decoding algorithm (one individual at a time), we found that in only 8 of 41 cases was the decoding performance significantly larger than chance with *P* < 0.05 (Bernoulli test; see methods), and in only four cases with *P* < 0.01. In general, the decoding algorithm could not distinguish between different pictures of the same individual, reinforcing the idea of an invariant representation by MTL neurons.

One compelling aspect of perception is the ability to deal with novel inputs by means of generalization. Humans can easily recognize a familiar person even though he or she may have a new haircut, wear new clothes, be viewed from a different angle, and so on. Given the invariant representation for individuals described earlier, we reasoned that it might be possible to predict pictures of these individuals even if one particular picture had never been seen by the decoder. To test this, we grouped all but one of the pictures of an individual with an invariant representation into a single class and checked whether the remaining picture was correctly predicted to belong to this class. For example, we had seven pictures of the actress Jennifer Aniston and established whether presentations of picture 1 were recognized as belonging to the same image class as pictures 2 to 7 (the same procedure was repeated for picture 2 and so on). Of the 41 individuals or objects showing invariance, presentations of 21 were correctly predicted based on the unit responses to the other pictures of the same person or object (*P* < 0.05, Bernoulli test; see methods).
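The leave-one-view-out test can be sketched as below. The nearest-class-mean classifier here is an illustrative stand-in for the actual decoder (described in methods), and the array layout is an assumption.

```python
import numpy as np

def generalizes(views, other_classes):
    """Leave-one-view-out test of generalization.

    `views` is an (n_views, n_neurons) array of mean responses to the
    different pictures of one individual; `other_classes` is a list of
    (n_views_k, n_neurons) arrays for the remaining individuals.  Each
    held-out view is assigned to the class with the nearest mean response;
    returns the fraction of views predicted as their own individual.
    """
    hits = 0
    for i in range(len(views)):
        held_out = views[i]
        rest = np.delete(views, i, axis=0)
        # Class 0 is built from the individual's remaining views only.
        means = [rest.mean(axis=0)] + [c.mean(axis=0) for c in other_classes]
        dists = [np.linalg.norm(held_out - m) for m in means]
        hits += int(np.argmin(dists) == 0)
    return hits / len(views)
```

Crucially, the held-out picture never contributes to its own class mean, so correct prediction requires that different views of the same individual evoke similar population responses.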

## DISCUSSION

We previously reported that responses of MTL neurons are highly selective, with on average only about 3% of the presented pictures eliciting significant activations (Quian Quiroga et al. 2005). However, this result depends on the definition of what is considered a response. To avoid this dependence on a particular threshold for defining a significant response, in the current study we quantified the degree of selectivity with a new measure, *S*. Two other measures that are independent of the definition of responsiveness had been proposed earlier (Olshausen and Field 2004; Rolls and Tovee 1995). In the first case, selectivity is assessed from the kurtosis of the distribution of responses. However, this measure is appropriate only when the distribution of responses is symmetric, which is not the case for low-firing neurons. Moreover, distributions with different widths can give the same selectivity value (if they have the same kurtosis), which is not desirable. In the current study we used the measure proposed by Rolls and Tovee (1995) for comparison. This measure depends nonlinearly on the number of responses and, as a consequence, very different configurations of responses can yield similar selectivity values. In contrast, the measure we introduced is linear in the number of responses. Moreover, it gives an intuitive graphical representation of the selectivity of the neurons' responses, as shown in Figs. 1 and 5.
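For concreteness, the two earlier measures can be sketched as follows: the Rolls and Tovee (1995) sparseness, *a* = (Σ*r*/*n*)²/(Σ*r*²/*n*), and the excess kurtosis of the response distribution. The example firing rates are hypothetical.

```python
import numpy as np

def rolls_tovee_sparseness(rates):
    """Sparseness a = (sum(r)/n)^2 / (sum(r^2)/n) (Rolls and Tovee 1995).

    a tends to 1/n for a neuron responding to a single picture and to 1
    for a neuron responding equally to all; low values = high selectivity.
    """
    r = np.asarray(rates, dtype=float)
    n = r.size
    return (r.sum() / n) ** 2 / (np.sum(r ** 2) / n)

def kurtosis(rates):
    """Excess kurtosis of the response distribution (cf. Olshausen and Field 2004)."""
    r = np.asarray(rates, dtype=float)
    z = (r - r.mean()) / r.std()
    return np.mean(z ** 4) - 3.0

# Hypothetical firing rates (Hz) to 20 pictures: one highly selective
# unit versus one responding equally to everything.
selective = [20.0] + [0.1] * 19
flat = [5.0] * 20
```

The selective unit gives a sparseness near 1/*n* and a large positive kurtosis, while the flat unit gives a sparseness of exactly 1, illustrating the two ends of the selectivity scale.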

Previous decoding approaches studied how relevant information about the stimuli is represented by the activity of a population of neurons, for instance, predicting the position of a rat in its environment from recordings in hippocampus (Brown et al. 1998; Wilson and McNaughton 1993; Zhang et al. 1998), arm movements (Georgopoulos et al. 1986; Musallam et al. 2004; Quian Quiroga et al. 2006; Serruya et al. 2002; Taylor et al. 2002; Wessberg et al. 2000), and saccades (Quian Quiroga et al. 2006; Scherberger et al. 2005) from sensorimotor cortices in monkeys, and image presentations from spiking activity in monkey inferior temporal cortex (Hung et al. 2005).

Here we demonstrated the feasibility of decoding images from the activity of a few responsive neurons in the human MTL. This decoder had several interesting properties. First, it was largely based on an average of 4.47 spikes between 300 and 600 ms after stimulus onset. Second, in general its performance increased linearly with the number of neurons, within the range of the number of units we considered. Third, by using a simple shuffling procedure we established that trial-to-trial correlations in the response strength between the units did not play an important role in our data, in agreement with findings in animals (for reviews see Averbeck et al. 2006; Oram et al. 1998). It remains to be studied whether precise timing mechanisms may be present. Fourth, predictions of which pictures were presented in each trial were significantly better when considering units after spike sorting than when taking all detected events on each channel as multiunit (unsorted) activity. Fifth, decoding performances were statistically the same when considering all units or only the responsive ones. This shows that nonresponsive units did not carry any additional information for decoding. Sixth, it was in general not possible to decode which of the pictures with an invariant representation (e.g., different views of a well-known actress) were presented. This result further supports our previous claims of invariance in MTL neurons (Quian Quiroga et al. 2005), now using a decoding approach. Finally, also owing to the invariant representation given by MTL neurons, the decoder was capable of generalization; i.e., neuronal responses to images of an individual could be used to decode other images of the same individual that the decoder had not previously seen.

A quantification of the decoding results using information theory showed that each responsive neuron carried on average 0.25 bits of information. Values around 0.3–0.5 bits per neuron have been reported in cortical visual areas in monkeys (Optican and Richmond 1987; Rolls et al. 1997). Higher values were found when considering temporal patterns instead of only the average firing (Optican and Richmond 1987). However, we note that these values were obtained with different stimulus sets and thus have different saturation limits.
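Per-neuron information values of this kind can be derived from a decoding confusion matrix. A minimal sketch of the standard mutual-information computation, assuming the matrix holds joint probabilities of presented and predicted pictures (the authors' exact estimator, e.g. any bias correction, is not specified here):

```python
import numpy as np

def decoded_information(confusion):
    """Mutual information (bits) between presented and predicted picture.

    `confusion[i, j]` is the joint probability that picture i was
    presented and picture j predicted (all entries sum to 1).
    I = sum_ij p(i, j) * log2( p(i, j) / (p(i) * p(j)) ).
    """
    p = np.asarray(confusion, dtype=float)
    ps = p.sum(axis=1, keepdims=True)   # marginal over presented pictures
    pr = p.sum(axis=0, keepdims=True)   # marginal over predicted pictures
    nz = p > 0                          # zero entries contribute nothing
    return float(np.sum(p[nz] * np.log2(p[nz] / (ps * pr)[nz])))

# A perfect decoder of two equiprobable pictures transmits exactly 1 bit.
perfect_two = np.array([[0.5, 0.0], [0.0, 0.5]])
```

This also makes the saturation point explicit: with *n* equiprobable pictures the information cannot exceed log2(*n*) bits, which is why values from different stimulus sets are not directly comparable.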

In contrast to distributed codes in which information is represented implicitly by the firing activity of large populations of cells (Abbott et al. 1996; Haxby et al. 2001), these data are in agreement with the existence of a sparse, invariant, and explicit representation in the MTL, in the sense that the identity of individuals or objects may be represented by a small number of neurons (Koch 2004; Quian Quiroga et al. 2005). This sparse and invariant representation by MTL neurons is reminiscent of Barlow's theory of “cardinal cells” (Barlow 1972), and it is likely to play a key role in the transformation of visual percepts into long-term and abstract memories. This view is also supported by the long latency of the MTL responses. In particular, the peak decoding performance with MTL neurons was considerably later than the peak of around 130 ms found in monkey IT (Hung et al. 2005). The possibility of reading out information from simultaneously recorded neurons in humans is of considerable value for assessing the feasibility and constraints of brain–machine interfaces, so-called neuroprosthetic devices (Andersen et al. 2004; Musallam et al. 2004; Nicolelis 2001; Serruya et al. 2002; Taylor et al. 2002; Wessberg et al. 2000). Our study shows that such decoding is possible in patients, despite the nonoptimal clinical recording conditions, short experimental sessions, and lack of training.

## GRANTS

This work was supported by grants from the National Institutes of Health, National Science Foundation, Defense Advanced Research Projects Agency, The Engineering and Physical Sciences Research Council and the Life Sciences Interface Programme (UK), the Office of Naval Research, the MindScience Foundation, the Gordon Moore Foundation, the Sloan Foundation, and the Swartz Foundation for Computational Neuroscience.

## Acknowledgments

We thank all the patients who participated and E. Behnke, T. Fields, E. Ho, E. Isham, A. Kraskov, I. Viskontas, and C. Wilson for technical assistance.

## Footnotes

1 The online version of this article contains supplemental data.

The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked “*advertisement*” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

- Copyright © 2007 by the American Physiological Society