Understanding how single cortical neurons discriminate between sensory stimuli is fundamental to providing a link between cortical neural responses and perception. The discrimination of sensory stimuli by cortical neurons has been intensively investigated in the visual and somatosensory systems. However, relatively little is known about discrimination of sounds by auditory cortical neurons. Auditory cortex plays a particularly important role in the discrimination of complex sounds, e.g., vocal communication sounds. The rich dynamic structure of such complex sounds on multiple time scales motivates two questions regarding cortical discrimination. How does discrimination depend on the temporal resolution of the cortical response? How does discrimination accuracy evolve over time? Here we investigate these questions in field L, the analogue of primary auditory cortex in zebra finches, analyzing temporal resolution and temporal integration in the discrimination of conspecific songs (songs of the bird's own species) for both anesthetized and awake subjects. We demonstrate the existence of distinct time scales for temporal resolution and temporal integration and explain how they arise from cortical neural responses to complex dynamic sounds.
How accurately do single cortical neurons discriminate between different sensory stimuli? Addressing this fundamental question is an important step toward understanding the relationship between cortical neural responses and perception (Parker and Newsome 1998). This question has been investigated intensively in the visual and somatosensory cortices, e.g., motion-direction discrimination in area MT (Parker and Newsome 1998) and flutter discrimination in somatosensory cortex (Romo and Salinas 2003). In the auditory system, this question has been probed extensively in the auditory periphery, e.g., intensity and frequency discrimination in the auditory nerve (Delgutte 1995). However, little is known about neural discrimination in auditory cortex.
Auditory cortex plays an important role in the perception of complex sounds, e.g., vocal communication sounds and speech (Fitch et al. 1997; Nelken 2004; Rauschecker 1998; Wang 2000). Lesions of auditory cortex cause a deficit in the perception of speech in humans and vocal communication sounds in animals (Heffner and Heffner 1986; Penfield and Roberts 1959), suggesting an important role for auditory cortex in the discrimination of complex sounds. Yet remarkably little is known about the discrimination of such sounds by auditory cortical neurons. Many natural sounds including vocal communication sounds display striking time-varying structure over multiple time scales (Attias and Schreiner 1998; Escabi et al. 2003; Lewicki 2002; Nelken et al. 1999; Singh and Theunissen 2003). This rich dynamic structure motivates two questions regarding the time scales underlying cortical discrimination of complex stimuli. How does discrimination depend on the temporal resolution of cortical neural responses? How does discrimination evolve over time?
Here we address these questions in songbirds, a model system that offers unique advantages for studying the neural discrimination of complex sounds with particular relevance to human speech (Doupe and Kuhl 1999). We investigate neural discrimination of conspecific songs (songs of the bird's own species) in field L, the avian analogue of primary auditory cortex (ACx), which is likely to play a critical role in the discrimination of conspecific songs (Grace et al. 2003; Sen et al. 2001; Theunissen et al. 2000). Specifically, we quantify the temporal resolution and temporal integration in the discrimination of complex sounds.
We recorded from the field L region of anesthetized and awake adult male zebra finches (Taeniopygia guttata). All procedures were in strict accordance with the National Institutes of Health guidelines as approved by the Boston University Charles River Campus Institutional Animal Care and Use Committee. Two days prior to the electrophysiological recording, the bird was anesthetized (0.1–4% isoflurane in 0.5–2.5 l/min O2) for a preparatory surgical procedure to mark the locations of electrode penetrations and fix a head-support pin. A reference point for electrode penetrations was marked with ink 1.5 mm lateral and 1.2 mm anterior to the bifurcation point of the midsagittal sinus, and a steel support pin was glued to the skull. The bird was allowed to recover for 2 days before the experiment.
All surgical procedures done for the awake recordings were similar to those performed for the anesthetized recordings except where noted as follows. After the ink mark was made at the estimated location of field L, a lightweight microdrive containing two extracellular tungsten electrodes (impedance: 2–4 MΩ, FHC, Bowdoinham, ME) was positioned above the marked dot. The microdrive was a slightly modified version of a previous microdrive used to record from awake zebra finches (Hessler and Doupe 1999). The skull and dura beneath the dot were removed, and the implant was positioned such that the electrodes just entered the brain. Then the implant was secured to the skull with epoxy. A reference ground electrode was inserted into the brain on the opposite hemisphere from the location of the implant. Finally, the steel support pin was fixed as described in the preceding text.
The stimulus ensemble consisted of 20 undirected, conspecific zebra finch songs recorded in a sound-attenuated chamber (Acoustic Systems, Austin, TX). The songs were sampled at 32 kHz, band-pass filtered to retain frequencies between 250 Hz and 8 kHz, and stored in datafiles for playback (Sen et al. 2001).
The methods were similar to those previously described (Sen et al. 2001; Theunissen et al. 2000). On the day of the experiment, the bird was anesthetized with three intramuscular injections of 20% urethan administered at half-hour intervals (75–90 μl total). The bird was then placed in a double-walled sound-attenuated chamber (Industrial Acoustics, Bronx, NY), facing the loudspeaker that was used for stimulus presentation. The speaker was located 20 cm away from the beak, and the bird was elevated to be at the same level as the center of the speaker cone. The bird's head position was fixed by attaching the steel pin to a frame located on the stereotaxic assembly. A reference electrode (Teflon-coated silver wire) was implanted in the brain close to the recording site. Extra-cellular tungsten electrodes (impedance: 2–4 MΩ) were lowered into the brain using a micromanipulator. Neural responses were recorded at ∼100 μm interval depths. The conspecific song stimuli were played at a peak intensity of 75 dB SPL and randomly interleaved to obtain 10 trials of responses to each song. The electrophysiological signal was amplified, filtered, digitized, and stored on disk for further analysis. In some experiments, the electrode was repositioned for multiple passes. After each recording pass, small lesions were made along the electrode track for later reconstruction of the recording sites. At the end of the recording, the bird was killed with isoflurane, and the brain was preserved in 3.7% formalin fixative for histology.
AWAKE RESTRAINED RECORDINGS.
On the day of the experiment, the bird was restrained in a small cloth jacket to restrict movement and reduce motion artifacts and was then placed into the stereotaxic assembly as described in the preceding text. Different single- and multiunit complexes in the same adult zebra finch were probed by manually advancing the two tungsten electrodes via the microdrive in ∼150 μm intervals. Prior to the creation of the implants, the microdrive was calibrated to determine the amount by which the electrodes advance with each turn of the screw; in this way, the depth of the electrodes relative to the surface of the brain could be estimated. After each recording session, which lasted 2–3 h, the bird was given a 30-min break and released from its jacket into its cage. Data from a particular site were obtained within the same recording session. Different sites were sampled over several days to weeks. After the experiment was concluded, the bird was killed, and the brain was stored for histology as in the preceding text.
Histology and classification of recording sites
Prior to sectioning, the brains were stored in 30% sucrose buffer overnight. Parasagittal 50 μm sections of the brain were prepared using a cryo-microtome and stained with cresyl violet (Nissl stain). Electrode placement was verified by comparing electrode tracks and electrolytic lesions to histological markers that define the boundaries of field L (Fortune and Margoliash 1992). Sites were classified as field L sites based on a combination of histology, medial-lateral coordinates and depth of the site. The estimated spectral temporal receptive fields (STRFs) at each site were also consistent with their location in field L (Sen et al. 2001).
Of all the sites we probed, those exhibiting an average firing rate significantly different (P < 0.01, paired t-test) from the average spontaneous firing rate for at least one song stimulus were included in the analysis (n = 38). Of these, single- and multiunit activity was recorded at 24 sites in field L from 11 anesthetized birds (6 single units, 18 multiunits). Multiunit activity was obtained from 14 sites in five awake-restrained birds. Spike event times were obtained from the spike waveforms using a window discriminator. Classification of sites into single units and multiunits followed the scheme of Sen et al. (2001). Cases where the spike waveform had a single reliable and stereotyped shape were classified as single units and confirmed using a custom-made spike-sorting algorithm. Multiunits consisted of spike waveforms that could be easily distinguished from the background noise but not from each other and contained small clusters of approximately two to five neurons.
We quantified the dissimilarity between pairs of spike trains using a recently proposed spike distance metric (SDM) (van Rossum 2001). First, each spike train was filtered using a decaying exponential kernel with time constant τ

f(t) = \sum_{i=1}^{M} H(t - t_i)\, e^{-(t - t_i)/\tau}   (1)

where t_i is the ith spike time, M is the total number of spikes, and H(t) is the Heaviside step function. The spike distance was then computed as the Euclidean distance between a pair of filtered spike trains, f and g

D(f, g) = \left[ \frac{1}{\tau} \int_0^{\infty} [f(t) - g(t)]^2 \, dt \right]^{1/2}   (2)

A significant advantage of the SDM is that, by varying τ, we could quantify discrimination over different time scales of the neural response. At short time scales the metric acts like a "coincidence detector," with small differences in spike timing contributing to the distance, whereas at long time scales it acts like a "rate difference counter," where average firing rates contribute to the distance.
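To make the metric concrete, the computation in Eqs. 1 and 2 can be sketched numerically as follows. This is an illustrative reconstruction, not the authors' code; the function names and the discretization step dt are our own assumptions.

```python
import numpy as np

def filter_spike_train(spike_times, tau, t_max, dt=1e-4):
    """Eq. 1: f(t) = sum_i H(t - t_i) * exp(-(t - t_i)/tau),
    evaluated on a discrete time grid of step dt (hypothetical choice)."""
    t = np.arange(0.0, t_max, dt)
    f = np.zeros_like(t)
    for ti in spike_times:
        mask = t >= ti                       # Heaviside step H(t - t_i)
        f[mask] += np.exp(-(t[mask] - ti) / tau)
    return f

def spike_distance(spikes_a, spikes_b, tau, t_max, dt=1e-4):
    """Eq. 2: van Rossum (2001) distance, the Euclidean distance between
    the two filtered trains, D = sqrt((1/tau) * integral (f - g)^2 dt)."""
    f = filter_spike_train(spikes_a, tau, t_max, dt)
    g = filter_spike_train(spikes_b, tau, t_max, dt)
    return float(np.sqrt(np.sum((f - g) ** 2) * dt / tau))
```

With this normalization, identical trains have distance 0 and two well-separated single spikes have distance ≈1; a small τ penalizes spike-timing differences (the coincidence-detector regime), whereas a large τ is sensitive only to differences in spike counts (the rate-counter regime).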
We then used a classification scheme based on the SDM to quantify the neural discrimination of songs (Machens et al. 2003). Ten trials of spike trains were obtained at each site for each of the 20 songs. A template spike train was chosen for each of the songs, and the remaining spike trains were assigned to the song with the closest template based on the spike distance measure. This procedure was repeated 1,000 times for different templates. The percentage of correctly classified songs (% correct) was used as a measure of discrimination. The chance level for classification was 5% because a spike train could be assigned to 1 of 20 songs. The data points for percent correct versus spike train length (Fig. 3) were fit with single exponentials by minimizing the least-squares error. The resulting fits had mean rms error percentages of 6.7% (anesthetized multiunit sites), 3.4% (anesthetized single-unit sites), and 5.9% (awake multiunit sites).
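A minimal sketch of this template-matching classifier is given below, assuming spike trains are stored per song and per trial. The helper names and the toy count-based distance used in the test are our illustrations, not the paper's implementation; in practice the SDM would be supplied as the distance function.

```python
import random
import numpy as np

def classify_percent_correct(trains, distance, n_repeats=100, seed=0):
    """Template-based classification (after Machens et al. 2003).
    trains: list over songs of lists over trials of spike-time sequences.
    distance: callable d(train_a, train_b) -> float (e.g., the SDM).
    On each repeat, pick one random template per song, then assign every
    remaining spike train to the song with the nearest template."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        templates = [rng.choice(song_trials) for song_trials in trains]
        correct = total = 0
        for s, song_trials in enumerate(trains):
            for trial in song_trials:
                if trial is templates[s]:
                    continue                  # exclude the template itself
                dists = [distance(trial, tmpl) for tmpl in templates]
                if int(np.argmin(dists)) == s:
                    correct += 1
                total += 1
        scores.append(100.0 * correct / total)
    return float(np.mean(scores))
```

Chance performance is 100/N percent for N songs (5% for the 20-song ensemble used here), since an uninformative assignment picks among N templates at random.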
We recorded neural responses to an ensemble of conspecific songs in field L of anesthetized and awake-restrained adult male zebra finches. Figure 1 shows the neural responses to two songs from three different sites, illustrating different types of responses in the data. A striking feature of some of the responses is the precision of firing, as seen in the alignment of spikes in the spike raster and the sharp peaks in the peristimulus time histogram (PSTH; e.g., Fig. 1, site 1). Precise spiking is a prominent feature of auditory cortical responses (Elhilali et al. 2004; Heil 1997; Sen et al. 2001; Wehr and Zador 2003); however, no study has directly evaluated the contribution of precise timing in the cortical discrimination of complex sounds.
Optimal temporal resolution for discrimination
We quantified the discriminability of spike trains using the SDM (see methods). The distributions of spike distances within songs (a measure of response variability) and across songs (a cue for discrimination) vary with the temporal resolution (Fig. 2A). This results in discrimination accuracy depending critically on the temporal resolution at which spike trains are evaluated (Fig. 2B), reaching the optimal level at ∼10 ms, and falling significantly below this level both at 1 and 1,000 ms.
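The search for the optimal time scale can be sketched as a sweep over candidate kernel time constants, assuming a `classify` callable that returns percent correct for a given τ (the log-spaced grid from 1 ms to 1 s is an illustrative choice, not the paper's exact grid):

```python
import numpy as np

def sweep_temporal_resolution(classify, taus=None):
    """Evaluate classification accuracy at each candidate SDM time constant
    and return (tau_opt, accuracy curve). `classify` maps tau (s) to the
    percent of songs correctly identified at that temporal resolution."""
    if taus is None:
        taus = np.logspace(-3, 0, 13)   # 1 ms to 1 s, log-spaced
    acc = np.array([classify(tau) for tau in taus])
    return taus[int(np.argmax(acc))], acc
```

A band-shaped accuracy curve, peaking at an intermediate τ and falling off toward both the coincidence-detector and rate-counter extremes, is exactly the pattern reported in Fig. 2B.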
Temporal dynamics of discrimination
To characterize the evolution of discrimination as the neural response accumulates, we computed the percentage of songs correctly identified as a function of increasingly longer time windows of the spike train (Fig. 3). Discrimination begins at chance level, improves steadily after the onset of the songs, and reaches a plateau for long durations.
Time scales of temporal resolution and temporal integration
To quantify the temporal resolution and the time course of integration, we calculated the optimal temporal resolution for discrimination, τopt (Fig. 2B), and the time constant of single-exponential fits to the temporal integration data, τC (Fig. 3), respectively. Figure 4 shows the distribution of these two time scales in our data. The median values for τopt and τC were 13 and 597 ms, respectively. The difference between these distributions is highly significant (P ≪ 0.001).
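The extraction of τC can be sketched with a simple least-squares grid search. This is a hedged reconstruction: the paper specifies least-squares single-exponential fits but not this particular parameterization or solver, so the saturating-exponential form and the grid are our assumptions.

```python
import numpy as np

def fit_integration_constant(t, pc, chance=5.0, taus=None):
    """Least-squares single-exponential fit of
       pc(t) ~ chance + (p_max - chance) * (1 - exp(-t / tau_c)).
    Grid-searches tau_c; the amplitude is solved in closed form per tau."""
    if taus is None:
        taus = np.linspace(0.01, 2.0, 400)   # candidate tau_c values (s)
    t = np.asarray(t, float)
    y = np.asarray(pc, float) - chance       # accuracy above chance
    best = (np.inf, None, None)
    for tau in taus:
        basis = 1.0 - np.exp(-t / tau)
        # closed-form least-squares amplitude for y = a * basis
        a = np.dot(basis, y) / np.dot(basis, basis)
        err = np.sum((y - a * basis) ** 2)
        if err < best[0]:
            best = (err, tau, chance + a)
    return best[1], best[2]   # (tau_c, asymptotic percent correct)
```

On data following the assumed saturating form, the fit recovers both the integration time constant and the plateau accuracy.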
We compared three subgroups in the dataset: anesthetized multiunit, anesthetized single unit, and awake multiunit (see methods). Median values and interquartile ranges for the firing rates of the three groups, relative to the background firing rates, were 21.6 spikes/s (12.3–33.0), 12.9 spikes/s (10.5–24.9), and 21.8 spikes/s (10.7–31.3), respectively, and were not significantly different [Kruskal-Wallis (KW) test, P = 0.67]. These rates are higher than average firing rates reported in previous studies (Grace et al. 2003; Sen et al. 2001; Theunissen et al. 2000). However, those studies included sites from the caudal mesopallium, which would have decreased the overall mean rates. Moreover, in a diverse region such as field L, random differences in sampling could lead to substantial variability across different data sets. The median values of τopt for the anesthetized multiunit, anesthetized single-unit, and awake multiunit recordings were 13 ms (10–16), 16 ms (13–16), and 10 ms (8–20), respectively, and were not significantly different (KW test, P = 0.52). The median values of τC for the anesthetized multiunit, anesthetized single-unit, and awake multiunit recordings were 602 ms (550–732), 597 ms (511–724), and 558 ms (484–678), respectively. These values were not significantly different (KW test, P = 0.65). The median maximal accuracies for the anesthetized multiunit, anesthetized single-unit, and awake multiunit recordings were also not significantly different (KW test, P > 0.05).
Cortical discrimination of complex sounds
Although discrimination by single neurons has been probed extensively in the visual and somatosensory cortices (Parker and Newsome 1998) and the auditory periphery (Delgutte 1995), surprisingly little is known about neural discrimination of sounds in auditory cortex. The critical role of auditory cortex in the perception of complex sounds suggests that auditory cortical neurons may play an important role in discriminating between complex sounds. Yet to our knowledge, only one previous study has examined neural discrimination of complex sounds based on the average cortical population activity in primary auditory cortex (Orduña et al. 2005). In songbirds, previous studies at the cortical level have quantified information transmitted by single neurons (Hsu et al. 2004; Wright et al. 2002), discriminability between different categories of natural and synthetic sounds based on mean firing rates (Amin et al. 2004; Grace et al. 2003), and discriminability between different categories of natural sounds by ensembles of neurons (Woolley et al. 2005). This is the first study to quantify the contribution of spike timing to the discrimination of complex sounds by single neurons at the cortical level and the time scales underlying such discrimination. Although the number of single units in this study was relatively small, we obtained similar results from the multiunits, which consisted of small clusters of neurons dominated by a single unit (see methods).
Spike timing versus firing rates in cortical discrimination
The issue of whether spike timing information can contribute to cortical discrimination of sensory stimuli has been controversial in the visual and somatosensory cortices (Parker and Newsome 1998; Romo and Salinas 2003). An important distinction between most previous studies of discrimination in the visual cortex and this study is that the previous studies employed stimuli with constant amplitude, e.g., motion stimuli with a fixed velocity over time, whereas this study employed stimuli with time-varying structure. Although our results do not rule out or support rate-based neural codes for discrimination of constant stimuli, they raise the possibility that the use of dynamic stimuli, e.g., movies of natural scenes may reveal an important contribution of spike timing in cortical discrimination. Indeed, previous studies demonstrating an increase in the reliability of cortical neural responses in response to dynamic stimuli (de Ruyter van Steveninck et al. 1997; Mainen and Sejnowski 1995) are consistent with this idea because an increase in reliability of cortical responses would be expected to improve discrimination. Previous studies in auditory cortex have demonstrated the high degree of temporal precision in auditory cortical responses (DeWeese et al. 2003; Elhilali et al. 2004; Heil 1997; Heil and Irvine 1997) and the contribution of spike timing in the information transmitted by cortical neurons about time-varying stimuli (Lu and Wang 2004; Nelken et al. 2005; Wright et al. 2002) and sound location (Furukawa and Middlebrooks 2002). This study demonstrates the significant contribution of spike timing to the discrimination of complex sounds by single cortical neurons.
Although initial studies of flutter discrimination in the somatosensory system appeared to suggest a neural code based on fine temporal structure, i.e., differences in the periodicity of neural responses, subsequent studies have challenged this view reporting cortical neurons that strongly modulate their firing rates in response to the different stimuli and could thereby mediate accurate discrimination (Romo and Salinas 2003). Previous studies in the auditory cortex of awake animals have revealed a class of neurons that appear to use a rate code to encode time-varying stimuli (Lu and Wang 2004; Lu et al. 2001). We also recorded neural responses in awake birds in this study. However, we did not find neurons where the discrimination performance based on average rates was comparable to the performance at the optimal temporal resolution. Although we cannot rule out the possibility that we did not sample such neurons, there is an alternative explanation. The auditory cortical neurons employing a rate code in the studies by Lu et al. encoded amplitude modulations at relatively high frequencies, whereas vocal communication sounds, such as those used in this study, typically contain relatively slow modulations. As pointed out in the studies by Lu et al., coding based on spike timing and firing rates can complement each other in different stimulus regimes. Vocal communication sounds with relatively slow modulations may be particularly well suited to exploit the temporal precision of cortical responses.
Optimal temporal resolution for discrimination
Much of the debate on neural coding has focused on the question of whether neurons encode stimuli in their average firing rates or precise spike times (Abeles 1991; Rieke et al. 1997; Shadlen and Newsome 1994; Softky and Koch 1993). However, a more unified view of neural coding may emerge if we pose a more general question: what is the optimal temporal resolution for the neural code? Despite its fundamental nature, only a handful of studies have addressed this question (Chichilnisky and Kalmar 2003; Machens et al. 2003; Rieke et al. 1997). These studies have been performed at the sensory periphery. Using a classification method based on the SDM, we found that the optimal temporal resolution for cortical discrimination was ∼10 ms (Fig. 2). This time scale, which is intermediate to the “coincidence detector” and “rate-difference counter” regimes, permits averaging of the neural response over a time window that is sufficiently large to reduce noise due to neuronal jitter but is small enough to avoid loss of significant temporal structure. Surprisingly, the optimal temporal resolution we found at the cortical level was comparable to the optimal temporal resolution at the sensory periphery. Previous studies in auditory cortex have demonstrated the high degree of temporal precision in auditory cortical responses (DeWeese et al. 2003; Elhilali et al. 2004; Heil 1997; Heil and Irvine 1997), and the contribution of spike timing in the information transmitted by cortical neurons about time-varying stimuli (Lu and Wang 2004; Nelken et al. 2005) and sound location (Furukawa and Middlebrooks 2002). It is important to emphasize that the notion of the optimal temporal resolution for a detection or discrimination task is distinct from the temporal precision of neural responses, and these two quantities need not be the same for a neural code (Chichilnisky and Kalmar 2003). 
Our study extends and complements previous work on cortical coding by quantifying the optimal temporal resolution for cortical discrimination.
Temporal dynamics of discrimination and time scale of integration
Although previous studies have typically examined discrimination performance averaged over the entire stimulus duration, there is growing interest in the temporal dynamics of cortical detection, discrimination, and information transmission (Cook and Maunsell 2002; Gold and Shadlen 2001; Osborne et al. 2004). Such analyses provide information about the speed with which neural performance accuracy accumulates and can ultimately be related to measures of the speed of perception such as reaction times. Our analysis of the temporal dynamics of discrimination revealed a distinct range for the time scale of integration. This range was relatively long, on the order of hundreds of milliseconds, and significantly different from the optimal temporal resolution for discrimination of ∼10 ms (Fig. 4).
Neural computations underlying distinct time scales
Our results suggest two classes of neural computations underlying cortical discrimination of complex sounds: neural computations that provide precise temporal information and neural computations that extract information accumulated over hundreds of milliseconds, while maintaining the information present in precise timing. Primary auditory areas are likely to contribute to the temporal precision of responses through mechanisms such as delayed inhibition (Wehr and Zador 2003) and/or synaptic depression (Wehr and Zador 2005). Less is known about the second class of computations, which may occur in downstream areas. These computations may involve integration of sensory information over long time scales and could potentially contribute to cortical decision making (Gold and Shadlen 2001). Alternatively, such computations may extract information from a vector of multiple samples or "looks" of sensory information stored in working memory, without explicit integration of sensory signals (Viemeister and Wakefield 1991).
Distinct time scales in auditory perception
A long-standing puzzle in auditory psychophysics is the “resolution-integration” paradox, which refers to the discrepancy in the time scales underlying two types of perceptual tasks (de Boer 1985; Green 1985; Viemeister and Wakefield 1991). Tasks probing temporal integration of sounds have typically found relatively long time scales of a few hundred milliseconds, whereas tasks probing temporal resolution have uncovered a much finer time scale around tens of milliseconds. Cognitive theoretical proposals for distinct cortical time scales underlying the perception of complex sounds, e.g., speech and music have recently received experimental support from functional magnetic resonance imaging experiments (Boemio et al. 2005; Poeppel 2003; Zatorre et al. 2002).
In this study we also found distinct time scales of temporal resolution and temporal integration in discrimination of birdsongs by single cortical neurons. It is difficult to directly compare our results to the perceptual studies for several reasons. First, our study focused on the performance of single cortical neurons in a discrimination task. The analysis of discrimination by neural populations will allow a more complete assessment of the relationship between cortical responses and perception of complex sounds. However, a rigorous quantitative analysis of single-neuron performance is critical to understanding the link between neural and behavioral levels, and our results provide constraints for candidate neural codes underlying perception based on single neurons or pooling models based on the most sensitive neurons, e.g., the lower envelope principle (Parker and Newsome 1998). Second, previous studies used different stimuli, which could lead to quantitative differences in the time scales for temporal resolution and temporal integration. Nevertheless, our analysis reveals how distinct time scales of temporal resolution and temporal integration can arise from cortical neural responses to complex dynamic sounds. These disparate time scales provide a cortical analogue of the resolution-integration paradox, i.e., the apparent contradiction that systems integrating information over long time scales nevertheless appear to maintain sensitivity to fine temporal resolutions.
This work was supported by National Institute on Deafness and Other Communication Disorders Grant 1R01 DC-007610-01A1.
We thank L. Abbott and J. Fritz for comments on the manuscript and S. Colburn, R. Gütig, M. Shamir, and H. Sompolinsky for discussions.
Copyright © 2006 by the American Physiological Society