The Journal of Neurophysiology Vol. 87 No. 4 April 2002, pp. 1723-1737
Copyright ©2002 by the American Physiological Society
1Department of Bioengineering, University of Utah, Salt Lake City, Utah 84112-9458; 2Coleman Memorial Laboratory and W. M. Keck Center for Integrative Neuroscience, Department of Otolaryngology, University of California, San Francisco, California 94143-0732; and 3Departments of Neuroscience and Otolaryngology, University of Florida, Gainesville, Florida 32610-0244
| |
ABSTRACT |
|---|
|
|
|---|
Nagarajan, Srikantan S.,
Steven W. Cheung,
Purvis Bedenbaugh,
Ralph E. Beitel,
Christoph E. Schreiner, and
Michael M. Merzenich.
Representation of Spectral and Temporal Envelope of Twitter
Vocalizations in Common Marmoset Primary Auditory Cortex.
J. Neurophysiol. 87: 1723-1737, 2002.
Cortical sensitivity in representations of behaviorally relevant
complex input signals was examined in recordings from primary auditory
cortical neurons (AI) in adult, barbiturate-anesthetized common
marmoset monkeys (Callithrix jacchus). We studied the
robustness of distributed responses to natural and degraded forms of
twitter calls, social contact vocalizations comprising several
quasi-periodic phrases of frequency modulation (FM) and amplitude modulation (AM). We recorded neuronal
responses to a monkey's own twitter call (MOC), degraded forms of
their twitter call, and sinusoidal amplitude modulated (SAM) tones with
modulation rates similar to those of twitter calls. In spectral
envelope degradation, calls with narrowband channels of varying
bandwidths had the same temporal envelope as a natural call. However,
the carrier phase was randomized within each narrowband channel. In temporal envelope degradation, the temporal envelope within narrowband channels was filtered while the carrier frequencies and phases remained
unchanged. In a third form of degradation, noise was added to the
natural calls. Spatiotemporal discharge patterns in AI both within and
across frequency bands encoded spectrotemporal acoustic features in the
call although the encoded response is an abstract version of the call.
The average temporal response pattern in AI, however, was significantly
correlated with the average temporal envelope for each phrase of a
call. Response entrainment to MOC was significantly correlated with
entrainment to SAM stimuli at comparable modulation frequencies.
Sensitivity of the response patterns to MOC was substantially greater
for temporal envelope than for spectral envelope degradations. The distributed responses in AI were robust to additive continuous noise at
signal-to-noise ratios ≥10 dB. Neurophysiological data reflecting
response sensitivity in AI to these forms of degradation closely
parallel human psychophysical results on the intelligibility of
degraded speech in quiet and noisy conditions.
| |
INTRODUCTION |
|---|
|
|
|---|
The processing of behaviorally relevant species-specific communication
sounds such as speech, monkey calls, and birdsong, both under quiet and
naturalistic noisy background conditions, is an important aspect of
auditory behavior in vocalizing species. Complex species-specific
vocalizations can be parametrically decomposed into acoustic and
perceptual features such as intensity, spectral envelope, temporal
envelope, carrier frequencies, phase, and pitch. Behavioral and
neuronal sensitivities, at different levels of the central auditory
pathway, to these features and combinations thereof that are present in
complex vocalizations are now beginning to be understood. For example,
results from human psychophysical experiments with speech stimuli have
indicated that temporal envelope modulation features, rather than
spectral envelope features, are extremely critical for identification
and recognition of speech under both quiet and noisy background
conditions (Drullman 1995a; Greenberg and Arai 2001; Kingsbury et al. 1998; Shannon et al. 1995).
Neurophysiological experiments in birds and bats with complex stimuli
have elucidated some organizing principles in the auditory forebrain,
such as spectral and temporal combination sensitivity in discrete cortical
areas and maps. For example, in zebra finches, high-level central
neurons in the auditory forebrain responsible for vocal learning and
adult vocalization discrimination are sensitive to temporal rather than
spectral cues in their stereotypical calls (Margoliash and Fortune 1992; Theunissen and Doupe 1998). In
mustached bats, cortical areas that are involved in echolocation also
participate in the processing of species-specific communication sounds;
neurons are sensitive to a specific combination of features within
complex stimuli (Esser et al. 1997; Kanwal et al. 1994; Ohlemiller et al. 1996). This higher-order
combination sensitivity to features within complex stimuli emerges
hierarchically within the auditory system (Lewicki and Arthur 1996). Generalizations of these organizing principles to other
mammals remain to be established (Rauschecker 1997; Rauschecker et al. 1995; Tian et al. 2001).
The basic organizational and functional features of the representation of simpler acoustic
stimuli in the primary auditory cortex (AI) of ferrets, cats, and
primates are now well understood (Calhoun and Schreiner 1998; Cheung et al. 2001; Eggermont 1991; Heil et al. 1992; Imig et al. 1977; Lu et al. 2001; Mendelson et al. 1997; Recanzone et al. 1999; Schreiner 1991; Schwarz and Tomlinson 1990). In these species, the representation and processing of complex spatiotemporal stimuli,
especially stimuli that are behaviorally relevant to the animal,
are less well understood, particularly in primates. Early studies
in squirrel monkeys have demonstrated that although AI neurons respond
vigorously to species-specific vocalizations, they do not exhibit a
high degree of specificity or selectivity to these stimuli
(Glass and Wollberg 1979, 1983a,b; Manley and Muller-Preuss 1978; Newman and Wollberg 1973a,b; Pelleg-Toiba and Wollberg 1991; Winter and Funkenstein 1973; Wollberg and Newman 1972).
Other studies suggest that complex species-specific
vocalizations are encoded in the discharge patterns of distributed
neuronal populations (Creutzfeldt et al. 1980; Gehr et al. 2000; Rauschecker 1997, 1998a,b; Rauschecker et al. 1995; Rotman et al. 2001; Tian et al. 2001; Wang et al. 1995). Evidence in favor of this "distributed-encoding"
hypothesis has been shown perhaps most clearly in the common
marmoset (Wang et al. 1995), where the
spectrotemporal discharge patterns of spatially distributed neuronal
populations in AI were correlated with the spectrotemporal acoustic
patterns of complex natural vocalizations. Interestingly, a majority of
neurons in AI exhibit a preference for the natural time scale or
modulation frequency of complex vocalizations (Wang et al. 1995), and cortical neurons that are strongly excited
by vocalizations presented in the forward (natural) direction respond
very poorly to reversed-direction forms. More recently, direct evidence
for distributed encoding of species-specific vocalizations in AI has
also been obtained in cats (Gehr et al. 2000; Rotman et al. 2001).
In the current study, we further examine this distributed-encoding hypothesis, in which coherent neuronal subpopulations contribute to the representation of complex stimuli in the marmoset AI. The goals of the current study are to compare quantitatively neural responses to vocalizations with responses to sinusoidal amplitude modulated (SAM) tones of comparable periodicity; to evaluate the robustness of patterns of distributed response profiles to natural and degraded vocalizations, synthesized by changing either the spectral or temporal envelopes; and to measure the response sensitivity of distributed populations of AI neurons to natural and degraded vocalizations in the presence of background noise.
| |
METHODS |
|---|
|
|
|---|
Vocalization recordings and degradations
Vocalizations were recorded using methods similar to those described earlier
(Wang et al. 1995). Briefly, 1-2 weeks before each
experiment, the marmoset under study was placed in a separate
individual cage in the colony. A digital tape recorder with a sampling
rate of 48 kHz, and a 16-bit A/D converter was used to record 2 h
of this monkey's vocalizations each day. These recordings were then
scanned for loud and clear calls, and candidate vocalizations were
subsequently transferred to a computer for editing using MATLAB.
Twitter calls, the specific vocalizations under study here, were
commonly produced by the monkeys in the colony. Calls vocalized by
individual marmosets were highly stereotypical with stable
spectrotemporal features. A stereotypical twitter call for each animal
was chosen from a set of 10-30 vocalizations. The representative
twitter call for each animal was referred to as the natural call or the
"monkey's own call" (MOC). For each MOC, the vocalization
"phrase frequency," fv, was
computed using methods similar to those described by Wang and
colleagues (1995). Briefly, the Hilbert transform of the call was computed to obtain the analytic signal. The absolute value of the
analytic signal was low-pass filtered with a finite-impulse response
(FIR) filter with a cutoff of 100 Hz to obtain an estimate of the
envelope of the call. The power spectrum of the envelope was then
estimated using multi-taper spectral estimation methods. The
frequency at which the multi-taper spectrum had its maximum was chosen to
be the phrase frequency. The MOC was then systematically digitally
degraded in two forms using filter-bank analysis-by-synthesis algorithms. A filter bank was constructed from FIR filters designed with >50 dB attenuation in the stop-band and <0.001 dB ripple in the
pass-band. The filter bank was designed to give perfect reconstruction
of stimuli between frequencies of 4-20 kHz, which covered the spectral
energy of all naturally occurring twitter calls (see Fig.
2A). The number of filters in the filter-bank varied from 2 to 16 bands, corresponding to a bandwidth decrease in each filter from
8 to 0.5 kHz, respectively. Each narrowband signal from the filter bank
was decomposed into an envelope and a carrier waveform.
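The channel widths quoted above follow from dividing the 4-20 kHz range by the number of bands. The sketch below is only an arithmetic check, assuming equal-width, non-overlapping bands (the text gives the frequency range and band counts but not the exact band layout); the function names are illustrative.

```python
# Assumption: the filter bank tiles 4-20 kHz with equal-width,
# non-overlapping bands. Only the range and band counts come from the text.
LOW_KHZ, HIGH_KHZ = 4.0, 20.0


def bandwidth_khz(n_bands, low=LOW_KHZ, high=HIGH_KHZ):
    """Width of each channel when the range is split into n_bands equal bands."""
    return (high - low) / n_bands


def band_edges(n_bands, low=LOW_KHZ, high=HIGH_KHZ):
    """(lower, upper) edge of every channel, in kHz."""
    width = bandwidth_khz(n_bands, low, high)
    return [(low + i * width, low + (i + 1) * width) for i in range(n_bands)]


for n in (2, 4, 8, 16, 32):
    print(f"{n:2d} bands: {bandwidth_khz(n):.2f} kHz per band")
```

Under this assumption, 2 bands give 8 kHz per channel, 16 bands give 1 kHz, and the 0.5-kHz width corresponds to 32 channels, the number used for temporal envelope degradation.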
In spectral envelope degradation, calls were resynthesized with intact envelopes within narrowband channels of varying bandwidths but with the carrier signals modified to band-limited noise for each channel. Multi-tapered spectrogram examples of such a set of degraded calls are shown in the left column of Fig. 1 for 16-band (Fig. 1B), 8-band (Fig. 1C), 4-band (Fig. 1D), and 2-band (Fig. 1E) filters in the filter bank. Stimuli were generated parametrically as a function of the number of bands in the filters. In another form of degradation, within each of 32 narrowband channels, the temporal envelope was modified. However, the carrier frequencies and phases were kept identical to the MOC. This form of MOC manipulation was referred to as "temporal envelope degradation." Multi-tapered spectrogram examples of these degraded calls are shown in the right column of Fig. 1, F-J. The temporal envelope was band-pass filtered between 2 and 30 Hz (BP2-30 call) in Fig. 1F, low-pass filtered at 4 Hz (LP4 call) in Fig. 1H, low-pass filtered at 10 Hz (LP10 call) in Fig. 1G, and high-pass filtered at 60 Hz (HP60) in Fig. 1I.
|
Stimulus generation and delivery
In addition to natural and degraded calls, the stimulus ensemble in our experiments included tone pips, SAMs, clicks, and calls presented in continuous background noise. Tone pips (50-ms duration, 3-ms linear ON-OFF ramp) were used to derive frequency response areas for each recording site. Tone pips were generated on a digital signal processor (TMS 32010). The characteristic frequency (CF), the frequency of a tone to which the neuron is most sensitive, was first estimated using manually adjusted tonal frequencies and intensities. Subsequently, frequency-intensity "response area" was derived by randomly presenting tones at 15 different intensities and 45 different frequencies (ranging over 2-4 octaves) centered at the estimated CF at an inter-stimulus interval of 400-1,000 ms. SAM tones (500-ms duration, 55 dB-A, modulation depth = 100%) were generated by using the estimated CF of each recorded unit or cluster as the carrier frequency and were presented at 10 modulation rates between 2 and 20 Hz, randomly interleaved. Periodic click train sequences of constant 500-ms duration were also generated and presented to one animal in our study. The number of clicks ranged from a single click to 19 clicks over 500 ms to generate stimulus rates of 2-38 Hz in 4-Hz steps. Click stimuli were biphasic, with 200 µs per phase. A particular click train sequence was delivered 15 times with a pause of 1-2 s between successive click train sequences.
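The click-train rates described above can be checked with a few lines of arithmetic. This is a sketch only: the text states the range (1 to 19 clicks in 500 ms) and the 4-Hz step, and the assumption here is that the click count therefore increased in steps of 2.

```python
# Click-train rate = number of clicks / train duration.
DURATION_S = 0.5  # constant 500-ms train duration, from the text


def click_rate_hz(n_clicks, duration_s=DURATION_S):
    """Stimulus rate in Hz for n_clicks delivered within duration_s."""
    return n_clicks / duration_s


# Assumption: odd click counts 1, 3, ..., 19, which yields 4-Hz steps.
click_counts = list(range(1, 20, 2))
click_rates = [click_rate_hz(n) for n in click_counts]
print(click_rates)
```

With these counts the rates run from 2 to 38 Hz in 4-Hz steps, giving 10 click-train conditions.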
Vocalization stimuli (55 dB-A) were generated using a Silicon-Graphics
workstation. Natural and degraded vocalizations were also presented in
continuous white noise. Continuous background noise was generated by a
General Radio Model 1390-B noise generator at fixed intensities of 57 to 27 dB-A corresponding to signal-to-noise ratios (SNR) of
−2 to 28 dB. All experiments were conducted in a double-walled soundproof room
(IAC). Stimuli were presented through a STAX-headphone enclosed in a
small chamber that was connected through a sealed tube into the
external acoustic meatus of the contralateral ear. Both vocalizations
and SAM tones were randomly interleaved and presented with a silent
period of 1 s plus a jitter of
500 ms that was uniformly distributed.
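The relation between noise level and SNR used above is a simple level difference. The following sketch assumes SNR is defined as the call level minus the noise level, both in dB-A; the function names are illustrative and not from the original methods.

```python
CALL_LEVEL_DBA = 55.0  # presentation level of the vocalizations, from the text


def snr_db(noise_level_dba, call_level_dba=CALL_LEVEL_DBA):
    """SNR as the difference between call and noise levels (assumed definition)."""
    return call_level_dba - noise_level_dba


def noise_level_for_snr(target_snr_db, call_level_dba=CALL_LEVEL_DBA):
    """Noise level (dB-A) needed to reach a target SNR for a fixed call level."""
    return call_level_dba - target_snr_db


print(snr_db(57.0), snr_db(27.0))
print([noise_level_for_snr(s) for s in (20.0, 10.0, 0.0)])
```

Under this definition, the loudest noise (57 dB-A) exceeds the 55 dB-A call by 2 dB, and SNRs of 20, 10, and 0 dB correspond to noise levels of 35, 45, and 55 dB-A.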
Surgical preparation
All procedures described in the following text were conducted in
accordance with protocols approved at the University of California at San
Francisco that followed animal guidelines established by the National
Institutes of Health. Adult marmosets were anesthetized with a mixture
of halothane (2%)-oxygen (48%)-nitrous oxide (50%) to induce a
surgical level of anesthesia. The skin overlying the trachea,
stereotaxic pin sites, and the scalp was injected with a local
anesthetic (lidocaine 2%). Tracheotomy was performed to secure the
airway. An intravenous (iv) cannula was placed into the saphenous vein
for delivery of pentobarbital sodium (15-30 mg/kg iv). The level of
anesthesia was titrated to effect throughout the experiment. Lactated
Ringer solution with 5% dextrose and 20 meq/l KCl was infused (6-8
ml · kg⁻¹ · h⁻¹) to maintain body hydration, metabolic
homeostasis, and cardiovascular function. A third-generation
cephalosporin that crosses the blood-brain barrier was
administered intravenously to prevent infection. A warming blanket with
feedback control was used to maintain proper body temperature. During
the entire experiment, the animal's level of anesthesia, fluid status,
urine output, core temperature, and cardio-pulmonary functions were monitored.
A craniotomy was performed to expose the dura around auditory cortex. The underlying cortex was then exposed by reflecting the dural flap and was maintained under a thin layer of silicone oil. A video image of the cortical zone was captured and stored in a computer to guide the positioning of microelectrode penetrations relative to the vasculature on the surface of the cortex.
Recordings
Double-barrel tungsten microelectrodes coated with parylene
(FHC) with impedances of 1-2 MΩ at 1 kHz were introduced orthogonal to the cortical surface with a hydraulic microdrive (Kopf). In each
animal, responses were sampled from ~50-75 cortical sites per
hemisphere. The microelectrodes were separated by 250-300 µm and
controlled by a single microdrive. All recordings were obtained from
the tonotopically organized marmoset AI. Neuronal activity of small
groups of neurons (multi-units) was recorded from each electrode at
depths of 700-900 µm from the cortical surface, corresponding to
cortical layers IIIb and IV. A window discriminator (BAK DIS-1) was
used to isolate action potentials relative to background noise. Spike
times were recorded in a computer and raw waveforms were stored on
digital tapes for off-line analysis. The goal of these experiments was
to obtain responses from a large number of AI neurons in a single
animal, both for accurate distributed reconstructions and for efficient
usage of animals. Therefore we placed less emphasis on the fact that
all our recordings were multi-units, not single units. Off-line spike
sorting of our data has also been difficult because of the considerable
overlap between spikes from the neurons within our clusters. We
attribute this to the fact that neurons in each cluster appear to
encode similar information about stimuli.
Data analysis
Data collected from 589 sites recorded in left and right
hemispheres in four marmosets (M121: left, 101 sites, and
right, 64 sites; M213: left, 79 sites, and right, 40 sites;
M403: left, 82 sites, and right, 35 sites; M379:
left, 102 sites, and right, 86 sites) are presented. No systematic
differences were observed between hemispheres and data were pooled
across hemispheres in each animal. From the multi-unit response to tone
pips at varying intensities and frequencies, frequency response areas (FRAs) were obtained at all
recording sites, from which the CF, minimum response threshold, and
tuning bandwidths (Q10 and Q40) of each site were measured
(Schreiner and Mendelson 1990). The distribution of
CFs sampled from the four animals in this study is shown in Fig. 2B. This sampling distribution
is well matched to the average spectral energy of all vocalizations
used in the study (Fig. 2A).
|
Response properties to vocalization stimuli were first analyzed by
accumulating the spike counts for each electrode penetration in 1- and
2-ms bins to form peristimulus time histograms (PSTHs). Subsequently,
multi-taper spectral estimation techniques were applied to determine
spectral energies in the PSTHs. This method offers an improvement over
a direct FFT applied to the PSTH data to determine estimates
of spectral power (Thomson 1982). An example of this
analysis is shown in Fig. 3. Figure 3A, top, shows the spike rasters from a multi-unit recording
with CF = 6.97 kHz in response to MOC. Figure 3A,
bottom, shows the PSTH, which illustrates the phasic firing
response to each phrase of the twitter call. Figure 3B shows
a spectral estimate of the PSTH computed for a time-bandwidth factor of
1.5 at frequencies ranging from 0 to 50 Hz, from which a response
strength measure (Rs) was calculated
by averaging the magnitude spectrum over a small band of frequencies
around the vocalization phrase rate, fv. The vocalization phrase rate
indicates the average inter-phrase frequency of each call and was
computed using a procedure similar to that described in Wang et al. (1995). Other peaks in the spectral estimate appear at
harmonics of the vocalization phrase frequency and were not
used in this analysis. The magnitude spectrum is expressed on a dB
scale relative to 1 spike²/s. For this example,
the response strength (Rs) to a call
is 32.21 dB re. 1 spike²/s. A similar
response strength measure was also derived from the PSTHs to SAM and
click-train stimuli, wherein the response strength
(Rs) at each modulation rate was
derived from the mean spectral estimate of the PSTH at the modulation
rate of the SAM stimulus and click train sequence, respectively. The
response synchronization measure used here is similar to the
"synchronized discharge rate" measure developed by Wang et al. (1995). Our analysis differs from that of
Wang and colleagues in two ways. First, we use multi-taper spectral estimation
methods to estimate the spectral power at the phrase frequency. Second,
we express this measure in dB (log) units, typical for any measure of
spectral power.
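The idea behind the response strength measure can be illustrated with a simplified sketch. It does not reproduce the multi-taper estimate used here: a plain single-frequency periodogram stands in for the multi-taper spectrum, and the normalization, the width of the averaging band (±0.5 Hz), and the synthetic PSTH are all assumptions made for illustration.

```python
import cmath
import math


def periodogram_power(psth, bin_s, freq_hz):
    """Power of the mean-removed PSTH at one frequency (plain periodogram)."""
    n = len(psth)
    mean = sum(psth) / n
    acc = sum((x - mean) * cmath.exp(-2j * math.pi * freq_hz * k * bin_s)
              for k, x in enumerate(psth))
    return abs(acc) ** 2 / n


def response_strength_db(psth, bin_s, f_v, half_band_hz=0.5, step_hz=0.25):
    """Average power in a small band around f_v, expressed in dB."""
    steps = int(round(half_band_hz / step_hz))
    freqs = [f_v + i * step_hz for i in range(-steps, steps + 1)]
    mean_power = sum(periodogram_power(psth, bin_s, f) for f in freqs) / len(freqs)
    return 10.0 * math.log10(max(mean_power, 1e-12))  # floor avoids log(0)


# Synthetic PSTH (hypothetical data): 1 s of 1-ms bins, firing rate
# modulated at 8 Hz around 50 spikes/s.
BIN_S = 0.001
psth_example = [50.0 * (1.0 + math.cos(2.0 * math.pi * 8.0 * k * BIN_S))
                for k in range(1000)]
rs_at_phrase_rate = response_strength_db(psth_example, BIN_S, 8.0)
rs_off_rate = response_strength_db(psth_example, BIN_S, 13.0)
print(rs_at_phrase_rate > rs_off_rate)
```

A PSTH that is entrained to the phrase rate concentrates its spectral power near that rate, so the measure is large at fv and small elsewhere. Absolute dB values from this sketch are not comparable to the values reported in the text because the normalization differs.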
|
Two additional measures were used to characterize the entrainment response to vocalizations. First, the mean driven spike rate (MDR) over the entire epoch of the stimulus was determined. Second, the mean driven spike-rate/phrase (MDP) of the stimulus was computed. The MDP was the average number of spikes within a 15- to 40-ms window following the onset of each phrase of the twitter call. To compare these response measures across conditions, one-way ANOVAs were performed with stimulus condition as a factor and the response measures as the dependent variable. Data for these analyses were pooled across animals and electrode penetrations.
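The MDP definition above can be sketched as a windowed spike count. The example below handles a single trial and uses a half-open window (15 ms inclusive, 40 ms exclusive); both choices, the function name, and the spike times are assumptions for illustration.

```python
def mean_spikes_per_phrase(spike_times_ms, phrase_onsets_ms, window_ms=(15.0, 40.0)):
    """Average spike count in a fixed window after each phrase onset."""
    counts = []
    for onset in phrase_onsets_ms:
        start = onset + window_ms[0]
        stop = onset + window_ms[1]
        counts.append(sum(1 for t in spike_times_ms if start <= t < stop))
    return sum(counts) / len(counts)


# Hypothetical single-trial data: three phrases and seven spikes.
onsets_example = [0.0, 130.0, 260.0]
spikes_example = [20.0, 25.0, 50.0, 150.0, 160.0, 165.0, 300.0]
mdp_example = mean_spikes_per_phrase(spikes_example, onsets_example)
print(mdp_example)
```

Here the three windows contain 2, 3, and 0 spikes, so the average is 5/3 spikes per phrase. In practice the counts would also be averaged across stimulus repetitions.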
Population PSTHs, neurograms, and mean-spectral representations were also computed by integrating the responses across electrode penetrations. Population PSTHs were computed by averaging the PSTHs from individual multi-unit recordings across the entire set of recordings obtained from a single animal. Neurograms are linear time-frequency maps of the distributed responses to a particular stimulus. The vertical axis is characteristic frequency, the horizontal axis is time, and the gray scale level encodes the response strength. Neurograms were computed by ordering PSTHs from individual electrode penetrations according to CF. Linear interpolation followed by a one-step three-tap median filtering was then performed at each time bin along the CF dimension to obtain a response map with uniform time-frequency tiling. No smoothing was performed along the time axis. Mean-spectral representation was computed by averaging the neurograms across time. Examples of population PSTHs, neurograms, and mean-spectral representations for MOC are shown in Figs. 5 and 6. Finally, in addition to computing a neurogram, we also computed correlation matrices of the responses in AI to MOC. Each matrix comprises the correlation coefficients between the responses of different neurons with different characteristic frequencies (Fig. 7, A-D). Analogous calculations were also performed on the multi-tapered spectrogram of the stimuli to compute the cross-frequency correlations present in a call (see Fig. 7, E-H).
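The neurogram construction can be sketched as follows. This is a simplified illustration: it orders PSTHs by CF and applies a three-tap median filter along the CF dimension at each time bin, but it omits the linear interpolation onto a uniform CF grid and leaves the two edge rows unfiltered. Those simplifications, the function names, and the toy data are assumptions.

```python
from statistics import median


def neurogram(sites):
    """sites: list of (cf_khz, psth) pairs with equal-length PSTHs.

    Returns the sorted CFs and the CF-ordered, median-filtered response map.
    """
    ordered = sorted(sites, key=lambda site: site[0])
    cfs = [cf for cf, _ in ordered]
    rows = [list(psth) for _, psth in ordered]
    smoothed = [row[:] for row in rows]
    for t in range(len(rows[0])):
        for i in range(1, len(rows) - 1):  # edge rows are left unfiltered
            smoothed[i][t] = median((rows[i - 1][t], rows[i][t], rows[i + 1][t]))
    return cfs, smoothed


def mean_spectral_representation(rows):
    """Average each CF row of the neurogram across time."""
    return [sum(row) / len(row) for row in rows]


# Hypothetical toy data: four sites, three time bins, one outlier bin.
sites_example = [(8.0, [0, 9, 0]), (4.0, [1, 1, 1]),
                 (6.0, [2, 100, 2]), (10.0, [3, 3, 3])]
cfs_example, map_example = neurogram(sites_example)
print(cfs_example)
print(map_example)
```

The median filter suppresses the isolated outlier (100) at the 6-kHz site while leaving the time axis untouched, consistent with the statement that no smoothing was performed along time.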
| |
RESULTS |
|---|
|
|
|---|
Comparison of responses to SAM stimuli, click-trains sequences, and MOC
First, we address the relationship between responses to MOC and
responses to SAM stimuli of comparable modulation rates. The vocalization phrase frequencies were comparable across animals at
7.7 ± 0.53 (SE) Hz. Figure
4B shows the distribution of
strength of response synchronization to MOC (mean
Rs = 13.2 dB), and Fig. 4A
shows a corresponding distribution for the strength of response synchronization to SAM stimuli with a modulation frequency of 8 Hz
(mean Rs = 7.7 dB). A scatter plot
between the responses to MOC and the responses to SAM stimuli across
the population is shown in Fig. 4D. Responses to
vocalizations and to SAM stimuli were correlated (r = 0.45, P < 0.001). At the same time, these data
indicate that only 22% of the variance in the responses to MOC across
the population of neurons recorded in AI can be accounted for by the
temporal modulation transfer function characteristics of the neurons.
Responses to SAM stimuli of comparable modulation frequencies typically
underestimate the strength of the response to vocalizations.
Interestingly, the correlation between a site's best modulation frequency for
SAM and the strength of its response to vocalizations was not
significant (r = 0.03, data not shown). To test
whether the lack of spectral content in SAM tones when compared with
MOC could account for this moderate correlation between SAM stimuli and
MOC, the correlation between the responses to click trains at
comparable periodicities and the responses to MOC was computed. Figure
4C shows the entrainment response strength to click trains
(mean Rs = 0.938). These data
indicated that only 13% of the variance in MOC responses could be
accounted for by the responses to clicks (Fig. 4E).
Furthermore, the responses to click trains
significantly underestimate the responses to MOC.
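The step from a correlation coefficient to "variance accounted for" is the square of r. The sketch below only performs that conversion on the values quoted in the text; the function names are illustrative.

```python
import math


def variance_explained(r):
    """Fraction of variance accounted for by a linear relation with correlation r."""
    return r * r


def r_from_variance(fraction):
    """Magnitude of the correlation implied by a fraction of explained variance."""
    return math.sqrt(fraction)


sam_fraction = variance_explained(0.45)   # SAM vs. MOC correlation from the text
click_r = r_from_variance(0.13)           # 13% of variance for click trains
print(round(sam_fraction, 4), round(click_r, 3))
```

For r = 0.45 this gives roughly 20% of the variance, close to the 22% quoted; the small gap presumably reflects rounding of the reported r, which is an assumption here. The 13% figure for click trains corresponds to a correlation of about 0.36.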
|
Distributed responses to MOC
Several features of the distributed response to a MOC in individual animals were elucidated in population PSTHs, neurograms, and mean-spectral-representations derived from >100 multi-unit recordings in each individual marmoset. Examples of data from two different monkeys (M213 and M403) are shown in Figs. 5 and 6. Population PSTHs, neurograms, and the mean-spectral-representations are plotted in A (for a complete MOC) and C (for the 1st 2 phrases of the MOC). The time-aligned spectrograms, mean temporal envelope and spectral envelopes of the MOC are shown in B (complete call) and D (1st 2 phrases). An examination of the neurograms reveals that the distributed cortical responses represent the main spectrotemporal features of the calls albeit in an abstracted manner (Figs. 5A and 6A). In particular, frequency shifts of peak energy regions and details of intra-phrase and inter-phrase timing are reflected in the neurograms. A more detailed examination indicates that some of the features of the call spectra are distorted in the distributed cortical encoding through long- and short-range response synchronization. For example, the first phrase of the spectrogram of the MOC in Fig. 5D exhibits a chirp. In the neurograms, the timing of this chirp is encoded in the latency of the peak firing-rate across corresponding recording sites (see Fig. 5C). However, the chirp rate representation is compressed in time, an indication of increased short-range synchronization. In general, the dispersion in firing latencies is smaller than the duration of FM sweeps (Figs. 5 and 6). Similarly, piecewise-linear segments of the FM sweep in MOCs appear to be encoded in the synchronous firing of different populations of neurons in AI (Figs. 5C and 6C).
|
|
We also computed the correlation between the responses to MOC of different neurons with different characteristic frequencies. This is shown for the four animals of our study in Fig. 7, A-D. From this analysis, it is clear that long-range synchronizations (>1 octave in CF difference) and short-range correlations are observed across the distributed population of AI in response to a call. The short-range correlations are expressed by the widening of the diagonal in the correlation comparison. Long-range correlations are seen as separate regions off the diagonal in Fig. 7, A-D. Interestingly, both the short-range and long-range correlations between the responses are rather different from across-frequency stimulus-based correlations for three of the four animals studied (see Fig. 7, E-H).
|
In addition to comparisons between the spectrogram and neurogram and to examining the cross-frequency correlations, we also compared the average temporal envelope of the calls with the population PSTH and the average spectral envelope with the mean-spectral representation. The goal here was to examine the encoding of the stimuli in the average temporal and spatial firing patterns across AI, quantified by examining the correlation coefficients as outlined in the following text. For each phrase of a MOC, correlation-coefficients were computed between the population PSTH and the average temporal envelope, and the mean-spectral representation and the spectral envelope of the call, respectively. Given that the correlation-coefficient represents only zero-time-lagged correlations and the fact that a time lag exists for the temporal response to stimulus due to propagation of information to auditory cortex, the maximum of nonzero time lagged correlations between the population PSTHs and the temporal envelope were also computed (see Fig. 8). Typical lag times at which the correlations were maximum were ~20 ms, consistent with conduction and propagation delays in the auditory system. The correlation analyses were conducted on a per-phrase basis as well as on a cumulative-phrase basis. Individual phrase analysis was conducted for a fixed duration of the call and the evoked responses, corresponding to each phrase of the call (Fig. 8A). For cumulative-phrase analysis, the duration of the calls and the responses were incremented to include additional phrases (Fig. 8B). In all animals, the lagged-temporal correlations between the temporal envelope and the population PSTH are significantly higher than the nonlagged correlation both for the individual per phrase and the cumulative phrase analysis conditions. The correlation between the spectral envelope and mean spectral representation (spectral profile of the neurograms) showed low correlation coefficients. 
For later phrases or when including later phrases into the cumulative analysis, the correlation coefficients were either near zero or even negative (Fig. 8). Therefore over the time scale of the call, the temporal envelope of the call is better represented by the population response of AI neurons than the spectral envelope.
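The lagged-correlation analysis can be illustrated with a short sketch: shift the population PSTH relative to the temporal envelope, compute the Pearson correlation at each lag, and keep the maximum. The search range, the 1-ms bin width, and the synthetic envelope below are assumptions, not the parameters used in the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_lag(envelope, psth, max_lag_bins):
    """Return (lag, r) maximizing the correlation when the PSTH trails the envelope."""
    best = (0, pearson(envelope, psth))
    for lag in range(1, max_lag_bins + 1):
        r = pearson(envelope[:-lag], psth[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical data: a half-wave rectified 8-Hz envelope in 1-ms bins and
# a "response" that is the same envelope delayed by 20 bins (20 ms).
envelope_example = [max(0.0, math.sin(2.0 * math.pi * 8.0 * k / 1000.0))
                    for k in range(500)]
psth_lagged = [0.0] * 20 + envelope_example[:-20]
lag_example, r_example = best_lag(envelope_example, psth_lagged, 50)
print(lag_example, round(r_example, 3))
```

In this toy case the zero-lag correlation is well below the lagged maximum, which is recovered at the imposed 20-ms delay. This mirrors why the lagged correlations reported above exceed the nonlagged ones.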
|
Responses to spectral envelope degradations
The relevance of details in the spectral envelope for the coding of complex sounds was further tested by gradually degrading the spectral information present in the calls. The synchronous firing of responses in AI to marmoset calls did not decrease when spectral envelopes were degraded by reducing the number of bands from 16 to 2 and by randomization of the carrier phase. Figure 9 illustrates examples of population PSTHs and corresponding neurograms from the entire sample of neurons recorded in one animal (M121). Qualitatively, despite variations in the responses of individual neurons across the population, the spectral pattern of the neurograms and, in particular, the synchronized temporal discharge pattern across the population did not differ considerably as the number of bands was reduced (Fig. 9, A-D).
|
Three measures were used to quantify the responses: MDR across the
entire duration of the call, response strength
(Rs), and mean driven-rate/phrase
(MDP; Fig. 10). These measures are
shown both for the MOC and when the number of bands of the spectral envelope degradations was reduced from 16 to 2. Although two of the
three measures, the MDR and Rs, were
significantly different across these stimulus conditions
(P < 0.01 and P < 0.05, respectively), mean differences across these conditions were small. For
example, the mean difference in MDR was only 2-3 spikes/s, and the mean difference in Rs was ~2-3
spikes²/s.
|
Responses to temporal envelope degradations
In contrast to responses to calls with spectral envelope degradations, the neural representation was altered dramatically by degradations in the temporal envelope when the carrier and phase structure was unchanged. An example of the effects of temporal envelope degradations is shown in Fig. 11 (M379). Qualitatively, the synchronized response was markedly diminished for calls where the temporal envelope was high-pass filtered at 60 Hz (HP60, Fig. 11D), low-pass filtered at 10 Hz (LP10, Fig. 11C), and low-pass filtered at 4 Hz (LP4, Fig. 11B). However, there was no reduction in the synchronous response profile to each phrase of the call when the temporal envelope was band-pass filtered between 2 and 30 Hz (BP2-30, Fig. 11A), compared with MOC.
|
To quantify the responses across these different conditions, the three measures MDR (Fig. 12A), Rs (Fig. 12B) and MDP (Fig. 12C) were computed. MOC responses are greater than the responses to LP04, LP10, and HP60 calls (P < 0.00001 for all 3 measures). However, the responses to BP2-30 calls were found to be slightly greater than the responses to MOC (P < 0.0001 for all 3 measures).
|
Response to MOC in the presence of background noise
Presenting MOC against a continuous noise background (SNR ≥ 10 dB) does not degrade the spectrotemporal relationships across units. Population data from the four animals studied are illustrated in Figs. 13 and 14 for calls presented in the presence of background noise at 20 (Figs. 13, A and C, and 14, A and C), 10 (Figs. 13, B and D, and 14B), and 0 dB SNR (Fig. 14D). Qualitatively, the population PSTHs do not show a significant reduction under noise levels of 20 or 10 dB SNR (see Figs. 13, B and D, and 14B). However, the responses at 0 dB SNR clearly indicate a marked and significant reduction (P < 0.001) in the synchronous response to each phrase of the call (Fig. 14D). Nevertheless, even at the lowest tested SNRs, the across-frequency short- and long-range correlation structure is only moderately altered (Fig. 15).
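For readers unfamiliar with the SNR values quoted here, the following sketch shows one common way to mix a signal with noise at a target SNR in dB. It is an illustration only: the noise type (white Gaussian), the definition of SNR as the ratio of mean powers over the whole call, and the function name `add_noise_at_snr` are assumptions, since this section does not specify how the background noise was generated or calibrated.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Return (signal + noise, noise) with noise scaled to the requested SNR.

    Assumption: SNR = 10*log10(mean signal power / mean noise power).
    """
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(signal))
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return signal + noise, noise

# Stand-in waveform (a plain sinusoid, not a twitter call).
call = np.sin(2 * np.pi * np.arange(1000) / 50)
noisy_20, noise_20 = add_noise_at_snr(call, 20.0, np.random.default_rng(0))
noisy_0, noise_0 = add_noise_at_snr(call, 0.0, np.random.default_rng(0))
```

Under this definition, 0 dB SNR means the noise carries as much mean power as the call itself, and 20 dB means the noise power is one hundredth of the call power.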
A closer examination of the neurograms revealed that individual differences exist in the responses of the AI neurons to MOC under background noise. Surprisingly, some neurons respond more strongly to calls under noisy conditions than in silence, as illustrated by units with CFs >10 kHz (compare Fig. 5 with Fig. 13, C and D). About 10-15% of units in every animal have greater synchronous firing rates evoked by calls presented in noise than by calls in silence.
Comparisons of responses to spectral and temporal envelope degradations and additive noise
Comparisons among responses to degraded forms of vocalizations and additive noise to MOC (SNR 20 dB) are quantified using normalized variants of the three measures, MDR, MDP, and Rs. These summary measures of parametrically varied experimental conditions are normalized to the MOC responses in silence. Figure 16 illustrates the differences. Three observations emerge from this normalized data analysis. First, there is a small but significant reduction (Fig. 16, A-C, P < 0.00001) for all three measures in the response within each recording site to degradations of the spectral envelope when compared with the response to MOC. Second, the reductions in responses to temporal-envelope degraded calls (LP04, LP10, and HP60) are similar to the corresponding reductions in Rs and MDP to MOC presented at SNR = 20 dB (P > 0.05 for both measures, Fig. 16, B and C). Third, the response to the BP2-30 Hz call is consistently and significantly greater than the response to MOC (P < 0.001 for all measures, Fig. 16, A-C).
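The normalization described above can be sketched as a per-site ratio to the reference condition. This is one plausible reading rather than the authors' stated procedure: the text says only that measures are normalized to the MOC responses in silence, so the use of a ratio (rather than, say, a difference), the function name `normalize_to_reference`, and the numbers below are assumptions and placeholders.

```python
import numpy as np

def normalize_to_reference(measures, reference="MOC"):
    """Divide each condition's per-site values by the reference condition's values."""
    ref = np.asarray(measures[reference], dtype=float)
    if np.any(ref == 0):
        raise ValueError("reference response is zero at one or more sites")
    return {name: np.asarray(vals, dtype=float) / ref
            for name, vals in measures.items()}

# Placeholder values for three hypothetical recording sites (not data from the study).
mdr = {"MOC": [20.0, 10.0, 40.0], "LP4": [10.0, 5.0, 30.0]}
normalized = normalize_to_reference(mdr)
```

With a ratio of this kind, a value of 1 means a condition evoked the same response as MOC at that site, values below 1 indicate a reduction, and values above 1 indicate an enhancement.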
| |
DISCUSSION |
|---|
|
|
|---|
Six major findings of this paper clarify and extend our
understanding of the distributed cortical encoding of complex
species-specific vocalizations within AI of barbiturate-anesthetized
marmosets. These findings can be summarized as follows. First,
responses to twitter vocalizations are correlated with responses to
amplitude-modulated tonal stimuli of comparable modulation rates.
Second, both long- and short-range synchronization across the frequency
axis are observed in the spectrotemporal responses to complex
vocalizations. Third, the temporal envelope of complex vocalizations is
significantly correlated with the temporal response patterns of AI
neurons. In contrast, the spectral envelope of a call is poorly
represented in the mean-spectral representation of AI responses.
Fourth, response magnitudes in AI neurons are relatively insensitive to
the fidelity of the spectral-envelope characteristics of the call.
Fifth, responses of AI neurons and their distributed synchronizations
are significantly reduced when the temporal envelope is degraded. These
data indicate that AI neurons in the barbiturate-anesthetized marmoset
monkeys mainly respond to, and represent by their synchronized
discharges, the temporal cues in complex stimuli. Finally, responses of
AI neurons to MOC and conspecific calls are robust in the presence of
additive background noise (SNR ≥ 0 dB).
Neurophysiological data in this study parallel human behavioral data on speech intelligibility with degraded forms of speech. Several studies of human speech perception have revealed that slow fluctuations in the speech spectrum, manifested as low-frequency temporal modulations (<16 Hz) in narrowband channels, are sufficient for intelligibility (Drullman 1995b; Drullman et al. 1994a,b; Shannon et al. 1995). Use of relatively slow modulations reduces the sensitivity of both listeners and artificial speech recognition devices to background noise, reverberation, and transient interference (Kingsbury et al. 1998). Such human behavioral data are similar to the sensitivity of marmoset AI neurons to slow modulation of the temporal envelope of the twitter call.
Shannon et al. (1995) performed degradations in speech similar to our spectral envelope degradations. Their studies have shown that speech intelligibility is preserved as long as temporal envelope cues are present, despite gross spectral degradation to as few as four broad frequency bands. A significant reduction in speech intelligibility was observed when only two bands were used. This behavioral correlate is not directly evident in our recordings from AI neurons, potentially due to differences in the acoustics and perception between monkey calls and speech. However, a small but consistent tuning in the response measures is observed across the spectral degradation continuum, with a nadir for a four-band stimulus form as indicated in the normalized analysis of Fig. 16. Whether this tuning is sufficient to account for behavioral data of speech intelligibility in humans remains to be further investigated. It should be kept in mind, though, that the spectral degradation has effects other than response magnitude reduction, with potential perceptual consequences. Most notably (see Fig. 9), fine details of the onset timing of phrases across CF are eliminated by limiting the spectral information to a small number of bands. This loss of fine timing distinctions between information in different frequency channels may have a greater perceptual impact than differential changes in firing rate.
From Fig. 16 it is clear that, although the response magnitudes show moderate sensitivity to spectral envelope features, the dominant feature of sensitivity within the twitter call is the temporal envelope. A reduction in the number of channels reduces temporal differentiation among different subpopulations in accordance with human behavioral data (Shannon et al. 1995) that suggest nonsynchronous temporal details in more than two channels are necessary for intelligibility. The lack of sensitivity to spectral envelope features reported here might lead to a misconception that these results are in conflict with ripple frequency tuning observed in auditory cortex of cats and ferrets (Calhoun and Schreiner 1998; Kowalski et al. 1996a,b; Versnel and Shamma 1998). Although the dynamic range of ripple tuning remains to be established in monkeys, it should be noted that spectral envelope degradations can be conceived as re-sampling the spectral envelope. Such a spectral envelope resampling will reduce higher ripple densities in the spectral envelope by aliasing to low-ripple densities while not changing the ripple phase or peak frequencies. Therefore our results do not reflect aspects of ripple-frequency tuning of AI neurons.
The observed temporal envelope sensitivity exhibited by AI neurons is analogous to Drullman et al.'s (1994) psychophysical results on the intelligibility of speech with degraded temporal envelopes. For example, the reductions in cortical responses for LP4 and HP60 calls are consistent with the behavioral data indicating a decrement in speech intelligibility when similar degradations are performed with speech. Furthermore, a significant increase in the response to BP2-30 calls in comparison to MOC suggests that these envelope frequencies are indeed most relevant for the robustness and fidelity of the representation of phrase synchronous responses. Similar band-pass filtering of the temporal envelope of speech has been found to contribute to significant learning and enhanced speech comprehension in language-learning-impaired children (Tallal et al. 1996).
However, there are interesting differences between physiological and psychophysical observations as well. Drullman and colleagues report no significant reduction in speech intelligibility to LP10 and BP2-30 modifications, whereas we observe a significant reduction in AI responses for the LP10 condition. It was indeed surprising that the LP10 calls contribute to significant reductions in AI responses because the call has energy well within the modulation transfer function characteristics of most AI neurons. These observations, in combination with our findings with SAM stimuli, further underscore the sensitivity of the AI responses to higher-frequency spectral components and nonlinear onset characteristics of species-specific vocalizations.
Responses to SAM stimuli account for only 22% of the variance of the responses to calls, and responses to click stimuli account for only 13% of the variance in the response to calls. Further, the fact that responses to MOC were significantly greater than to SAM stimuli and to click-train sequences at comparable periodicities indicates that repetition rate coding and onset spectrum effects are separable aspects of temporal envelope processing.
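To put these percentages in more familiar terms, the following worked conversion assumes that "variance accounted for" is the squared Pearson correlation coefficient from a simple linear fit. That is an assumption; this section does not state how the values were computed.

```latex
\begin{align*}
r^2_{\mathrm{SAM}} &= 0.22 \;\Rightarrow\; |r_{\mathrm{SAM}}| = \sqrt{0.22} \approx 0.47 \\
r^2_{\mathrm{click}} &= 0.13 \;\Rightarrow\; |r_{\mathrm{click}}| = \sqrt{0.13} \approx 0.36
\end{align*}
```

Under that assumption, both stimulus classes correlate only moderately with call responses and leave roughly 78% and 87% of the variance unexplained, respectively.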
Twitter vocalizations appear to have two time scales in their structure. A fast time scale (10-30 ms) marks individual phrases or syllables; a slower time scale marks inter-phrase intervals. The data from this study indicate that AI neurons are sensitive to the slower time scales in twitter calls. However, the representation of faster time scales appears to be in the form of short- and long-range synchronizations between neuronal populations within AI.
The robustness of the AI representations of vocalizations to additive background noise (≥0 dB SNR) is consistent with psychophysical observations of speech comprehension under these low signal-to-noise conditions. Furthermore, the additive noise data serve as a calibration to determine the sensitivity of responses to spectral and temporal envelope degradations. Response synchronization was generally attenuated by additive noise, although response patterns were preserved. Surprisingly, subpopulations of neurons did not change their responses to additive noise, and some neurons (10-15% in each animal) increased their response synchronization to the call during increased background noise. The mechanisms underlying this surprising robustness to additive noise remain poorly understood and will be the focus of future investigations.
The data reported here were obtained from animals anesthetized with pentobarbital, and one must be cautious in generalizing these results to cortical activity in awake marmosets (Gaese and Ostwald 2001; Zurita et al. 1994). The evidence so far indicates that the spatial organization of auditory cortex is similar in anesthetized and awake animals (Brugge and Merzenich 1973; Lu et al. 2001; Pelleg-Toiba and Wollberg 1989; Recanzone 2000; Recanzone et al. 1999, 2000). In the temporal domain, the frequencies at which responses of AI neurons can follow AM stimuli were found to be similar in anesthetized and awake animals, although awake animals had a higher cutoff for synchronous responses (Creutzfeldt et al. 1980; Lu and Wang 2000; Schreiner and Urbas 1988). Recordings of SAM responses from awake squirrel monkeys show a similar range of following, although upper cutoff frequencies of the modulation transfer functions (MTFs) may be shifted to higher values (Bieser 1998; Bieser and Müller-Preuss 1996). Our findings that the best-modulation frequency of neurons was not correlated with the responses to vocalizations further suggest that similar findings could be obtained in awake recordings. Also, in awake animals, the spontaneous firing rates of AI neurons are generally higher and sustained responses are more apparent (deCharms et al. 1998; Lu and Wang 2000; Recanzone 2000; Recanzone et al. 2000), but the synchronous and coherent distributed responses shown here should still be evident in both awake and barbiturate-anesthetized animals (De Ribaupierre et al. 1972; Goldstein et al. 1968; Schwarz and Tomlinson 1990). These findings further support the conclusion that in AI, single neurons do not exhibit specificity to the monkey's own call; rather, they indicate sensitivity to specific acoustic features of the temporal envelope within the call. It is clear that, across the AI neuronal population, stimulus features such as FM sweeps are abstracted in representation by synchronous, discharging cell assemblies. The degree of synchronization depends on the idiosyncratic spectrotemporal acoustic features of the vocalization. These abstractions constitute a form of the "neuronal" translation of the acoustic features of a call. Presumably these abstracted representations may lead to downstream hierarchical organization of "call-specific" detectors or "cell assemblies," analogous to higher-order neurons in birds, bats, and monkeys (Doupe and Konishi 1991; Fitzpatrick et al. 1993; Margoliash and Fortune 1992; Suga 1972, 1989; Suga et al. 1983, 1990; Theunissen and Doupe 1998; Tian et al. 2001).
In summary, this study further elucidates the specific aspects of the encoding in AI and forms of transmission of such information downstream from AI. Specifically, we provide direct evidence that AI response profiles to complex species-specific vocalizations are robust to additive noise, sensitive to temporal envelope features, and relatively insensitive to the details of spectral envelope features. Interestingly, in humans, speech encoding with primarily temporal envelope features is also robust at moderate levels of background noise (Drullman 1995a; Kingsbury et al. 1998). These data suggest that cortical responses to vocal stimuli may contribute to the recognition of complex stimuli, including speech, under naturalistic noisy conditions.
| |
ACKNOWLEDGMENTS |
|---|
We gratefully acknowledge support from the National Institutes of Health (NS-10414, NS-34835, NSF-SBR 9720398, and National Research Service Award fellowship F32-DC00285), Veterans Affairs Medical Research (S. W. Cheung), Deafness Research Foundation (S. S. Nagarajan), the Coleman Fund, and Hearing Research Inc.
| |
FOOTNOTES |
|---|
Address for reprint requests: S. S. Nagarajan, Dept. of Bioengineering, University of Utah, 20S 2030 East BPRB 506D, Salt Lake City, UT 84112-9458 (E-mail: sri{at}utah.edu).
Received 31 July 2001; accepted in final form 3 December 2001.
| |
REFERENCES |
|---|
|
|
|---|
neural responses to amplitude-modulated sounds. Exp Brain Res 108: 273-284, 1996.