Journal of Neurophysiology

Acoustic Features of Rhesus Vocalizations and Their Representation in the Ventrolateral Prefrontal Cortex

Yale E. Cohen, Frédéric Theunissen, Brian E. Russ, Patrick Gill

Abstract

Communication is one of the fundamental components of both human and nonhuman animal behavior. Auditory communication signals (i.e., vocalizations) are especially important in the socioecology of several species of nonhuman primates such as rhesus monkeys. In rhesus, the ventrolateral prefrontal cortex (vPFC) is thought to be part of a circuit involved in representing vocalizations and other auditory objects. To further our understanding of the role of the vPFC in processing vocalizations, we characterized the spectrotemporal features of rhesus vocalizations, compared these features with other classes of natural stimuli, and then related the rhesus-vocalization acoustic features to neural activity. We found that the range of these spectrotemporal features was similar to that found in other ensembles of natural stimuli, including human speech, and identified the subspace of these features that would be particularly informative to discriminate between different vocalizations. In a first neural study, however, we found that the tuning properties of vPFC neurons did not emphasize these particularly informative spectrotemporal features. In a second neural study, we found that a first-order linear model (the spectrotemporal receptive field) is not a good predictor of vPFC activity. The results of these two neural studies are consistent with the hypothesis that the vPFC is not involved in coding the first-order acoustic properties of a stimulus but is involved in processing the higher-order information needed to form representations of auditory objects.

INTRODUCTION

Communication is one of the fundamental components of both human and nonhuman animal behavior (Hauser 1997). Although the benefits and importance of language in human evolution are obvious (Carruthers 2002; Hauser 1997; Lieberman 2002), other nonhuman communication systems are also important: for most, if not all, species, they are critical to survival (Andersson 1996; Bennett et al. 1997; Greenfield 2002; Hauser 1997; Lau et al. 2000; Mech and Boitani 2003).

For example, auditory communication signals (i.e., species-specific vocalizations) are especially important in the socioecology of several species of nonhuman primates (Cheney and Seyfarth 1985; Eimas 1994; Eimas et al. 1971; Hauser 1997; Jusczyk 1997; Jusczyk et al. 1983; Miller and Eimas 1995), such as rhesus monkeys (Macaca mulatta). Vocalizations convey information about the identity and the age of the caller and often provide information about sex and emotional or motivational state (Cheney and Seyfarth 1990; Hauser 1997). Some vocalizations transmit information about objects and events in the environment (Gifford 3rd et al. 2003; Hauser 1998; Seyfarth and Cheney 2003).

In rhesus monkeys, the ventrolateral prefrontal cortex (vPFC) plays an important role in processing vocalizations (Hackett et al. 1999; Romanski and Goldman-Rakic 2002; Romanski et al. 1999, 2005). The vPFC is thought to be part of a circuit involved in representing auditory objects (Cohen et al. 2004b; Rauschecker 1998; Romanski et al. 1999, 2005). In particular, the vPFC may be part of a circuit that processes socially meaningful signals (Cohen et al. 2006; Deacon 1992; Gifford 3rd et al. 2005).

A more comprehensive understanding of vocalization processing in the vPFC requires that we understand the acoustic features of rhesus vocalizations and how these features relate to neural activity. We first characterized the acoustic structure of rhesus vocalizations by calculating their modulation spectra; the modulation spectrum quantifies the spectral and temporal features of an auditory stimulus as seen in a spectrographic representation. The structure of these spectra was similar to that found in other ensembles of natural stimuli. Next, we tested whether the tuning of vPFC neurons is designed to maximize the acoustic differences that exist between vocalizations; this type of tuning is hypothesized to facilitate an animal's capacity to discriminate between different vocalizations (Woolley et al. 2005). Finally, using vocalizations, we estimated the spectrotemporal receptive field (STRF) of vPFC neurons to test whether the responses of vPFC neurons are modulated preferentially by the first-order (linear) acoustic features of an auditory stimulus. The results of these two neurophysiological studies suggest that vPFC neurons are not modulated preferentially by these features.

METHODS

First, we describe the analyses that quantified the acoustic properties of rhesus vocalizations. Next, we describe the experimental procedures (i.e., stimulus arrays, stimulus sets, behavioral tasks, recording procedures, and data analysis) that we used 1) to test how vPFC neurons respond to noise with different band-limited spectrotemporal modulations and 2) to characterize the STRFs of vPFC neurons. When the experimental procedures differ, we demarcate those procedures related to the band-limited study as the “Noise procedures” and those procedures related to the STRF study as the “STRF procedures.”

Acoustic properties of rhesus vocalizations: modulation spectrum

The spectrotemporal modulations that are present in complex sounds, such as vocalizations, can be characterized by generating a modulation power spectrum (or modulation spectrum) (Singh and Theunissen 2003; Theunissen et al. 2004). Analogous to decomposing an acoustic waveform into a series of sine waves, a (log) spectrographic representation of an auditory stimulus can be decomposed into a series of sinusoidal gratings that characterizes the temporal modulations (in Hertz, Hz) and the spectral modulations (in cycles per Hz or octave) of the stimulus. The two-dimensional plot that illustrates the squared amplitude of the temporal- and spectral-modulation rates of a sound is the modulation spectrum. Modulation spectra can be calculated for a single stimulus or can be averaged with other stimuli to characterize the statistics of a particular class of stimuli.

In this study, we calculated the modulation spectra of rhesus vocalizations. The vocalizations were recorded and digitized as part of an earlier set of studies (Hauser 1998). Each vocalization was assigned to one of 10 major classes. These classes are defined based on both their acoustic similarities and their behavioral significance. Our data set contains exemplars from each of these classes: 57 aggressives, 23 coos, 32 copulation screams, 24 geckers, 42 grunts, 25 girneys, 19 harmonic arches, 46 screams, 20 shrill barks, and 4 warbles. We did not attempt to cluster the sound exemplars independently on the basis of their acoustical signatures, as defined by their modulation spectra, because we wanted to preserve the behavioral information, although our acoustical analyses could, in theory, also be used for such a classification task.

The first step in the estimation of the modulation spectrum is to calculate the spectrographic representation for each vocalization exemplar. The spectrographic representation was obtained with a filter bank of Gaussian-shaped filters whose gain function had a bandwidth of 32 Hz (measured as a SD). We used 299 filters that had center frequencies ranging from 32 Hz to 10 kHz. The corresponding Gaussian-shaped windows in the time domain had a temporal bandwidth of 5 ms. These parameters defined the time–frequency scale of the spectrogram and the upper limits of the spectral- and temporal-modulation frequencies that could be characterized by the spectrogram: 16.25 cycles/kHz and 100.5 Hz, respectively (Singh and Theunissen 2003). These filter parameters resulted in very little energy at the edge of the modulation spectrum as determined by the time–frequency scale of the spectrogram. This observation suggests that we were able to capture most of both the temporal and spectral fluctuations in the sounds with a single time–frequency scale.

These time–frequency scales differ somewhat from our previous studies (Singh and Theunissen 2003). In our initial characterization of the statistics of natural sounds, we used time–frequency scales corresponding to wider frequency filters (62, 125, and 250 Hz). These scales are appropriate for characterizing sounds with fast temporal modulations, such as zebra finch song and other environmental sounds. However, these scales do not allow for a characterization of fine spectral modulations. In the current study, we used a narrower filter (32 Hz) because a finer spectral modulation resolution was needed to discriminate between the characteristic spectrotemporal features of rhesus vocalizations and also to compare them with those found in human speech.

To calculate the modulation spectrum of a vocalization, we calculated the two-dimensional Fourier transform of each vocalization's log spectrogram. If a vocalization was <1 s, it was zero padded until its length was 1 s. If a vocalization was >1 s, the two-dimensional Fourier transform was calculated for nonoverlapping 1-s segments. Each 1-s segment was windowed with a Hamming window (Oppenheim et al. 1983). The modulation spectrum was calculated by averaging the power (amplitude squared) of the two-dimensional Fourier transform. We report the class-based spectra, which were calculated by averaging the individual modulation spectra from each exemplar within each acoustic class. We also calculated the “composite” modulation spectrum, which is the average of the 10 class-based spectra, and the coefficient of variance between the class-based spectra.
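For concreteness, the following MATLAB sketch outlines this calculation for a single exemplar x. The sampling rate, the use of MATLAB's spectrogram function as a stand-in for the Gaussian filter bank, and the 1-ms hop are assumptions rather than the exact implementation used in the study.

```matlab
% Minimal sketch of the modulation-spectrum calculation for one waveform x (fs Hz).
fs     = 24414;                                   % assumed sampling rate (Hz)
sigmaT = 1 / (2*pi*32);                           % ~5-ms time SD paired with a 32-Hz frequency SD
tWin   = -4*sigmaT : 1/fs : 4*sigmaT;
win    = exp(-tWin.^2 / (2*sigmaT^2));            % Gaussian analysis window
hop    = round(fs/1000);                          % 1-ms hop between spectrogram columns
cf     = linspace(32, 10000, 299);                % 299 center frequencies, 32 Hz to 10 kHz
S      = spectrogram(x, win, numel(win)-hop, cf, fs);
logS   = log(abs(S) + eps);                       % log-amplitude spectrogram
seg    = logS(:, 1:min(1000, end));               % first 1-s segment (1,000 columns at 1-ms hop)
if size(seg, 2) < 1000
    seg = [seg, zeros(size(seg, 1), 1000 - size(seg, 2))];   % zero pad short vocalizations
end
taper  = hamming(size(seg, 1)) * hamming(size(seg, 2))';     % 2-D Hamming window
MPS    = abs(fftshift(fft2(seg .* taper))).^2;               % modulation power spectrum
% Class-based spectra are obtained by averaging MPS across segments and exemplars of a class.
```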

For comparison, we calculated the modulation spectrum for zebra finch song, human speech, and environmental sounds. Zebra finch song was recorded from adult male zebra finches in the laboratory of Dr. F. E. Theunissen. The human-speech exemplars were obtained from a database of 100 short English sentences spoken by native male and female American English speakers (Tyler and Preece 1990). Environmental sounds were natural sounds produced not by animal vocalizations but by weather, water, or fire; these sounds were obtained from commercial audio CD recordings. The same sounds were previously analyzed in more detail (Singh and Theunissen 2003). Similarities between modulation spectra were estimated using the correlation coefficient (r). Classical multidimensional scaling, with 1 − r as the distance metric, was used to visualize the pairwise distances between the modulation spectra. All calculations were done in the MATLAB (The MathWorks, Natick, MA) programming environment.
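A minimal sketch of this comparison, assuming the modulation spectra are stored in a cell array named spectra (all matrices the same size), is shown below; cmdscale is MATLAB's classical multidimensional scaling routine (Statistics Toolbox).

```matlab
% Pairwise correlations between modulation spectra and a 2-D MDS representation, as in Fig. 4B.
nClass = numel(spectra);
R = eye(nClass);
for i = 1:nClass
    for j = i+1:nClass
        c = corrcoef(spectra{i}(:), spectra{j}(:));
        R(i, j) = c(1, 2);
        R(j, i) = c(1, 2);
    end
end
D = 1 - R;                        % dissimilarity used as the distance metric
Y = cmdscale(D);                  % classical multidimensional scaling
plot(Y(:, 1), Y(:, 2), 'o');      % first 2 dimensions of the configuration
```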

Neural recordings and analysis

SUBJECTS.

Two female rhesus monkeys (Macaca mulatta) were used in these experiments. Both monkeys were trained on the task described in this study. They weighed between 8.0 and 9.0 kg. All surgical, recording, and training sessions were in accordance with the National Institutes of Health's Guide for the Care and Use of Laboratory Animals and were approved by the Dartmouth Institutional Animal Care and Use Committee. Neither monkey had been operantly trained to make behavioral responses to auditory stimuli.

SURGICAL PROCEDURES.

Surgical procedures were conducted under aseptic, sterile conditions, using general anesthesia (isoflurane). These procedures were performed in a dedicated surgical suite operated by the Animal Resource Center at Dartmouth College.

In the first procedure, titanium bone screws were implanted in the skull and a methylmethacrylate implant was constructed. A Teflon-insulated, 50-gauge stainless steel wire coil was also implanted between the conjunctiva and the sclera; the wire coil allowed us to monitor the monkey's eye position (Judge et al. 1980). Finally, a head-positioning cylinder (FHC-S2; Crist Instruments, Hagerstown, MD) was embedded in the implant. This cylinder connected to a primate chair and stabilized the monkey's head during behavioral-training and recording sessions.

After the monkeys learned the passive-listening task (see following text), a craniotomy was performed and a recording cylinder (ICO-J20, Crist Instruments) was implanted. This surgical procedure provided chronic access to the vPFC for neurophysiological recordings.

EXPERIMENTAL SETUP.

Behavioral training and recording sessions were conducted in a room with sound-attenuating walls. The walls of the room were covered with anechoic foam insulation (Sonomatt; Auralex, Indianapolis, IN). While inside the room, the monkeys were seated in the primate chair and placed in front of a stimulus array; because the room was darkened, the speaker producing the auditory stimuli was not visible to the monkeys. The primate chair was placed in the center of a 1.2-m-diameter, two-dimensional, magnetic coil (C-N-C Engineering, Seattle, WA) that was part of the eye-position monitoring system (Judge et al. 1980). Eye position was sampled with an A/D converter (PXI-6052E; National Instruments, Austin, TX) at a rate of 1.0 kHz. The monkeys were monitored during all sessions with an infrared camera.

STIMULUS ARRAY.

Auditory stimuli were presented from a speaker (PLX32; Pyle, Brooklyn, NY) that was 1.2 m in front of the monkey; the speaker was 1.2 m above the floor, which was at the monkeys' eye level. A red light-emitting diode (LED; model 276–307; Radio Shack, Fort Worth, TX) was also mounted on the face of this speaker. This “central” LED served as a fixation point for the monkeys during the passive-listening task (see following text). The central LED subtended <0.2° of visual angle and had a luminance of 12.6 cd/m2.

BEHAVIORAL (PASSIVE-LISTENING) TASK.

During the passive-listening task, an auditory stimulus was presented from the speaker 1,000–1,500 ms after the monkey fixated the central LED. To minimize any potential changes in neural activity arising from changes in eye position (Cohen and Andersen 1998; Groh et al. 2001; Mullette-Gillman et al. 2005), the monkeys maintained their gaze on the central LED during auditory-stimulus presentation and for an additional 1,000–1,500 ms after auditory-stimulus offset to receive a juice reward.

AUDITORY STIMULI.

The auditory exemplars from the Noise and the STRF procedures were recorded to disk. Each exemplar was filtered to compensate for the transfer-function properties of the speaker and the acoustics of the room. Each filtered exemplar was converted with a D/A converter (DA1; Tucker-Davis Technologies, Alachua, FL), amplified (SA1, Tucker-Davis Technologies; MPA-250, Radio Shack), and transduced by the speaker. Each exemplar was presented at an average sound level of 65 dB SPL (sound pressure level, relative to 20 μPa).

Noise procedures.

The band-limited noise was designed to cover the range of spectrotemporal modulations found in the rhesus vocalizations (see results). To cover this range, we generated 16 classes of noise with different band-limited spectrotemporal modulations (see Fig. 1). Along the temporal-modulation axis, the band-limited noise covered the range from 0 to 20 Hz in 5-Hz steps. Along the spectral-modulation axis, the noise covered the range from 0 to 4 cycles/kHz in steps of 1 cycle/kHz. Note that all of the noise stimuli contained both positive and negative temporal modulations, corresponding to a mixture of upward and downward sweeps (Singh and Theunissen 2003). The band-pass filters were applied simultaneously in both the temporal and spectral dimensions. For example, a class of noise might consist of sounds with temporal modulations between 10 and 15 Hz (as well as between −10 and −15 Hz) and spectral modulations between 2 and 3 cycles/kHz (see Fig. 1G). Within each class of noise, we generated 10 exemplars. The duration of each exemplar was 500 ms, which approximated the mean duration of our digitized collection of rhesus vocalizations. Each of the 10 exemplars from each of the 16 noise classes was generated before the recording sessions and recorded to disk.
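The sketch below illustrates one way to restrict spectrotemporal modulations to a single band, working in the (log) spectrogram domain only. All sizes and bin widths are assumptions, and the spectrogram-inversion step needed to convert the band-limited log-spectrogram into an actual waveform is omitted.

```matlab
% Band-limit the spectrotemporal modulations of noise to 10-15 Hz (temporal)
% and 2-3 cycles/kHz (spectral), both signs, as in the example above.
nF = 200; nT = 500;                               % frequency x time bins
df = 0.05; dt = 1e-3;                             % 0.05-kHz and 1-ms bins (500-ms exemplar)
wf = ifftshift((-nF/2 : nF/2-1)') / (nF*df);      % spectral-modulation axis (cycles/kHz)
wt = ifftshift((-nT/2 : nT/2-1))  / (nT*dt);      % temporal-modulation axis (Hz)
[WT, WF] = meshgrid(wt, wf);
band = abs(WT) >= 10 & abs(WT) <= 15 & ...
       abs(WF) >= 2  & abs(WF) <= 3;              % symmetric mask: upward and downward sweeps
S = fft2(randn(nF, nT));                          % white noise in the modulation domain
S(~band) = 0;                                     % band-pass in both modulation dimensions
logSpec = real(ifft2(S));                         % band-limited log-spectrogram
```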

FIG. 1.

Band-limited noise. Each panel illustrates the average modulation spectrum of one of 16 classes of band-limited noise. Each class consisted of 10 exemplars of noise and the 16 classes tile the region of acoustic space encompassed by rhesus vocalizations. Power density is indicated by color using a decibel (dB) scale, with red showing the spectrotemporal modulations with the most energy, as shown on the color bar.

STRF procedures.

The vocalizations were recorded and digitized as part of an earlier set of studies (Hauser 1998). We presented a single exemplar from each of the 10 major acoustic classes of rhesus vocalizations (Hauser 1998). The spectrograms of these 10 exemplars are shown in Fig. 2. These vocalizations came from the same data set that was used to quantify the acoustic properties of rhesus vocalizations (see Acoustic properties of rhesus vocalizations: modulation spectrum).

FIG. 2.

Spectrographic representations of an exemplar from each of the 10 classes of rhesus' species-specific vocalizations. A: aggressive. B: copulation scream. C: grunt. D: coo. E: gecker. F: girney. G: harmonic arch. H: shrill bark. I: scream. J: warble.

RECORDING PROCEDURES.

Single-unit extracellular recordings were obtained with tungsten microelectrodes (FHC) seated inside a stainless steel guide tube. The electrode and guide tube were advanced into the brain with a hydraulic microdrive (MO-95; Narishige, East Meadow, NY). The electrode signal was amplified (MDA-4I; Bak Electronics, Mount Airy, MD) and band-pass filtered (model 3700; Krohn-Hite, Brockton, MA) between 0.6 and 6.0 kHz. Single-unit activity was isolated using a two-window, time-voltage discriminator (Model DDIS-1; Bak Electronics). Neural events that passed through both windows were classified as originating from a single neuron. The time of occurrence of each action potential was stored for on- and off-line analyses.

The vPFC was identified by its anatomical location and its neurophysiological properties. The vPFC is located anterior to the arcuate sulcus and area 8a and lies below the principal sulcus (Cohen et al. 2004b; Romanski and Goldman-Rakic 2002). vPFC neurons were further characterized by their strong responses to auditory stimuli (Cohen et al. 2004b; Gifford 3rd et al. 2005; Newman and Lindsley 1976; Romanski and Goldman-Rakic 2002).

RECORDING STRATEGY.

An electrode was lowered into the left vPFC; the left hemisphere of rhesus monkeys is thought to be specialized for processing vocalizations (Hauser and Andersson 1994; Poremba et al. 2004). To minimize sampling bias, any neuron that was isolated was tested.

Noise procedures.

Once a neuron was isolated, we randomly chose three classes of the band-limited noise; in some of the earlier recording sessions, we used only two classes of noise. Next, the monkeys participated in blocks of trials of the passive-listening task. In each block, the noise exemplars were presented in a balanced pseudorandom order. The intertrial interval was 1–2 s. Neural data were collected from ≥10 presentations of each of the 10 exemplars in each class.

STRF procedures.

Once a neuron was isolated, the monkeys participated in blocks of trials of the passive-listening task. In each block, the vocalization exemplars were presented in a balanced pseudorandom order. Because we needed to collect neural data from ≥40 s (Sen et al. 2001; Theunissen et al. 2000) of auditory-stimulus presentation, data were collected from ≥200 successful trials at an intertrial interval of 1–2 s.

NEURAL DATA ANALYSIS.

Neural activity recorded during the passive-listening task was tested during the “baseline” and “stimulus” periods. The baseline period began 50 ms after the monkey fixated the central LED and ended 50 ms before auditory-stimulus onset; neural activity was aligned relative to the time when the monkey fixated the central LED. The stimulus period began at auditory-stimulus onset and ended at its offset; neural activity was aligned relative to the onset of the auditory stimulus. Data were analyzed in terms of a vPFC neuron's firing rate (i.e., the number of action potentials divided by task-period duration).

A two-tailed t-test determined whether a vPFC neuron's response was modulated by the band-limited noise or the vocalizations. A neuron was classified as “auditory” if the results of the t-test indicated that the neuron's mean baseline- and stimulus-period firing rates were reliably different at a level of P < 0.05.
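A minimal sketch of this classification is shown below, with assumed variable names and event times; an unpaired comparison is used here, although the original analysis may have paired the baseline- and stimulus-period rates within trials.

```matlab
% Classify a neuron as "auditory" by comparing baseline- and stimulus-period rates.
% spikeTimes: cell array, one vector of spike times (s) per trial, aligned to fixation;
% fixOn, stimOn, stimOff: assumed (constant) event times in seconds.
baseDur  = (stimOn - 0.05) - (fixOn + 0.05);
baseRate = cellfun(@(st) sum(st >= fixOn + 0.05 & st < stimOn - 0.05) / baseDur, spikeTimes);
stimRate = cellfun(@(st) sum(st >= stimOn & st < stimOff) / (stimOff - stimOn), spikeTimes);
[~, p]   = ttest2(baseRate, stimRate);            % two-tailed by default
isAuditory = p < 0.05;
```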

Noise procedures.

The selectivity of a vPFC neuron to the spectrotemporal modulations of each noise exemplar was quantified with three analyses: a z-score (response-strength) analysis, an information analysis, and a response-selectivity analysis.

  • z-score analysis

The z-score (Grace et al. 2003) was calculated on a neuron-by-neuron basis and class-by-class basis. This measure quantifies the normalized difference between a neuron's stimulus-period firing rate and the baseline-period firing rate. The z-score was defined as

$$z = \frac{\mu_s - \mu_b}{\sqrt{\sigma_s^2 + \sigma_b^2 - 2\,\mathrm{Cov}(s,b)}} \qquad (1)$$

where μs is the mean firing rate during the stimulus period, μb is the mean firing rate during the baseline period, σs2 is the variance of the response during the stimulus period, σb2 is the variance of the response during the baseline period, and Cov(s, b) is the covariance between the stimulus- and baseline-period firing rates.
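As a concrete illustration, Eq. 1 can be computed from per-trial firing rates as follows (variable names are assumptions):

```matlab
% z-score for one neuron and one noise class from per-trial firing rates.
mu_b = mean(baseRate);  mu_s = mean(stimRate);
v_b  = var(baseRate);   v_s  = var(stimRate);
C    = cov(stimRate, baseRate);                       % 2x2 covariance matrix
z    = (mu_s - mu_b) / sqrt(v_s + v_b - 2*C(1, 2));   % Eq. 1
```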

  • Information analysis

The amount of information was calculated on a neuron-by-neuron basis and a class-by-class basis. “Band-limited noise information” was the amount of information carried in the stimulus-period firing rates of vPFC neurons regarding differences between the noise exemplars.

We quantified the amount of information (Cover and Thomas 1991; Shannon 1948a,b) carried in the firing rate of vPFC neurons using a formulation analogous to the one we described previously (Cohen et al. 2004b; Gifford 3rd and Cohen 2004). Information (I) was defined as

$$I = \sum_{s}\sum_{r} P(s,r)\,\log_2\!\left[\frac{P(s,r)}{P(s)\,P(r)}\right] \qquad (2)$$

where s is the index of each noise exemplar in a particular class of band-limited noise, r is the index of the stimulus-period firing rate, P(s, r) is the joint probability, and P(s) and P(r) are the marginal probabilities.

To eliminate biases in the amount of information arising from small sample sizes, information rates were bias-corrected (Cohen et al. 2004b; Gifford 3rd and Cohen 2004; Grunewald et al. 1999; Panzeri and Treves 1996). On a neuron-by-neuron basis, we first calculated the amount of information from the original data and from bootstrapped trials. In bootstrapped trials, the relationship between a neuron's firing rate and the noise exemplar was randomized and then the amount of information was calculated. This process was repeated 500 times and the median value from this distribution of values was determined. The bias-corrected information was calculated by subtracting the median amount of information obtained from bootstrapped trials from the value obtained from the original data.
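The sketch below illustrates this bias-correction procedure; the discretization of the firing rates into eight bins and the helper function are assumptions, not the exact implementation used in the study.

```matlab
% Bias-corrected information between exemplar identity (stimID, 1-10 per trial)
% and single-trial stimulus-period firing rates (rates).
edges   = linspace(min(rates), max(rates), 9);
rateBin = discretize(rates, edges);                    % 8 firing-rate bins (assumed binning)
Iraw    = mutualInfoBits(stimID, rateBin);
Iboot   = zeros(500, 1);
for k = 1:500
    shuffled = stimID(randperm(numel(stimID)));        % break the rate-stimulus relationship
    Iboot(k) = mutualInfoBits(shuffled, rateBin);
end
Icorrected = Iraw - median(Iboot);                     % bias-corrected information (bits)

function I = mutualInfoBits(s, r)
% Plug-in estimate of the mutual information (bits) between two discrete label vectors (Eq. 2).
P  = accumarray([s(:), r(:)], 1);
P  = P / sum(P(:));
Pm = sum(P, 2) * sum(P, 1);        % product of the marginal probabilities
nz = P > 0;
I  = sum(P(nz) .* log2(P(nz) ./ Pm(nz)));
end
```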

  • Response-selectivity index

The response-selectivity index (Grunewald and Skoumbourdis 2004; Rolls and Tovee 1995; Vinje and Gallant 2002) was calculated on a neuron-by-neuron basis and a class-by-class basis. The response-selectivity index (RI) was defined as

$$RI = \frac{1 - \left(\sum_{i=1}^{n} r_i/n\right)^{2} \Big/ \left(\sum_{i=1}^{n} r_i^{2}/n\right)}{1 - 1/n} \times 100 \qquad (3)$$

where ri is the average stimulus-period firing rate for a band-limited noise exemplar and n is the number of band-limited noise exemplars in a noise class. An RI of 0 indicates that a neuron's firing rate is the same for each exemplar in a class. A value of 100 indicates that the neuron responds preferentially to only one exemplar in a class.
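A sketch of Eq. 3 follows, under the assumption that RI corresponds to the activity-fraction measure used in the cited studies, rescaled to run from 0 to 100.

```matlab
% Response-selectivity index from the mean stimulus-period firing rates r
% of the n exemplars in one noise class.
n  = numel(r);
A  = (sum(r) / n)^2 / (sum(r.^2) / n);   % activity fraction (1 for uniform responses)
RI = 100 * (1 - A) / (1 - 1/n);          % 0 = equal rates; 100 = response to a single exemplar
```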

STRF procedures.

The STRF of a neuron was generated using STRFPAK (http://strfpak.berkeley.edu), a MATLAB toolbox developed by Theunissen and colleagues (Grace et al. 2003; Sen et al. 2001; Theunissen and Doupe 1998; Theunissen et al. 2000, 2001). A STRF was constructed from a neuron's responses to vocalizations.

We defined the STRF as the first-order Volterra kernel that relates the spectrographic representation of an auditory-stimulus exemplar to a peristimulus time histogram (PSTH) (DePireux et al. 2001; Eggermont et al. 1983; Escabí and Schreiner 2002; Theunissen et al. 2000). A neuron's STRF is the set of hi(t) such that

$$r_{pre}(t) = r_0(t) + \sum_{i=1}^{n_f} \int h_i(\tau)\, s_i(t - \tau)\, d\tau \qquad (4)$$

The predicted firing rate of the neuron is rpre(t). Each si(t) is the log of the amplitude envelope of an auditory-stimulus exemplar in frequency band i; the set of si(t) is a time-varying spectrographic representation of the exemplar. For the STRF estimation, we used a spectrographic representation with nf = 191 frequency bands of width 62.5 Hz, with center frequencies spanning the range 0.15–12.5 kHz. The time lag τ varied from −300 to +300 ms to capture the memory of the neural system and the correlation time of the stimulus.

r0(t), the time-varying mean firing rate, was obtained by smoothing the PSTH averaged across all stimuli except the sound being analyzed with a 40-ms Hamming window. r0(t) is thus the expected mean response to all sounds irrespective of the actual spectrotemporal content. The STRF model then captures the deviations in the response that result from the specific spectrotemporal features of the sound being processed. We called this model the “time-varying mean” STRF. We also estimated the STRF with a fixed mean rate, with r0(t) replaced by its mean value over time. We called this more classical model the “fixed-mean” STRF.
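A minimal sketch of the two expected-rate models follows (assumed variable names, 1-kHz sampling):

```matlab
% Time-varying and fixed expected mean rates for the sound indexed by k.
% psth: (nStim x nTime) matrix of trial-averaged responses sampled at 1 kHz.
others  = setdiff(1:size(psth, 1), k);
w       = hamming(40); w = w / sum(w);                 % 40-ms Hamming smoothing kernel
r0      = conv(mean(psth(others, :), 1), w, 'same');   % time-varying mean rate
r0Fixed = repmat(mean(r0), 1, numel(r0));              % fixed-mean alternative
```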

The estimated and predicted neural responses and the spectrographic representation of the sound si(t) were all sampled at 1 kHz. This sampling rate was more than sufficient to capture the observed spiking precision as well as the temporal variations in the spectrographic representation, which are band-limited by the bandwidth of the filter or 62.5 Hz (Flanagan 1980).

The STRF [hi(t)] is estimated by minimizing the mean-square difference between the predicted [rpre(t)] and the actual [ract(t)] firing rates. The solution for hi(t) is given by a set of linear equations for each temporal frequency ω in the Fourier domain of t.

In vector notation, this set of equations is Aω·Ĥω = Ĉω. The autocorrelation matrix of the auditory stimulus (Aω) in the frequency domain is defined as

$$\left[A_\omega\right]_{ij} = \left\langle S_i^{*}(\omega)\, S_j(\omega)\right\rangle \qquad (5)$$

where Si(ω) is the Fourier transform of si(t), * denotes the complex conjugate, and 〈·〉 denotes the cross-moments calculated by averaging over the samples. Ĥω is the vector whose components Ĥi(ω) are the Fourier transforms of the hi(t)

$$\hat{H}_\omega = \left[\hat{H}_1(\omega),\, \hat{H}_2(\omega),\, \ldots,\, \hat{H}_{n_f}(\omega)\right]^{T} \qquad (6)$$

Ĉω is the Fourier transform of the cross-correlation between the stimulus envelope and the spike train

$$\left[\hat{C}_\omega\right]_{i} = \left\langle S_i^{*}(\omega)\, R_{act}(\omega)\right\rangle \qquad (7)$$

where Ract(ω) is the Fourier transform of ract(t) − r0(t).

A neuron's STRF was calculated by first inverting Aω at each temporal frequency ω: Ĥω = Aω⁻¹·Ĉω. This inversion was performed using the pseudoinverse methodology. That is, the cross-correlation vector is projected into the subspace spanned by the eigenvectors of the stimulus autocorrelation with the largest eigenvalues. In this fashion, the inversion performs both a normalization step (normalizing for the nonuniform power distribution found in the vocalizations) and a regularization step (limiting the estimation of STRF parameters to the spectrotemporal modulations that have significant power). The actual extent of the stimulus subspace with significant power is determined by cross-validation: STRFs are estimated using different numbers of eigenvalues and the STRF that yields the best prediction on validation data is chosen (Theunissen et al. 2001; Woolley et al. 2006).
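The sketch below illustrates this regularized inversion at a single temporal frequency. The matrix layouts, variable names, and fixed eigenvalue cutoff are assumptions; in the procedure described above, the cutoff is chosen by cross-validation.

```matlab
% Regularized (pseudoinverse) solution of Aw*Hw = Cw at one temporal frequency w.
% Sw: (nSegments x nf) matrix of stimulus Fourier coefficients Si(w);
% Rw: (nSegments x 1) vector of response Fourier coefficients Ract(w).
A = (Sw' * Sw) / size(Sw, 1);               % stimulus autocorrelation matrix, Eq. 5
C = (Sw' * Rw) / size(Sw, 1);               % stimulus-response cross-correlation, Eq. 7
[V, D]   = eig((A + A') / 2);               % eigendecomposition of the (Hermitian) matrix
[d, idx] = sort(real(diag(D)), 'descend');
V        = V(:, idx);
tol  = 0.01;                                % assumed cutoff; chosen by cross-validation in practice
keep = d > tol * d(1);                      % restrict to the high-power stimulus subspace
Hw   = V(:, keep) * diag(1 ./ d(keep)) * V(:, keep)' * C;   % pseudoinverse solution for Hw
```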

Goodness-of-fit estimates were determined in two different ways: 1) coherence and 2) cross-correlation. The first goodness-of-fit estimate, I, was determined by integrating the coherence function (γ2) over all ω. Coherence was defined as

$$\gamma^{2}(\omega) = \frac{\left|\left\langle R_{act}^{*}(\omega)\, R_{pre}(\omega)\right\rangle\right|^{2}}{\left\langle\left|R_{act}(\omega)\right|^{2}\right\rangle \left\langle\left|R_{pre}(\omega)\right|^{2}\right\rangle} \qquad (8)$$

where Rpre(ω) is the Fourier transform of rpre(t). The closer that γ2 is to a value of 1, the more linearly related are Ract and Rpre. The second goodness-of-fit estimate was the cross-correlation coefficient (CC) between the predicted and actual firing rates

$$CC = \frac{\sum_{t}\left[r_{act}(t) - \bar{r}_{act}\right]\left[r_{pre}(t) - \bar{r}_{pre}\right]}{\sqrt{\sum_{t}\left[r_{act}(t) - \bar{r}_{act}\right]^{2}\,\sum_{t}\left[r_{pre}(t) - \bar{r}_{pre}\right]^{2}}} \qquad (9)$$

Resampling methods were then used to obtain bias-corrected values and confidence intervals for both the coherence and CC measures (Hsu et al. 2004).

For both goodness-of-fit estimates, rpre was calculated by convolving the STRF with an auditory-stimulus exemplar, adding r0(t), and rectifying the result. ract was generated by smoothing the PSTH with windows of variable size. To compare data across the population of neurons, we report the cross-correlation coefficient obtained with a constant window size (31 ms).
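A sketch of both goodness-of-fit measures follows, with mscohere standing in for the coherence estimator and the coherence-to-information conversion of Hsu et al. (2004) assumed; the window parameters are also assumptions.

```matlab
% Goodness of fit between predicted (rpre) and actual (ract) firing rates at 1-kHz sampling.
c  = corrcoef(rpre, ract);
CC = c(1, 2);                                                   % cross-correlation coefficient, Eq. 9
[g2, f] = mscohere(ract - mean(ract), rpre - mean(rpre), ...
                   hamming(256), 128, 256, 1000);               % coherence estimate, cf. Eq. 8
info = -trapz(f, log2(1 - min(g2, 1 - eps)));                   % integrate coherence over frequency
```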

VERIFICATION OF RECORDING LOCATIONS.

Magnetic resonance images were used to visualize the recording microelectrode in each monkey's brain. These images were obtained at the Dartmouth Brain Imaging Center using a GE 1.5T scanner (3-D T1-weighted gradient echo pulse sequence with a 5-in. receive-only surface coil) (Cohen et al. 2004a,b). These images confirmed that our recordings were from the vPFC.

RESULTS

Acoustic properties of rhesus vocalizations: modulation spectrum

The modulation spectra of rhesus vocalizations were calculated from a corpus of recordings that identified 10 classes of vocalizations (Hauser 1998). We calculated the modulation spectrum for each exemplar within a vocalization class. The average modulation spectrum for each of the 10 classes of rhesus vocalizations is shown in Fig. 3. As can be seen, these “class-based” spectra had most of their energy at low spectral and temporal frequencies. A similar pattern was observed when we calculated the composite modulation spectrum (Fig. 3K). The composite modulation spectrum was calculated by averaging together the 10 class-based spectra (Fig. 3, A–J). In general, the composite modulation spectrum had high power at low modulation frequencies that rapidly decreased at higher frequencies. This pattern is characteristic of all natural sounds (Singh and Theunissen 2003). In addition, most of the power for the medium to higher spectral modulation frequencies was found only at the very lowest temporal modulations, a characteristic feature of animal vocalizations (Singh and Theunissen 2003). This property reflects the fact that sounds in vocalizations that have the most spectral structure (i.e., harmonic sounds with a clear pitch percept, such as vowels in human speech) tend to be slower.

FIG. 3.

Modulation power spectra for 10 classes of rhesus vocalizations and their composite modulation spectra. A: aggressive. B: copulation scream. C: grunt. D: coo. E: gecker. F: girney. G: harmonic arch. H: shrill bark. I: scream. J: warble. K: composite modulation spectrum. Composite modulation spectrum was calculated by averaging together the 10 class-based modulation spectra (A–J). For the class-based and composite modulation spectra, power density is indicated by color using a dB scale, with red showing the spectrotemporal modulations with the most energy; this scale is shown on the color bar next to C. L: coefficient of variance of the modulation spectra calculated from the class-based modulation spectra. Differences in the coefficient of variance are indicated by color using a dB scale, with red showing the greatest variance, as shown on the color bar next to the panel. For all of the panels, values along the x-axis show the frequency of the temporal modulations (ωt) and values along the y-axis show the frequency of the spectral modulations (ωf). Black lines in each plot are contour lines enclosing 50–90% of the power.

To put these modulation spectra into context, we compared the modulation spectra of rhesus vocalizations with three other classes of natural sound: zebra finch song, speech (American English), and environmental sounds (i.e., nonvocalizations). The modulation spectra for these classes are shown in Fig. 4A. To quantify the similarity between the rhesus-vocalization, zebra finch song, speech, and environmental modulation spectra, we computed pairwise correlations between the individual modulation spectra. Table 1 shows the correlation-coefficient matrix for these comparisons, and Fig. 4B shows the two-dimensional representation of this matrix obtained using multidimensional scaling. The correlation coefficients among the different rhesus vocalization classes are all very high (average = 0.89), illustrating the fact that the modulation spectra for the different vocalizations have a similar shape. In other words, similar spectrotemporal features (e.g., similar pitch, tempo) tend to dominate all rhesus vocalizations. The modulation spectra of rhesus vocalizations are also similar to those of other animal vocalizations, although to a lesser extent (zebra finch song: r = 0.85; human speech: r = 0.82). The modulation spectrum of environmental sounds has similar energy at the very low spectrotemporal modulations but lacks the higher spectral modulations characteristic of harmonic sounds and is thus less similar to the rhesus vocalizations (r = 0.79). The multidimensional scaling representation also illustrates these relationships (Fig. 4B): most of the rhesus vocalizations are clustered relatively close together, whereas zebra finch song, human speech, and environmental sounds are located at the edge of the cloud of points and further away.

FIG. 4.

Modulation spectrum of zebra finch song, human speech, environmental sounds, and a comparison with rhesus vocalizations. A: modulation power spectra of zebra finch song, American English as spoken by both adult male and female speakers, and environmental sounds. Axes, color scale, and contour lines are the same as in Fig. 3. B: a multidimensional scaling representation of the similarity between the modulation spectra of each of the 10 classes of rhesus vocalizations and the modulation spectra of zebra finch song, human speech, and environmental sounds. Scaling analysis used 1 − r, where r is the correlation coefficient obtained through all pairwise comparisons of the spectra; the actual correlation values are shown in Table 1. Squared distances in this plot make up 96% of the actual squared distances, validating the 2-dimensional representation.

TABLE 1.

Correlation coefficients between the modulation spectra of different rhesus vocalization classes and between vocalization classes, zebra finch song, human speech, and environmental sounds

On average, the class-based modulation spectra have similar shapes, but there are clear differences between them. For example, the more harmonic vocalizations such as the coo and the warble have strong power at higher and specific spectral modulations. In contrast, the modulation spectrum for the harmonic arch has power at intermediate spectral and temporal modulations. Correspondingly, one can see that the warble and coo are clustered together in the multidimensional scaling representation, with the harmonic arch located further away (Fig. 4B). The copulation scream and the grunt also stand out as being more different from the other vocalization classes.

To examine more systematically which spectrotemporal modulations varied across different rhesus-vocalization classes, we calculated the coefficient of variance between the class-based spectra (Fig. 3L). At low spectral- and temporal-modulation frequencies, the coefficient of variance was low. However, at spectral-modulation frequencies between 2 and 5 cycles/kHz, the amount of variance was high. Similarly, at high temporal-modulation frequencies (between 5 and 20 Hz), the coefficient of variance was also high. The average between-class coefficient of variance of the modulation spectra was 1.98.

Within each vocalization class, we also observed variability between the different exemplars. The within-class variability, though, was smaller than the between-class variance and was concentrated in the modulations that were the most characteristic of that vocalization type (data not shown). In other words, the modulation frequencies with the largest within-class variability did not necessarily overlap with the modulation frequencies with high between-class variability. The average within-class coefficient of variance was 1.01.

The intermediate spectrotemporal modulations, shown in Fig. 3L, may therefore be more informative for determining the class of the vocalization. We thus hypothesized that vPFC neurons may be particularly sensitive to these spectrotemporal modulations. To test this hypothesis, we recorded the responses of vPFC neurons to band-limited noise with spectrotemporal modulations similar to those found in vocalizations (see responses of vpfc neurons to band-limited noise).

Auditory responses of vPFC neurons

We recorded from the left vPFC of two rhesus monkeys; 33 neurons were collected from one monkey and 24 from the second. Neural data were pooled for presentation because the results of our analyses were not reliably different (P > 0.05) between the two monkeys.

RESPONSES OF VPFC NEURONS TO BAND-LIMITED NOISE.

We report data from 57 auditory neurons in which we were able to record data in response to ≥10 presentations of each of the 10 exemplars in a tested noise class. In preliminary studies, we tested only two noise classes (n = 11), but for most of the reported data, we tested three noise classes.

vPFC activity was modulated strongly by band-limited noise. An example neuron is shown in Fig. 5A. This vPFC neuron responded robustly to each of the three classes of band-limited noise. A different neuron that also responded robustly to band-limited noise is shown in Fig. 5B. On visual inspection, the response profiles for these two example neurons to the three different noise classes were very similar. This result was quantified for the entire data set in the following analyses.

FIG. 5.

A and B: response profile of 2 ventrolateral prefrontal cortex (vPFC) neurons to 3 different classes of band-limited noise. Modulation spectrum of each class of noise is shown next to the raster plots. Rasters and histograms are aligned relative to onset of the noise stimulus; the solid black line indicates stimulus onset. Histograms were generated by binning spike times into 40-ms bins. Insets, each panel: results of the z-score analysis calculated for each of these neurons (see text for more details).

To test whether the tuning in the response of a vPFC neuron to each noise class matched the variance seen in the composite modulation spectrum, we conducted z-score, information, and response-selectivity index analyses. The details of these analyses can be found in methods.

Results of the z-score analysis for each individual vPFC neuron are shown in Figs. 5 and 6A. On a neuron-by-neuron basis and as a function of each noise class tested, we calculated a z-score value from the stimulus-period firing rate. Next, using a color map, we transformed these z-score values into a color representation, with red representing the highest z-score and blue representing the lowest z-score. Colored squares were then plotted as a function of the spectral and temporal modulations of each z-score's corresponding noise class. The insets in Fig. 5 show results of this z-score analysis for each of the two single-neuron examples. Each of the small panels in Fig. 6A represents the results of this analysis for a subset of the recorded neurons.

FIG. 6.

Responses and mean z-score values of vPFC responses for different classes of band-limited noise. A: each of the smaller panels illustrates the z-score analysis conducted for 24 of the vPFC neurons. Within each of these panels, the calculated z-score is plotted as a function of the spectrotemporal modulations of the tested classes of band-limited noise. Value of the z-score is indicated by the color bar to the right of each of the panels, with red corresponding to the highest stimulus-period firing rate. First 2 panels on the first row are the z-score values calculated from the 2 neurons shown in Fig. 5. B: mean (a) and SD of the z-scores (b) are plotted as a function of the spectrotemporal modulations of the test classes of band-limited noise. Within each panel, the values are indicated by color, with red corresponding to the highest mean (SD) z-score; see scale bars. For all of the panels, values along the x-axis show the frequency of the temporal modulations (ωt) and values along the y-axis show the frequency of the spectral modulations (ωf). These data are plotted under the assumption that the neurons respond to negative and positive temporal modulations in a comparable manner; we do not have enough data to fully characterize positive and negative temporal modulations. Beneath these 2 panels, we replot the composite modulation spectrum (c) and the coefficient of variance of the modulation spectrum (d) from Fig. 3.

Using an analogous method, we calculated the mean z-score (Fig. 6Ba) and the SD of the z-scores (Fig. 6Bb) from the population of recorded neurons. These values were transformed into a color representation and plotted as a function of the spectral and temporal modulations of the tested noise classes. As can be seen, the tuning pattern in the z-score mean and SD does not seem to match the pattern of the coefficient of variation in the composite modulation spectrum; compare the distribution of the z-score mean (Fig. 6Ba) and SD values (Fig. 6Bb) with the composite modulation spectrum (Fig. 6Bc) and the coefficient of variance of the modulation spectrum (Fig. 6Bd); Bc and Bd of Fig. 6 were duplicated from Fig. 3, K and L, respectively.

Results of the information analysis and the response-selectivity index analysis are shown in Fig. 7, top and bottom, respectively. The mean amount of band-limited noise information ranged from 0.1 to 0.2 bits; the theoretical maximum value is 3.3 bits. The SD of the information ranged from 0.02 to 0.09 bits. Using a variant of this information analysis (Chechik et al. 2006), we tested the amount of band-limited noise information at different temporal resolutions from 50 to 500 ms. Because the duration of each noise burst was 500 ms, the 500-ms resolution was equivalent to calculating the information value using a rate code. We could not identify any structure at any of the resolutions, nor could we identify a trend as the resolution increased from 50 to 500 ms (data not shown). The mean response-selectivity index values ranged from 2.7 to 9.0 and the SD ranged from 1.6 to 7.9. As with the z-score analysis, the tuning of vPFC activity in both analyses does not appear to match the variance of the acoustic features of the vocalizations.

FIG. 7.

Population response of vPFC neurons to band-limited noise. Mean and SD of band-limited noise information (top) and mean and SD of the response-selectivity-index value (bottom) are plotted as a function of the spectrotemporal modulations of the tested classes of band-limited noise. Within each panel, the values are indicated by color, with red corresponding to the highest mean (SD) value obtained in these experiments; see scale bars. Values along the x-axis show the frequency of the temporal modulations (ωt) and values along the y-axis show the frequency of the spectral modulations (ωf). Data are plotted under the assumption that the neurons respond to negative and positive temporal modulations in a comparable manner; we do not have enough data to fully characterize positive and negative temporal modulations.

STRFS OF VPFC NEURONS.

We report data from the 45 auditory neurons in which we were able to record data from ≥10 presentations of each of the 10 vocalization exemplars. An example of a vPFC neuron's response to these exemplars is shown in Fig. 8. As seen, this neuron responded to each of the 10 exemplars. For comparison, the maximum firing rate, independent of vocalization exemplar, in our population varied between 0.89 and 40.6 spikes/s. Based on its stimulus-period firing rate, the selectivity of this neuron to the different vocalizations is comparable to that seen in our previous studies (Cohen et al. 2004b).

FIG. 8.

Response profile of a vPFC neuron to each of the 10 vocalization exemplars: aggressive (A), copulation scream (B), grunt (C), coo (D), gecker (E), girney (F), harmonic arch (G), shrill bark (H), scream (I), and warble (J). Rasters and histograms are aligned relative to onset of the vocalization; the solid gray line indicates stimulus onset. Histograms were generated by binning spike times into 40-ms bins.

The STRF is an estimate of the best linear transfer function between the spectrographic representation of a sound and the deviations of a neuron's firing rate from an expected mean rate. We estimated two STRF models. In the fixed-mean rate STRF model, the expected mean rate is constant in time. In the time-varying mean STRF model, the expected mean rate is allowed to vary slowly in time to capture the phasic nature of the response that would be common to all sounds irrespective of their spectrotemporal content. STRFs were estimated and validated from neural responses to vocalization exemplars.

An example of a fixed-mean rate STRF is shown in Fig. 9. For this neuron, as well as for all of the tested neurons, there was very little identifiable structure (i.e., clearly bounded regions of excitation or inhibition) in its STRF. The STRF shown in Fig. 9A appears to have some small regions of excitation and inhibition. However, as shown in the bottom and right panels, the gain factors at these time delays and frequency bands are not significantly different from zero, indicating that structure in the STRF is not statistically reliable. Consistent with this observation, when we used this fixed-mean rate STRF to predict the neuron's peri-stimulus time histogram, we found that it was not a good predictor of this neuron's firing pattern (Fig. 9B). That is, a linear function that relates the spectrographic representation of a sound to firing rate is not a good model of vPFC activity. Another fixed-mean rate STRF is shown in Fig. 10. The pattern seen here is similar to that seen in Fig. 9. The STRF does not have any discernable structure and does not have reliable frequency or temporal tuning. None of our tested neurons had statistically reliable spectral or temporal tuning.

FIG. 9.

Spectrotemporal receptive field (STRF) of a vPFC neuron, its tuning profiles, and its predictive capacity. A: STRF (top left) is generated from the data shown in Fig. 8. Time is plotted along the x-axis relative to stimulus onset, which occurs at time = 0. Color scale in the STRF indicates the probability that a frequency elicits an action potential: red indicates a high probability and blue indicates a low probability. Spectral and temporal tuning profiles of this STRF taken at the time and frequency band of peak gain are shown to the right of and below, respectively, the STRF. Thick black line in each profile is the mean value and the dotted lines indicate the 95% confidence interval of the mean values. B: prediction of firing rate. Data in each panel illustrate the predictive capacity of the STRF. Black lines are the neuron's actual peristimulus time histograms (PSTHs) to a particular vocalization, whereas the gray lines are the PSTHs predicted from the STRF.

FIG. 10.

STRF of a vPFC neuron, its tuning profiles, and its predictive capacity. Same format as Fig. 9.

This lack of reliable structure cannot be attributed to the lack of responsivity of the vPFC neuron to auditory stimuli: the STRF shown in Fig. 9 was derived from the neural responses shown in Fig. 8. Clearly, this neuron responded strongly to all of the vocalization exemplars.

Another possibility that we considered was that the neural responses to sound were dominated by a characteristic invariant phasic response to all sounds (e.g., a strong onset response followed by a slower decay). In this scenario, the neurons could still respond to specific spectrotemporal features in individual sounds, although this selective response would ride above the stronger invariant response and could disappear in the classical STRF analysis. For this reason we also estimated the time-varying mean rate STRF as explained above and in methods.

The time-varying mean STRF for the neuron in Fig. 8 is shown in Fig. 11; the fixed-mean rate STRF is shown in Fig. 9. The time-varying mean STRF still did not have any reliable structure as seen by its frequency and temporal tuning. Similarly, we did not find any reliable structure in any of the time-varying mean STRFs for the other neurons in our data set. However, in all cases, the time-varying mean STRF was a somewhat better predictor of the neuron's response than the fixed-mean STRF model (compare Fig. 11B with Fig. 9B).

FIG. 11.

STRF of a vPFC neuron, its tuning profiles, and its predictive capacity. Same format as Fig. 9.

Prediction of neural response.

The capacity of these STRFs to predict the PSTHs of a vPFC neuron was quantified using two different measures (see methods). The first measure was the correlation coefficient between the predicted neural response and the actual response. The second measure calculated the coherence between the predicted and actual responses; we integrated the coherence over all tested frequencies and report this value as an information value.

For the fixed-mean STRF model, the mean correlation-coefficient ratio was 0.13 (SD = 0.13). The mean information value was 0.34 (SD = 0.21). As one would predict from the data shown in Fig. 11, the cross-correlation and the information values were higher when the time-varying mean STRF model was used. In this instantiation, the mean correlation-coefficient ratio was 0.64 (SD = 0.32), which was reliably higher (t-test, P < 0.05) than the mean value generated using the fixed-mean STRF model. Similarly, the time-varying mean information value (0.56, SD = 0.2) was reliably larger (t-test, P < 0.05) than the mean value generated using the fixed-rate technique. A neuron-by-neuron analysis was also consistent with these analyses. Significantly more neurons had higher correlation-coefficient (information) values when calculated using the time-varying mean STRF model than when calculated using the fixed-mean STRF model (Wilcoxon, P < 0.05).

Because the STRFs were qualitatively similar for both the fixed-mean and time-varying mean firing rate models and because the structure of the STRFs was not reliable in either model, the difference in prediction cannot be attributed to differences in the STRFs themselves. Instead, it suggests that the improved prediction reflects the fact that our time-varying model was able to capture the transient response properties of the vPFC to the presence of a sound, irrespective of its nature. Nevertheless, the fact that the mean correlation-coefficient ratio was 0.64 shows that there remains a significant fraction of the neural response that is indeed sensitive to the stimulus properties but cannot be captured by the linear STRF. This is clearly seen, for example, in the response to the scream exemplar shown in Figs. 9 and 11 as well as in the response seen in Fig. 8.

DISCUSSION

In this study, we found that the acoustic structure of rhesus vocalizations, as characterized by the modulation spectrum, had both general features that are found in other animal vocalizations, including human speech, and unique features that could be used to identify each vocalization type. We then tested whether the responses of vPFC neurons are tuned to the differences in the modulation spectra that were observed between different classes of rhesus vocalizations; such tuning would putatively facilitate an animal's capacity to discriminate between different vocalizations. The results of this neural study suggested that the tuning of vPFC neurons does not maximize the acoustic differences that exist between vocalizations. In addition, we found that a first-order model, a spectrotemporal receptive field, is not a good predictor of vPFC activity. Below, we compare the acoustic features of rhesus vocalizations with other classes of natural stimuli, discuss the relationship between vPFC activity and the acoustic features of vocalizations, discuss possible confounds, and compare our results with other vPFC studies.

Acoustic features of rhesus vocalizations

The modulation spectra for the 10 acoustic classes of rhesus vocalizations had relatively similar profiles. On average, most of the energy of these spectra was found at low spectral and temporal frequencies, with higher spectral modulations concentrated at the lowest temporal modulation frequencies. This pattern is similar to that found in other types of vocalizations, such as those produced by zebra finches (Singh and Theunissen 2003), and in human speech as shown here (Fig. 4). However, whereas the general patterns are similar and are a signature of animal vocalizations, the specific distribution of spectral and temporal modulations can vary appreciably for different types of vocalizations. Therefore the modulation spectrum could be used to characterize, cluster, and classify different types of vocalizations.

Relationship between vPFC activity and the acoustic features of vocalizations

This between-class variability, and other types of variability, may be an important feature underlying an animal's capacity to discriminate between different classes or exemplars of sounds (Singh and Theunissen 2003; Woolley et al. 2005). Specifically, animals (or neurons) could learn to ignore regions of low variance because these areas do not convey any information about differences between different types of sounds. In contrast, animals (or neurons) could attend to regions of high variance because these regions do convey information about differences between sound types.

This hypothesis predicts that the response profiles of auditory neurons may be matched to low- and high-variance regions of acoustic space. One form of this matching may be that the variance in neural activity correlates with the variance in acoustic space. Our examination of vPFC sensitivity to band-limited noise was a direct test of this matching hypothesis (see responses of vpfc neurons to band-limited noise). In this study, we created different exemplars of band-limited noise that tiled the regions of acoustic space that are encompassed by rhesus vocalizations and measured the response of vPFC neurons to these stimuli. As shown in Figs. 6 and 7, the match between the tuning in vPFC activity and the variance in acoustic space appears to be quite low; this issue is discussed more below.

A second prediction of this matching hypothesis is that the composite modulation-transfer function (i.e., the two-dimensional transformation of a STRF) would match the modulation spectra of a stimulus class; the modulation-transfer function shows the spectrotemporal modulation frequencies that activate a neuron (Chi et al. 1999; Miller et al. 2002; Singh and Theunissen 2003; Woolley et al. 2005). Because our STRFs did not show any significant structure (see Figs. 9–11), we did not test this form of matching.

However, a recent study in zebra finches provided evidence for this type of matching (Woolley et al. 2005). In that study, the investigators found that modulation-transfer functions of midbrain and forebrain auditory neurons in the zebra finch match the modulation spectra of finch songs and other natural classes of sound but are not matched to the modulation spectra of artificial sounds. This neural-stimulus matching further supports an evolutionary-based hypothesis for brain function: neural circuitry is not “all purpose” but is designed to detect and to discriminate those features that exist in the real world (Felsen and Dan 2005; Rieke et al. 1995; Woolley et al. 2005).

It is important to comment further on the relationship between the band-limited noise classes and the species-specific vocalizations. The band-limited noise stimuli were based on a statistical analysis of the spectrotemporal properties found in the species-specific vocalizations. They were not designed to mimic vocalizations; indeed, they do not sound anything like a vocalization. The purpose of these noise classes was to create a set of stimuli that tiled the informative (i.e., high-variance) and noninformative (i.e., low-variance) regions of the composite modulation spectrum and to determine whether vPFC neurons are differentially sensitive to stimuli with these spectrotemporal properties. As discussed above, we did not find evidence for this sensitivity. Future studies should examine the capacity of monkeys to discriminate between vocalizations when informative or noninformative regions of their modulation spectra have been filtered out.

Potential confounds

There are many different possible alternatives that might explain the results of our two neurophysiological studies (see strfs of vpfc neurons and responses of vpfc neurons to band-limited noise). Below, we highlight some of these possible alternatives.

One possibility may be related to the brain area itself. For instance, Woolley et al. (2005) found that neurons in the finch midbrain and in early auditory forebrain areas were tuned to the modulation spectrum of finch vocalizations. Similarly, STRFs with significant structure have been reported in the auditory cortex of nonhuman primates (deCharms et al. 1998) and other mammals (Kowalski et al. 1996; Linden et al. 2003; Shamma et al. 1993). That is, these auditory areas code acoustic features. In contrast, we recorded activity in the prefrontal cortex. Consequently, independent of the important differences between mammalian and avian neural organization (Jarvis et al. 2005), the differences between the current study and previous studies may simply reflect the fact that midbrain and early forebrain auditory areas are involved in acoustic-feature extraction (Mendelson and Grasse 1992; Middlebrooks et al. 1980; Shamma et al. 1993), whereas more central areas are not. However, this explanation may not be complete: other studies suggest that even the auditory cortex is not involved in feature extraction but may instead be involved in more complex types of auditory-object processing (Barbour and Wang 2003; Machens et al. 2004; Nelken et al. 2003).

A second possibility might relate to the nature of the behavioral task. In our task, the monkeys' only behavioral requirement was to fixate the central LED. We did not require them to overtly or even covertly process the auditory stimuli. Is it possible that these behavioral requirements “interfered” with vPFC processing? For example, had the monkeys been attending to some aspect of the sound, we might have found that the STRFs of vPFC neurons were predictive of their activity. Similarly, vPFC neurons might have shown a different pattern of selectivity to the band-limited noise if the monkeys had been engaged in a task that required them to discriminate between the different classes of noise.

Several lines of evidence suggest that vPFC activity might have been affected by the demands of our task. For instance, because attending to a fixation light is known to suppress (or even eliminate) auditory responses in the parietal cortex of rhesus monkeys (Gifford 3rd and Cohen 2004), it is possible that vPFC responses were suppressed relative to what would be seen in some other paradigm. On the other hand, because species-specific vocalizations are behaviorally relevant stimuli, the monkeys might have been (covertly) attending (Snyder et al. 2000) to these interesting stimuli, causing an increase in responsivity (Maunsell and Treue 2006). Similarly, when human subjects are engaged in tasks that require them to attend to the spatial or nonspatial attributes of an auditory stimulus, the frontal and parietal areas that are part of the anterior (nonspatial) and posterior (spatial) pathways are differentially activated (Ahveninen et al. 2006; Alain et al. 2001; Hart et al. 2004; Maeder et al. 2001; Rämä et al. 2004), but when subjects listen passively to a stimulus, these frontal and parietal areas are not engaged (Hart et al. 2004; Warren and Griffiths 2003). Thus it is reasonable to suggest that vPFC activity depends on whether the monkeys are passively or actively attending to auditory stimuli (Gifford 3rd et al. 2005; Hung et al. 2005; Sereno and Maunsell 1998).

The specific effect that our (passive-listening) fixation task would have on vPFC activity is unclear. Previous studies showed that the coding format of a neuron, including its STRF, can be altered by the demands of a behavioral task (Fritz et al. 2003; Gibson and Maunsell 1997; Merzenich et al. 1988). The tasks in these studies, however, differed from ours in that they did not involve passive listening: these studies used paradigms in which the behavioral relevance of a stimulus was altered systematically or in which animals used different strategies to map stimuli onto actions. Regardless, it seems unlikely that the difference between passive and active listening would fundamentally change the coding properties of a neuron (i.e., from not coding the acoustic features of an auditory stimulus to coding them). On the other hand, neural activity from passively viewing or listening monkeys is capable of coding high-order features of stimuli, such as their membership in a category (Cohen et al. 2006; Gifford 3rd et al. 2005; Hung et al. 2005). Future work should address how passive versus active listening modulates vPFC activity.

A third possibility might be related to the stimuli themselves. For example, it is well known that the effect that a tone burst has on the response of a neuron to a subsequent tone burst is complex and can last upwards of several seconds (Barbour and Wang 2003; Brosch and Schreiner 1997; Ulanovsky et al. 2003; Werner-Reiss et al. 2006). Because our interstimulus interval was between 1 and 2 s, the possibility exists that this forward-masking effect might have introduced a confound. However, it seems likely that the effect of the preceding stimuli would be to increase or decrease a neuron's response and not to fundamentally alter the neuron's coding properties. Indeed, a related study that examined the effect of stimulus-presentation rate on STRF structure (Valentine and Eggermont 2004) found that the effect was subtle and altered factors such as the bandwidth of the excitatory region of the STRF. Our long-duration stimuli with complex spectrotemporal structure might also introduce an “inherent” forward-masking effect. However, our analyses were designed specifically to take these spectrotemporal structures into account (Theunissen et al. 2004). Also, because the vast majority (roughly 90%) of the spectrotemporal energy of the rhesus vocalizations was found at temporal frequencies between 0 and 20 Hz (see Fig. 3), we limited our band-limited noise bursts to temporal frequencies in that range. However, it is possible that a relationship between vPFC activity and the band-limited noise would have been found had we tested at higher temporal frequencies.
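
The 0- to 20-Hz criterion can be expressed simply as the fraction of modulation-spectrum power falling below a temporal-frequency cutoff. The sketch below is illustrative only; `mod_spectrum` and the temporal-modulation axis `wt_hz` are hypothetical inputs defined on a shared grid.

    import numpy as np

    def energy_fraction_below(mod_spectrum, wt_hz, cutoff_hz=20.0):
        """Fraction of total modulation-spectrum power at temporal modulation
        frequencies whose magnitude is <= cutoff_hz."""
        power = np.asarray(mod_spectrum)                 # spectral x temporal modulation bins
        in_band = np.abs(np.asarray(wt_hz)) <= cutoff_hz
        return power[:, in_band].sum() / power.sum()

A calculation along these lines, applied to the composite modulation spectrum of the vocalizations, is what motivates the roughly 90% figure cited above.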

Finally, a more abstract possibility is that information about the signaler was being coded by these neurons. Gekkers and girneys are produced by juveniles, and copulation screams are produced only by males (Hauser 1998). Caller identity itself was not available to the tested monkeys because the vocalizations came from monkeys that were unknown to those used in this study. If signaler information of this kind was indeed coded, it would be consistent with the hypothesis, discussed in the following text, that the vPFC plays a role in auditory-object processing rather than feature extraction.

Comparison with other vPFC studies

The results of this study indicate that vPFC neurons did not differentiate reliably or meaningfully between different classes of band-limited noise (see Responses of vPFC neurons to band-limited noise) and did not have STRFs that were strongly predictive of vPFC activity (see STRFs of vPFC neurons). Although our data do not support the original hypotheses, the results appear to be consistent with previous work from our laboratory, in which we argued that the vPFC is involved in categorization and is part of a neural circuit that sorts socially meaningful signals into distinct categories (Cohen et al. 2006; Gifford 3rd et al. 2005). For instance, vPFC neurons respond similarly to vocalizations that transmit information about related items (e.g., food quality) but respond quite differently to vocalizations that transmit unrelated information. In a separate study, we showed that vPFC activity reflects transitions between functionally meaningful categories: vPFC neurons are modulated by transitions between vocalizations that transmit different types of information regarding food quality. They are not, however, modulated by transitions between vocalizations that transmit the same type of information, even though those vocalizations have distinct acoustic features. Nor are vPFC neurons modulated by transitions between different classes of band-limited noise; presumably, these artificial stimuli belong to the same category of “not functionally meaningful” (Gifford 3rd et al. 2005).

Not all studies, however, suggest a role for the vPFC in such abstract representations as categorization. An important study to consider is that by Romanski and colleagues (2005), in which they recorded vPFC activity while monkeys listened passively to species-specific vocalizations and attended to a central LED, a behavioral task quite similar to ours. Several findings in that study differed from our results. Importantly, a series of analyses suggested that the neural activity in their population reflects the acoustic properties of the vocalizations more than the functional information (e.g., food quality) that the vocalizations transmit. Romanski et al. also demonstrated that vPFC neurons were far more selective for vocalizations than we typically encountered. Using a monkey call-preference index (Tian et al. 2001), they reported that roughly 55% of their population responded preferentially to only one or two vocalizations. When we applied the same index to our data, we found that nearly 12% of our population responded preferentially to one or two vocalizations, whereas >35% responded preferentially to 10 vocalizations (data not shown). Finally, the population of neurons that Romanski et al. describe tends to have more tonic responses, whereas ours are more phasic.
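
For concreteness, one common formulation of such a preference index counts the vocalizations that drive a neuron above a criterion fraction (e.g., one half) of its maximal response; the exact definition used by Tian et al. (2001) may differ in detail, so the sketch below should be read as an illustration rather than a re-implementation. The input `rates` is a hypothetical array of baseline-corrected firing rates, one value per vocalization.

    import numpy as np

    def preferred_call_count(rates, criterion=0.5):
        """Number of vocalizations whose response exceeds `criterion` times
        the best (maximal) response across the vocalization set."""
        rates = np.clip(np.asarray(rates, dtype=float), 0.0, None)  # ignore suppressed responses
        if rates.max() <= 0:
            return 0
        return int(np.sum(rates >= criterion * rates.max()))

Under an index of this kind, a count of 1 or 2 indicates a highly selective neuron, whereas larger counts indicate broad responsiveness across the vocalization set.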

Presently, we cannot readily reconcile all of our results with those reported by Romanski et al. These differences may be attributable to subtle, yet significant, differences in recording locations, analysis techniques, or behavioral paradigms. However, a recent report from Romanski's group (Averbeck and Romanski 2006) is in agreement with the results of our STRF study: in that study, the STRFs of vPFC neurons were estimated and used to assess how well they predicted vPFC activity. Similar to our results, Averbeck and Romanski demonstrated that STRFs are not good predictors of vPFC activity. Nevertheless, it is clear that more work is needed to characterize fully the role of the vPFC in auditory processing.

GRANTS

Y. E. Cohen was supported by grants from the Whitehall Foundation and the National Institutes of Health (NIH) and by a Burke Award. F. E. Theunissen was supported by NIH grants.

Acknowledgments

The authors are grateful to A. Underhill for superb animal care; to K. MacLean, R. Kiringoda, and D. Jung for help with data collection; and to J. Groh, Y.-S. Lee, and H. Hersh for constructive comments. M. Hauser generously provided recordings of the rhesus vocalizations.

Footnotes

* These authors contributed equally to this work.

Address for reprint requests and other correspondence: Y. E. Cohen, Department of Psychological and Brain Sciences, Dartmouth College, Hanover, NH 03755 (E-mail: yec@dartmouth.edu).

The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

REFERENCES
