1Eaton-Peabody Laboratory, Massachusetts Eye and Ear Infirmary, Boston; 2Speech and Hearing Bioscience and Technology Program, Harvard-Massachusetts Institute of Technology Division of Health Sciences and Technology; and 3Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Massachusetts
Submitted 26 October 2004; accepted in final form 17 March 2005
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
Investigating the neural mechanisms underlying the perception of the pitch of harmonic complex tones is of great importance for a variety of reasons. Changes in pitch convey melody in music, and the superposition of different pitches is the basis for harmony. Pitch has an important role in speech, where it carries prosodic features and information about speaker identity. In tone languages such as Mandarin Chinese, pitch also cues lexical contrasts. Pitch plays a major role in auditory scene analysis: differences in pitch are a major cue for sound source segregation, while frequency components that share a common fundamental tend to be grouped into a single auditory object (Bregman 1990
; Darwin and Carlyon 1995
).
Pitch perception with missing fundamental stimuli is not unique to humans; it also occurs in birds (Cynx and Shapiro 1986
) and nonhuman mammals (Heffner and Whitfield 1976
; Tomlinson and Schwartz 1988
), making animal models suitable for studying neural representations of pitch. Pitch perception mechanisms in animals may play a role in processing conspecific vocalizations, which often contain harmonic complex tones.
The neural mechanisms underlying pitch perception of harmonic complex tones have been at the center of a debate among scientists for over a century (Ohm 1843
; Seebeck 1841
). This debate arises because the peripheral auditory system provides two types of cues to the pitch of complex tones: place cues dependent upon the frequency selectivity and tonotopic mapping of the cochlea and temporal cues dependent on neural phase locking.
The peripheral auditory system can be thought of as containing a bank of band-pass filters representing the mechanical frequency analysis performed by the basilar membrane. When two partials of a complex tone are spaced sufficiently apart relative to the auditory filter bandwidths, each of them produces an individual local maximum in the spatial pattern of basilar membrane motion. In this case, the two harmonics are said to be "resolved" by the auditory periphery. On the other hand, when two or more harmonics fall within the pass-band of a single peripheral filter, they are said to be "unresolved." Because the bandwidths of the auditory filters increase with their center frequency, only low-order harmonics are resolved. Based on psychophysical data, the first 6–10 harmonics are thought to be resolved in humans (Bernstein and Oxenham 2003b; Plomp 1964).
When a complex tone contains resolved harmonics, its pitch can be extracted by matching the pattern of activity across a tonotopic neural map to internally stored harmonic templates (Cohen et al. 1994
; Goldstein 1973
; Terhardt 1974
; Wightman 1973
). This type of model accounts for many pitch phenomena, including the pitch of the missing fundamental, the pitch shift associated with inharmonic complexes, and the pitch ambiguity of complex tones comprising only a few harmonics. However, a key issue in these models is the exact nature of the neural representation upon which the hypothetical template matching mechanism operates.
Pitch percepts can also be produced by complex tones consisting entirely of unresolved harmonics. In general, though, these pitches are weaker and more dependent on phase relationships among the partials than the pitch based on resolved harmonics (Bernstein and Oxenham 2003b
; Carlyon and Shackleton 1994
; Houtsma and Smurzynski 1990
). With unresolved harmonics, there are no spectral cues to pitch, and therefore harmonic template models are not applicable. On the other hand, unresolved harmonics produce direct temporal cues to pitch because the waveform of a combination of unresolved harmonics has a period equal to that of the complex tone. These periodicity cues, which are reflected in neural phase locking, can be extracted by an autocorrelation-type mechanism (Licklider 1951
; Meddis and Hewitt 1991
; Moore 1990
; Yost 1996
), which is mathematically equivalent to an all-order interspike-interval distribution for neural spike trains. The autocorrelation model also works with resolved harmonics, since the period of the F0 is always an integer multiple of the period of any of the harmonics; this common period can be extracted by combining (e.g., summing) autocorrelation functions from frequency channels tuned to different resolved harmonics (Meddis and Hewitt 1991
; Moore 1990
).
Previous neurophysiological studies of the coding of the pitch of complex tones in the auditory nerve and cochlear nucleus have documented a robust temporal representation based on pooled interspike-interval distributions obtained by summing the interval distributions from neurons covering a wide range of characteristic frequencies (Cariani and Delgutte 1996a
,b
; Palmer 1990
; Palmer and Winter 1993
; Rhode 1995
; Shofner 1991
). This representation accounts for a wide variety of pitch phenomena, such as the pitch of the missing fundamental, the pitch shift of inharmonic tones, pitch ambiguity, the pitch equivalence of stimuli with similar periodicity, the relative phase invariance of pitch, and, to some extent, the dominance of low-frequency harmonics in pitch. Despite its remarkable effectiveness, the autocorrelation model has difficulty in accounting for the greater pitch salience of stimuli containing resolved harmonics compared to stimuli consisting entirely of unresolved harmonics (Bernstein and Oxenham 2003a
; Carlyon 1998
; Carlyon and Shackleton 1994
; Meddis and O'Mard 1997
). This issue was not addressed in previous physiological studies because they did not have a means of assessing whether individual harmonics are resolved or not. Moreover, the upper F0 limit over which the interspike-interval representation of pitch is physiologically viable has not been determined. The existence of such a limit is expected due to the degradation in neural phase locking with increasing frequency (Johnson 1980
).
In contrast to the wealth of data on the interspike-interval representation of pitch, possible rate-place cues to pitch that might be available when individual harmonics are resolved by the peripheral auditory system have rarely been investigated. The few studies that provide relevant information (Hirahara et al. 1996
; Sachs and Young 1979
; Shamma 1985a
,b
) show no evidence for rate-place cues to pitch, even at low stimulus levels where the limited dynamic range of individual neurons is not an issue. The reason for this failure could be that the stimuli used had low fundamental frequencies in the range of human voice (100–300 Hz) and therefore produced few, if any, resolved harmonics in typical experimental animals, which have poorer cochlear frequency selectivity than humans (Shera et al. 2002
). Rate-place cues to pitch might be available in animals for complex tones with higher F0s in the range of conspecific vocalizations, which corresponds to about 500–1,000 Hz for cats (Brown et al. 1978
; Nicastro and Owren 2003
; Shipley et al. 1991
). This hypothesis is consistent with a report that up to 13 harmonics of a complex tone could be resolved in the rate responses of high-CF units in the cat anteroventral cochlear nucleus (Smoorenburg and Linschoten 1977
).
In this study, we investigated the resolvability of harmonics of complex tones in the cat auditory nerve and compared the effectiveness of rate-place and interval-based representations of pitch over a much wider range of fundamental frequencies (110–3,520 Hz) than in previous studies. We found that the two representations are complementary with respect to the F0 range over which they are effective, but that neither representation is entirely satisfactory in accounting for human psychophysical data. Preliminary reports of our findings have been presented (Cedolin and Delgutte 2003
, 2005a
).
| METHODS |
|---|
|
|
|---|
Methods for recording from auditory nerve (AN) fibers in anesthetized cats are as described by Kiang et al. (1965
) and Cariani and Delgutte (1996a)
. Cats were anesthetized with Dial in urethane (75 mg/kg), with supplementary doses given as needed to maintain an areflexic state. The posterior portion of the skull was removed, and the cerebellum was retracted to expose the auditory nerve. The tympanic bullae and the middle-ear cavities were opened to expose the round window. Throughout the experiment, the cat was given injections of dexamethasone (0.26 mg/kg) to prevent brain swelling and Ringer solution (50 ml/d) to prevent dehydration.
The cat was placed on a vibration-isolated table in an electrically shielded, temperature-controlled, soundproof chamber. A silver electrode was positioned at the round window to record the compound action potential (CAP) in response to click stimuli, in order to assess the condition and stability of cochlear function.
Sound was delivered to the cat's ear through a closed acoustic assembly driven by an electrodynamic speaker (Realistic 401377). The acoustic system was calibrated to allow accurate control over the sound-pressure level at the tympanic membrane. Stimuli were generated by a 16-bit D/A converter (Concurrent DA04H) using sampling rates of 20 or 50 kHz. Stimuli were digitally filtered to compensate for the transfer characteristics of the acoustic system.
Spikes were recorded with glass micropipettes filled with 2 M KCl. The electrode was inserted into the nerve and mechanically advanced using a micropositioner (Kopf 650). The electrode signal was band-pass filtered and fed to a custom spike detector. The times of spike peaks were recorded with 1-µs resolution and saved to disk for subsequent analysis.
A click stimulus at 55 dB SPL was used to search for single units. Upon contact with a fiber, a frequency tuning curve was measured by an automatic tracking algorithm (Kiang et al. 1970
) using 100-ms tone bursts, and the characteristic frequency (CF) was determined. The spontaneous firing rate (SR) of the fiber was measured over an interval of 20 s. The responses to complex-tone stimuli were then studied.
Complex-tone stimuli
Stimuli were harmonic complex tones whose F0 was stepped up and down over a two-octave range. The harmonics of each complex tone were all of equal amplitude, and the fundamental component was always missing. Depending on the fiber's CF, one of four presynthesized stimuli covering different F0 ranges was selected so that some of the harmonics would likely be resolved (Table 1). For example, for a fiber with a 1,760-Hz CF, we typically used F0s ranging from 220 to 880 Hz so that the order of the harmonic closest to the CF would vary from 2 to 8. In each of the four stimuli, the harmonics were restricted to a fixed frequency region as F0 varied (Table 1). For each fiber, the stimulus was selected so that the CF fell approximately at the center of the frequency region spanned by the harmonics. In some cases, data were collected from the same fiber in response to two different stimuli whose harmonics spanned overlapping frequency ranges.
We used mostly low and moderate stimulus levels in order to minimize rate saturation, which would prevent us from accurately assessing harmonic resolvability by the cochlea. Specifically, the sound pressure level of each harmonic was initially set at 15–20 dB above the fiber's threshold for a pure tone at CF and ranged from 10 to 70 dB SPL, with a median of 25 dB SPL. Because our stimuli contain many harmonics, overall stimulus levels are about 5–10 dB higher than the level of each harmonic, depending on F0. In some cases, responses were measured for two or more stimulus levels differing by 10–20 dB.
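The gap between overall level and per-harmonic level follows from power summation: N equal-amplitude harmonics carry N times the power of one harmonic, so the overall level is higher by 10·log10(N) dB. The short sketch below illustrates this arithmetic. The harmonic counts are hypothetical examples, not values taken from Table 1.

```python
import math


def overall_level_offset_db(num_harmonics: int) -> float:
    """Level of N equal-amplitude harmonics relative to a single harmonic, in dB."""
    if num_harmonics < 1:
        raise ValueError("need at least one harmonic")
    # Powers of distinct sinusoidal components add, so the offset is 10*log10(N).
    return 10.0 * math.log10(num_harmonics)


# Hypothetical harmonic counts (the real counts depend on F0 and the stimulus band).
for n in (3, 5, 10):
    print(n, "harmonics ->", round(overall_level_offset_db(n), 1), "dB above one harmonic")
```

Under this reading, an offset of 5 to 10 dB corresponds to roughly 3 to 10 harmonics in the stimulus band.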
To compare neural responses to psychophysical data on the phase dependence of pitch, three versions of each stimulus were generated with different phase relationships among the harmonics: cosine phase, alternating (sine-cosine) phase, and negative Schroeder phase (Schroeder 1970
). The three stimuli have the same power spectrum and autocorrelation function, but differ in their temporal fine structure and envelope: while the cosine-phase and alternating-phase stimuli have very "peaky" envelopes, the envelope of the Schroeder-phase stimulus is nearly flat (Fig. 1). Moreover, the envelope periodicity is at F0 for the cosine-phase stimulus, but at 2 × F0 for the alternating-phase stimulus. Alternating-phase stimuli have been widely used in previous studies of neural coding (Horst et al. 1990
; Palmer and Winter 1992
, 1993
).
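As a rough illustration of the three phase conditions, the sketch below synthesizes one period of a missing-fundamental complex and compares crest factors (peak divided by r.m.s.). The phase conventions are assumptions: odd harmonics in sine phase and even harmonics in cosine phase for the alternating condition, and a phase of -πk(k+1)/N for the k-th of N components in the negative Schroeder condition. The F0, harmonic range, and sampling rate are likewise illustrative.

```python
import math


def harmonic_complex(f0, harmonics, phases, fs, duration):
    """Sum of unit-amplitude cosines at the given harmonic numbers and phases."""
    n = int(round(fs * duration))
    return [
        sum(math.cos(2.0 * math.pi * h * f0 * t / fs + ph)
            for h, ph in zip(harmonics, phases))
        for t in range(n)
    ]


def cosine_phases(harmonics):
    return [0.0 for _ in harmonics]


def alternating_phases(harmonics):
    # Assumed convention: odd harmonics in sine phase, even harmonics in cosine phase.
    return [-math.pi / 2.0 if h % 2 else 0.0 for h in harmonics]


def schroeder_phases(harmonics, sign=-1.0):
    # Assumed convention: phase = sign * pi * k * (k + 1) / N for component k = 1..N.
    n = len(harmonics)
    return [sign * math.pi * k * (k + 1) / n for k in range(1, n + 1)]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def crest_factor(x):
    return max(abs(v) for v in x) / rms(x)


harm = list(range(2, 22))  # harmonics 2..21; the fundamental is missing
for name, fn in (("cosine", cosine_phases),
                 ("alternating", alternating_phases),
                 ("Schroeder", schroeder_phases)):
    wave = harmonic_complex(100.0, harm, fn(harm), 20000.0, 0.01)
    print(name, "rms:", round(rms(wave), 3), "crest factor:", round(crest_factor(wave), 2))
```

All three waveforms share the same r.m.s. value because only the phases differ, while the crest factor separates the peaky cosine-phase waveform from the flatter Schroeder-phase one.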
For each step in the F0 sequence, spikes were counted over a 180-ms window extending over the stimulus duration but excluding the transition period between F0 steps. Spike counts from the two stimulus segments having the same F0 (from the ascending and descending parts of the F0 sequence) were added together because responses to both directions were generally similar. The spike counts were converted to units of discharge rate (spikes/s) and plotted either as a function of F0 for a given fiber or as a function of fiber CF for a given F0 to form a "rate-place profile" (Sachs and Young 1979
).
To assess the statistical reliability of these discharge rate estimates, "bootstrap" resampling (Efron and Tibshirani 1993
) was performed on the data recorded from each fiber. One hundred resampled data sets were generated by drawing with replacement from the set of spike trains in response to each F0. Spike counts in the ascending and descending part of the F0 sequence were drawn independently from each other. Spike counts from each bootstrap data set were converted to discharge rate estimates as for the original data, and the standard deviation of these estimates was used as an error bar for the mean discharge rate.
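The resampling procedure can be sketched as follows. This is a minimal illustration, assuming the data for one F0 are available as lists of per-segment spike counts for the ascending and descending parts of the sequence. The function names, the data layout, and the random seed are hypothetical and not taken from the original analysis code.

```python
import random
import statistics


def discharge_rate(counts, window_s):
    """Mean discharge rate (spikes/s) from per-segment spike counts."""
    return sum(counts) / (len(counts) * window_s)


def bootstrap_rate_sd(up_counts, down_counts, window_s=0.180, n_boot=100, seed=0):
    """Standard deviation of the rate estimate over bootstrap resamplings.

    Ascending and descending segments are resampled independently,
    each by drawing with replacement from its own set of counts.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        up = rng.choices(up_counts, k=len(up_counts))
        down = rng.choices(down_counts, k=len(down_counts))
        rates.append(discharge_rate(up + down, window_s))
    return statistics.stdev(rates)


# Hypothetical spike counts in 180-ms windows for one F0.
up = [21, 18, 25, 19]
down = [20, 23, 17, 22]
print("rate:", round(discharge_rate(up + down, 0.180), 1), "spikes/s")
print("bootstrap SD:", round(bootstrap_rate_sd(up, down), 2), "spikes/s")
```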
Simple phenomenological models were used to analyze average-rate responses to the complex-tone stimuli. Specifically, a single-fiber model was fit to responses of a given fiber as a function of stimulus F0 to quantify harmonic resolvability, while a population model was used to estimate pitch from profiles of average discharge rate against CF for a given F0.
The single-fiber model (Fig. 2) is a cascade of three stages. The linear band-pass filtering stage, representing cochlear frequency selectivity, is implemented by a symmetric rounded exponential function (Patterson 1976
). The model of Sachs and Abbas (1974
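) of rate-level functions is then used to derive the mean discharge rate r from the r.m.s. amplitude p at the output of the band-pass filter

$$r = r_{sp} + r_{dmax}\,\frac{p^{\alpha}}{p^{\alpha} + p_{50}^{\alpha}} \qquad (1)$$

where rsp is the spontaneous rate, rdmax is the maximum driven rate, and p50 is the value of p at which the driven rate reaches half of its maximum. The exponent α was fixed at 1.77 to obtain a dynamic range of about 20 dB (Sachs and Abbas 1974).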
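The first two stages of the single-fiber model can be sketched in a few lines. The sketch assumes a one-parameter rounded-exponential power weighting of the form (1 + s·g)·exp(−s·g), with g the normalized distance from the filter center, and a saturating rate-level function of the form r = rsp + rdmax·p^α / (p^α + p50^α) with α = 1.77. The slope value, the rate parameters, and the harmonic range are illustrative assumptions, not fitted values, and the restriction of harmonics to a fixed frequency region is ignored for simplicity.

```python
import math


def roex_weight(freq_hz, cf_hz, slope):
    """Symmetric rounded-exponential power weighting (assumed form)."""
    g = abs(freq_hz - cf_hz) / cf_hz  # normalized deviation from the filter center
    return (1.0 + slope * g) * math.exp(-slope * g)


def filter_output_rms(f0_hz, cf_hz, slope, amp=1.0, max_harmonic=60):
    """R.m.s. amplitude at the filter output for a missing-fundamental complex."""
    power = 0.0
    for k in range(2, max_harmonic + 1):  # harmonic 1 (the fundamental) is absent
        power += 0.5 * amp ** 2 * roex_weight(k * f0_hz, cf_hz, slope)
    return math.sqrt(power)


def sachs_abbas_rate(p, r_sp, r_dmax, p50, alpha=1.77):
    """Saturating rate-level function: r_sp + r_dmax * p^a / (p^a + p50^a)."""
    if p <= 0.0:
        return r_sp
    pa = p ** alpha
    return r_sp + r_dmax * pa / (pa + p50 ** alpha)


def model_rate(f0_hz, cf_hz, slope=30.0, r_sp=10.0, r_dmax=150.0, p50=0.5):
    """Predicted mean rate of a fiber with the given CF for a complex tone at F0."""
    return sachs_abbas_rate(filter_output_rms(f0_hz, cf_hz, slope), r_sp, r_dmax, p50)


cf = 2000.0
for ratio in (2.5, 3.0, 3.5, 4.0):  # CF/F0: integers put a harmonic at CF
    print("CF/F0 =", ratio, "rate =", round(model_rate(cf / ratio, cf), 1), "spikes/s")
```

With a sufficiently sharp filter, the predicted rate is higher when CF/F0 is an integer than when it falls halfway between two integers, which is the signature of resolved harmonics that the model is used to quantify.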
While the population model has no free parameters, five fixed (i.e., stimulus-independent) parameters still need to be specified for each fiber in the modeled population. These parameters were selected so as to meet two separate requirements: 1) the model's normalized driven rate must vary smoothly with CF, and 2) the model must completely specify the Poisson distribution of spike counts for each fiber so as to be able to apply the maximum-likelihood method. To meet these requirements, three of the population-model parameters were directly obtained from the corresponding parameters for the single-fiber model: the center frequency of the band-pass filter (effectively the CF), the spontaneous rate rsp, and the maximum driven rate rdmax. The sensitivity parameter p50 in the population model was set to the median value of this parameter over our fiber sample. Finally, the bandwidth of the band-pass filter was derived from its center frequency by assuming a power law relationship between the two (Shera et al. 2002
). The parameters of this power function were obtained by fitting a straight line in double logarithmic coordinates to a scatter plot of filter bandwidth against center frequency for our sample of fibers.
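Fitting a straight line in double logarithmic coordinates amounts to ordinary least squares on log-transformed data, with the slope giving the power-law exponent. The sketch below shows one way to do this. The synthetic bandwidth values and the function name are hypothetical and serve only to check that a known exponent is recovered.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b, done as a line fit in log-log coordinates.

    Returns (a, b). All x and y values must be positive.
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx               # exponent b
    intercept = my - slope * mx     # log10(a)
    return 10.0 ** intercept, slope


# Hypothetical center frequencies (Hz) and bandwidths following an exact power law.
cfs = [500.0, 1000.0, 2000.0, 4000.0, 8000.0]
bws = [0.8 * cf ** 0.6 for cf in cfs]
a, b = fit_power_law(cfs, bws)
print("scale:", round(a, 3), "exponent:", round(b, 3))
```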
Interspike-interval analysis
As in previous studies of the neural coding of pitch (Cariani and Delgutte 1996a
,b
; Rhode 1995
), we derived pitch estimates from pooled interspike-interval distributions. The pooled interval distribution is the sum of the all-order interspike-interval distributions for all the sampled auditory-nerve fibers and is closely related to the summary autocorrelation in the model of Meddis and Hewitt (1991
). The single-fiber interval distribution (bin width 0.1 ms) was computed for each F0 using spikes occurring in the same time window as used in the rate analysis.
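An all-order interval distribution counts the intervals between every pair of spikes, not just consecutive ones, and the pooled distribution is the bin-by-bin sum over fibers. A minimal sketch follows, assuming spike times in milliseconds. The 0.1-ms bin width comes from the text, while the maximum interval and the function names are assumptions made for illustration.

```python
def all_order_interval_histogram(spike_times_ms, bin_width_ms=0.1, max_interval_ms=20.0):
    """Histogram of intervals between all spike pairs, up to max_interval_ms.

    Bin i is centered on i * bin_width_ms.
    """
    n_bins = int(round(max_interval_ms / bin_width_ms)) + 1
    hist = [0] * n_bins
    times = sorted(spike_times_ms)
    for i, t0 in enumerate(times):
        for t1 in times[i + 1:]:
            interval = t1 - t0
            if interval > max_interval_ms:
                break  # later spikes are even farther away
            hist[int(round(interval / bin_width_ms))] += 1
    return hist


def pooled_interval_histogram(spike_trains_ms, bin_width_ms=0.1, max_interval_ms=20.0):
    """Bin-by-bin sum of the all-order histograms of several fibers."""
    n_bins = int(round(max_interval_ms / bin_width_ms)) + 1
    pooled = [0] * n_bins
    for train in spike_trains_ms:
        single = all_order_interval_histogram(train, bin_width_ms, max_interval_ms)
        pooled = [a + b for a, b in zip(pooled, single)]
    return pooled


# A perfectly periodic train (2-ms period) yields counts only at multiples of 2 ms.
hist = all_order_interval_histogram([0.0, 2.0, 4.0, 6.0])
print({round(i * 0.1, 1): c for i, c in enumerate(hist) if c})
```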
To derive pitch estimates from pooled interval distributions, we used "periodic templates" that select intervals at a given period and its multiples. Specifically, we define the contrast ratio of a periodic template as the ratio of the weighted mean number of intervals for bins within the template to the weighted mean number of intervals per bin in the entire histogram. The estimated pitch period is the period of the template that maximizes the contrast ratio. In computing the contrast ratio, each interval is weighted by an exponentially decaying function of its length to give greater weight to short intervals. This weighting implements the idea that the lower F0 limit of pitch at about 30 Hz (Pressnitzer et al. 2001
) implies that the auditory system is unable to use very long intervals in forming pitch percepts. A 3.6-ms decay time constant was found empirically to minimize the number of octave and suboctave errors in pitch estimation. The statistical reliability of the pitch estimates was assessed by generating 100 bootstrap replications of the pooled interval distribution (using the same resampling techniques as in the rate analysis) and computing a pitch estimate for each bootstrap replication.
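The template-based estimate can be sketched as below. The description leaves some room for interpretation, so this sketch adopts one reading: each bin count is multiplied by exp(−t/τ) with τ = 3.6 ms, and the contrast ratio is the mean weighted count over template bins divided by the mean weighted count over all bins. The function names, the candidate grid, and the synthetic histogram are hypothetical.

```python
import math


def contrast_ratio(hist, period_ms, bin_width_ms=0.1, tau_ms=3.6):
    """Contrast ratio of a periodic template (one reading of the definition).

    hist[i] is the interval count in the bin centered on i * bin_width_ms.
    """
    weighted = [c * math.exp(-(i * bin_width_ms) / tau_ms) for i, c in enumerate(hist)]
    overall_mean = sum(weighted) / len(weighted)
    template_bins = []
    k = 1
    while True:
        idx = int(round(k * period_ms / bin_width_ms))  # bin at k times the period
        if idx >= len(hist):
            break
        template_bins.append(idx)
        k += 1
    if not template_bins or overall_mean == 0.0:
        return 0.0
    template_mean = sum(weighted[i] for i in template_bins) / len(template_bins)
    return template_mean / overall_mean


def estimate_pitch_period(hist, candidate_periods_ms, bin_width_ms=0.1, tau_ms=3.6):
    """Period of the template with the largest contrast ratio."""
    return max(candidate_periods_ms,
               key=lambda t: contrast_ratio(hist, t, bin_width_ms, tau_ms))


# Synthetic histogram: flat baseline with peaks at multiples of 2 ms (F0 = 500 Hz).
hist = [10 if (i > 0 and i % 20 == 0) else 1 for i in range(201)]
candidates = [k / 10.0 for k in range(5, 51)]  # 0.5 to 5.0 ms in 0.1-ms steps
period = estimate_pitch_period(hist, candidates)
print("estimated period:", period, "ms; estimated F0:", 1000.0 / period, "Hz")
```

In this reading, the exponential weighting is what separates the true period from its multiples: a template at twice the period also lands only on peaks, but those peaks sit at longer intervals and therefore carry less weight.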
| RESULTS |
|---|
|
|
|---|
Single-fiber cues to resolved harmonics
Figure 3 shows the average discharge rate as a function of complex-tone F0 (harmonics in cosine phase) for two AN fibers with CFs of 952 (A) and 4,026 Hz (B), respectively. Data are plotted against the dimensionless ratio of fiber CF to stimulus F0, which we call harmonic number (lower horizontal axis). Because this ratio varies inversely with F0, F0 increases from right to left along the top axis in these plots. The harmonic number takes an integer value when the CF coincides with one of the harmonics of the stimulus, while it is an odd integer multiple of 0.5 (2.5, 3.5, etc.) when the CF falls halfway between two harmonics. Thus resolved harmonics should appear as peaks in firing rate for integer values of the harmonic number, with valleys in between. This prediction is verified for both fibers at lower values of the harmonic number (higher F0s), although the oscillations are more pronounced and extend to higher harmonic numbers for the high-CF fiber than for the low-CF fiber. This observation is consistent with the higher quality factor (Q = CF/Bandwidth) of high-CF fibers compared with low-CF fibers (Kiang et al. 1965
; Liberman 1978
).
Figure 4 shows how F0min varies with CF for our entire sample of fibers. To be included in this plot, the variance of the residuals after fitting the single-fiber model to the data had to be significantly smaller (P < 0.05, F-test) than the variance of the raw data so that Nmax (and therefore F0min) could be reliably estimated. Thirty-five of 122 measurements were thus excluded; 23 of these had CFs <2,000 Hz. On the other hand, the figure includes data from fibers (shown by triangles) for which F0min was bounded by the lowest F0 presented and was therefore overestimated. F0min increases systematically with CF, and the increase is well fit by a power function with an exponent of 0.63 (solid line). This increase is consistent with the increase in tuning curve bandwidths with CF (Kiang et al. 1965
).
To more directly address the level dependence of responses, we held 24 fibers long enough to record the responses to harmonic complex tones at two or more stimulus levels differing by 10–20 dB. In 23 of these 24 cases, the maximum resolved harmonic number Nmax decreased with increasing level. One example is shown in Fig. 5 for a fiber with CF at 1,983 Hz. For this fiber, Nmax decreased from 7.1 at 20 dB SPL to 4.9 at 30 dB SPL.
Pitch estimation from rate-place profiles
Having characterized the limits of harmonic resolvability in rate responses of AN fibers, the next step is to determine how accurately pitch can be estimated from rate-place cues to resolved harmonics. For this purpose, we fit harmonic templates to profiles of average discharge rate against CF and derive pitch estimates by the maximum likelihood method, assuming that the spike counts from each fiber are random variables with statistically independent Poisson distributions. In our implementation, a harmonic template is the response of a peripheral auditory model to a complex tone with equal-amplitude harmonics. The estimated pitch is therefore the F0 of the complex tone most likely to have produced the observed response if the stimulus-response relationship were defined by the model.
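The maximum-likelihood step can be illustrated with a small grid search. For independent Poisson counts n_i with expected values λ_i(F0), the log-likelihood is the sum of n_i·log λ_i(F0) − λ_i(F0), up to a term that does not depend on F0. The peripheral model used here as the template is a deliberately simplified stand-in (a rounded-exponential filter followed by a saturating rate function), and all of its parameter values, the counting duration, and the F0 grid are assumptions made for illustration.

```python
import math


def expected_count(cf_hz, f0_hz, duration_s=0.36, slope=30.0,
                   r_sp=10.0, r_dmax=150.0, p50=0.5, alpha=1.77):
    """Template value: expected spike count of a fiber with this CF for this F0."""
    power = 0.0
    for k in range(2, 61):  # equal-amplitude harmonics, fundamental missing
        g = abs(k * f0_hz - cf_hz) / cf_hz
        power += 0.5 * (1.0 + slope * g) * math.exp(-slope * g)
    pa = math.sqrt(power) ** alpha
    rate = r_sp + r_dmax * pa / (pa + p50 ** alpha)
    return rate * duration_s


def poisson_log_likelihood(counts, cfs_hz, f0_hz):
    """Log-likelihood of the counts given F0, dropping the F0-independent log(n!) term."""
    total = 0.0
    for n, cf in zip(counts, cfs_hz):
        lam = expected_count(cf, f0_hz)
        total += n * math.log(lam) - lam
    return total


def ml_pitch(counts, cfs_hz, f0_grid_hz):
    """F0 on the grid that maximizes the Poisson log-likelihood."""
    return max(f0_grid_hz, key=lambda f0: poisson_log_likelihood(counts, cfs_hz, f0))


# Hypothetical population: 40 fibers with CFs log-spaced between 1 and 4 kHz.
cfs = [1000.0 * 4.0 ** (i / 39.0) for i in range(40)]
counts = [round(expected_count(cf, 600.0)) for cf in cfs]  # noiseless "data" at 600 Hz
grid = [float(f) for f in range(500, 701, 10)]
print("ML pitch estimate:", ml_pitch(counts, cfs, grid), "Hz")
```

A grid search is used here only for transparency; any one-dimensional optimizer could replace it, and real data would add Poisson variability to the counts.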
Figure 6 shows the normalized driven discharge rate of AN fibers as a function of CF in response to two complex tones (harmonics in cosine phase) with F0s of 541.5 (A) and 1,564.4 Hz (C). The rate is normalized by subtracting the spontaneous rate and dividing by the maximum driven rate (Sachs and Young 1979
), and these parameters are estimated by fitting the single-fiber model to the rate-F0 data. As for the single-fiber responses in Figs. 3 and 5, responses are plotted against the dimensionless harmonic number CF/F0, with the difference that F0 is now fixed while CF varies, instead of the opposite. Resolved harmonics should again result in peaks in firing rate at integer values of the harmonic number. Despite considerable scatter in the data, this prediction is verified for both F0s, although the oscillations are more pronounced for the higher F0. Many factors are likely to contribute to the scatter, including the threshold differences among fibers with the same CF (Liberman 1978
), pooling data from two animals, intrinsic variability in neural responses, and inaccuracies in estimating the minimum and maximum discharge rates used in computing the normalized rate.
To assess the reliability of the maximum-likelihood pitch estimates, estimates were computed for 100 bootstrap resamplings of the data for each F0 (see METHODS). Figure 7A shows the median absolute estimation error of these bootstrap estimates as a function of F0 for complex tones with harmonics in cosine phase. With few exceptions, median pitch estimates only deviate by a few percent from the stimulus F0 above 500 Hz. Larger deviations are more common for lower F0s. The number and CF distribution of the fibers had to meet certain constraints for each F0 to be included in the figure because, to reliably estimate F0, the sampling of the CF axis has to be sufficiently dense to capture the harmonically related oscillations in the rate-CF profiles. This is why Fig. 7 shows no estimates for F0s below 220 Hz and for a small subset of F0s (12 of 56) above 220 Hz.
Harmonic templates were fit to rate-place profiles obtained in response to complex tones with harmonics in alternating phase and in Schroeder phase as well as in cosine phase to test whether the pitch estimates depend on phase. Figure 8 shows an example for an F0 of 392 Hz. The numbers of data points differ somewhat for the three phase conditions because we could not always "hold" a unit sufficiently long to measure responses to all three conditions. Despite these sampling differences, the pitch estimates for the three phase conditions are similar to each other (Fig. 8, A–C) and similar to the pitch estimate obtained by combining data across all three phase conditions (Fig. 8D).
This test was performed for three different values of F0 (612, 670, and 828 Hz) in addition to the 392-Hz case shown in Fig. 8. In three of these four cases, the results were as in Fig. 8 in that the differences in maximum likelihoods for the two models did not reach statistical significance (P > 0.05). For 612 Hz, the comparison did reach significance (P = 0.007), but for this F0, the rate-place profiles for harmonics in alternating and Schroeder phase showed large gaps in the distribution of data points over harmonic numbers, making the reliability of the F0 estimates for these two phases questionable. When the actual pitch estimates for the different phase conditions were compared, there was no clear pattern to the results across F0s, i.e., the pitch estimate for any given phase condition could be the largest in one case and the smallest in another case. These results indicate that phase relationships among the partials of a complex tone do not seem to greatly influence the pitch estimated from rate-place profiles, consistent with psychophysical data on the phase invariance of pitch based on resolved harmonics (Houtsma and Smurzynski 1990
).
Pitch estimation from pooled interspike-interval distributions
Pitch estimates were derived from pooled interspike-interval distributions to compare the accuracy of these estimates with that of rate-place estimates for the same stimuli. Figure 9, A and B, shows pooled all-order interspike-interval distributions for two complex-tone stimuli with F0s of 320 and 880 Hz (harmonics in cosine phase). For both F0s, the pooled distributions show modes at the period of F0 and its integer multiples. However, these modes are less prominent at the higher F0 for which only the first few harmonics are located in the range of robust phase locking.
We therefore modified our pitch estimation method to make use of all pitch-related modes in the pooled interval distribution rather than just the first one. Specifically, we used periodic templates that select intervals at a given period and its multiples and determined the template F0 which maximizes the contrast ratio, a signal-to-noise ratio measure of the number of intervals within the template relative to the mean number of intervals per bin (see METHODS). When computing the contrast ratio, short intervals were weighted more than long intervals according to an exponentially decaying weighting function of interval length. This weighting implements the psychophysical observation of a lower limit of pitch near 30 Hz (Pressnitzer et al. 2001
) by preventing long intervals from contributing significantly to pitch. Figure 9, C and D, shows the template contrast ratio as a function of template F0 for the same two stimuli as on top. For both stimuli, the contrast ratio reaches an absolute maximum when the template F0 is very close to the stimulus F0, although the peak contrast ratio is larger for the lower F0. The contrast ratio also shows local maxima one octave above and below the stimulus F0. In Fig. 9C, these secondary maxima are small relative to the main peak at F0, but in Fig. 9D, the maximum at F0/2 is almost as large as the one at F0. Despite the close call, F0 was correctly estimated in both cases of Fig. 9, and overall, our pitch estimation algorithm produced essentially no octave or sub-octave errors over the entire range of F0 investigated (110–3,520 Hz).
Figure 10 shows measures of the accuracy and strength of the interval-based pitch estimates as a function of F0 for harmonics in cosine phase. The accuracy measure is the median absolute value of the pitch estimation error over bootstrap replications of the pooled interval distributions. The estimates are highly accurate below 1,300 Hz, where their medians are within 1–2% of the stimulus F0 (Fig. 10A). However, the interval-based estimates of pitch abruptly break down near 1,300 Hz. While the existence of such an upper limit is consistent with the degradation in phase locking at high frequencies, the location of this limit at 1,300 Hz is low compared with the 4- to 5-kHz upper limit of phase locking, a point to which we return in the DISCUSSION.
For a few F0s, interval-based estimates of pitch were derived for complex tones with harmonics in alternating phase and in Schroeder phase as well as for harmonics in cosine phase. Figure 11 compares the pooled all-order interval distributions in the three phase conditions for two F0s: 130 (left) and 612 Hz (right). Based on the rate-place results, the harmonics of the 130-Hz F0 are not resolved, whereas some of the harmonics of the 612-Hz F0 are resolved. This is because we obtained a reliable pitch estimate based on rate-place profiles at 612 Hz but not at 130 Hz (Fig. 7).
The interval-based pitch estimates are nearly identical for all three phase conditions, but the maximum contrast ratio is substantially lower for harmonics in alternating phase than for harmonics in cosine or in Schroeder phase (Fig. 11D). In addition, for harmonics in alternating phase, the contrast ratio of the periodic template at the envelope frequency 2 × F0 is almost as large as the contrast ratio at F0. In contrast, for the higher F0 (612 Hz), there are no obvious differences between phase conditions in the pooled all-order interval distributions (Fig. 11, E–G). In particular, the secondary peaks at half the period of F0, which were found at 130 Hz for the alternating-phase stimulus, are no longer present at 612 Hz. Moreover, the maximum contrast ratios are essentially the same for all three phase conditions (Fig. 11H).
Overall, these results show that, while phase relationships among harmonics have little effect on the pitch values estimated from pooled interval distributions, which are always close to the stimulus F0, the salience of these estimates can be significantly affected by phase when harmonics are unresolved. These results are consistent with psychophysical results showing a greater effect of phase on pitch and pitch salience for stimuli consisting of unresolved harmonics than for stimuli containing resolved harmonics (Houtsma and Smurzynski 1990
; Shackleton and Carlyon 1994
). However, these results fail to account for the observation that the dominant pitch is often heard at the envelope frequency 2 × F0 for unresolved harmonics in alternating phase.
| DISCUSSION |
|---|
|
|
|---|
We examined the response of cat AN fibers to complex tones with a missing fundamental and equal-amplitude harmonics. We used low and moderate stimulus levels (15–20 dB above threshold) to minimize rate saturation that would prevent us from accurately assessing cochlear frequency selectivity and therefore harmonic resolvability from rate responses. In general, the average discharge rate of a single AN fiber was higher when its CF was near a low-order harmonic of a complex tone than when the CF fell halfway between two harmonics (Fig. 3). This trend could be predicted using a phenomenological model of single-fiber rate responses incorporating a band-pass filter representing cochlear frequency selectivity (Fig. 2). The amplitude of the oscillations in the response of the best-fitting single-fiber model, relative to the typical variability in the data, gave an estimate of the lowest F0 of complex tones whose harmonics are resolved at a given CF (Fig. 3). This limit, which we call F0min, increases systematically with CF, and this increase is well fit by a power function with an exponent of 0.63 (Fig. 4). That the exponent is less than 1 is consistent with the progressive sharpening of peripheral tuning with increasing CF when expressed as a Q factor, the ratio CF/Bandwidth. The exponent for Q would be 0.37, which closely matches the 0.37 exponent found by Shera et al. (2002
) for the CF dependence of Q10 in pure-tone tuning curves from AN fibers in the cat.
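The step from the F0min exponent to the Q exponent can be written out explicitly. The sketch below assumes that the filter bandwidth relevant for resolvability scales in proportion to F0min, which is what the comparison with Q10 implicitly requires:

$$
F0_{min} \propto CF^{0.63}
\;\Rightarrow\;
N_{max} = \frac{CF}{F0_{min}} \propto CF^{\,1-0.63} = CF^{\,0.37},
\qquad
Q = \frac{CF}{\text{Bandwidth}} \propto \frac{CF}{F0_{min}} \propto CF^{\,0.37}.
$$

An exponent below 1 for F0min therefore corresponds to a positive exponent for Q, i.e., to relative tuning that sharpens as CF increases.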
Our definition of the lower limit of resolvability F0min is to some extent arbitrary because it depends on the variability in the average discharge rates, which in turn depends on the number of stimulus repetitions and the duration of the stimulus. Nevertheless, our results are consistent with those of Wilson and Evans (1971) for AN fibers in the guinea pig using ripple noise (comb-filtered noise), a stimulus with broad spectral maxima at harmonically related frequencies. These authors found that the number of such maxima that can be resolved in the rate responses of single fibers (equivalent to our Nmax) increases with CF from 2–3 at 200 Hz to about 10 at 10 kHz and above. Similarly, Smoorenburg and Linschoten (1977) reported that the number of harmonics of a complex tone that are resolved in the rate responses of single units in the cat anteroventral cochlear nucleus (AVCN) increases from 2 at 250 Hz to 13 at 10 kHz. Despite the different metrics used to define resolvability, both studies are in good agreement with the data of Fig. 4 if we use the conversion F0min = CF/Nmax.
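The conversion F0min = CF/Nmax can be applied directly to the values quoted above. The short sketch below does only that arithmetic; the function name is hypothetical, and no comparison with the fitted power function is attempted because its coefficient is not given in this section.

```python
def f0min_from_nmax(cf_hz, n_max):
    """Lowest resolved F0 (Hz) implied by Nmax resolved harmonics at a given CF."""
    return cf_hz / n_max


# (label, CF in Hz, Nmax) as quoted in the text
quoted_values = [
    ("Smoorenburg and Linschoten 1977, CF = 250 Hz", 250.0, 2),
    ("Smoorenburg and Linschoten 1977, CF = 10 kHz", 10_000.0, 13),
    ("Wilson and Evans 1971, CF = 10 kHz", 10_000.0, 10),
]

for label, cf_hz, n_max in quoted_values:
    print(f"{label}: F0min = {f0min_from_nmax(cf_hz, n_max):.0f} Hz")
```

With these inputs, the implied F0min rises from 125 Hz at a CF of 250 Hz to roughly 770–1,000 Hz at 10 kHz.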
Consistent with a previous report for AVCN neurons (Smoorenburg and Linschoten 1977), we found that the ability of AN fibers to resolve harmonics in their rate response degrades rapidly with increasing stimulus level. This degradation could be due either to the broadening of cochlear tuning with increasing level or to saturation of the average rate. Saturation seems to be the more likely explanation because a single-fiber model with level-dependent bandwidth did not fit the data significantly better than a model with fixed bandwidth. However, the level dependence of cochlear filter bandwidths might have a greater effect on responses to complex tones if level were varied over a wider range than the 10–20 dB used here (Cooper and Rhode 1997; Ruggero et al. 1997).
Rate-place representation of pitch
A major finding is that the pitch of complex tones could be reliably and accurately estimated from rate-place profiles for fundamental frequencies above 400–500 Hz by fitting a harmonic template to the data (Figs. 6 and 7, A and B). The harmonic template was implemented as the response of a simple peripheral auditory model to a harmonic complex tone with equal-amplitude harmonics, and the estimated pitch was the F0 of the complex tone most likely to have produced the rate-place data assuming that the stimulus-response relationship is characterized by the model. Despite the nonuniform sampling of CFs and the moderate number of fibers sampled at each F0 (typically 20–40), these pitch estimates were accurate within a few percent.
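The idea of fitting a harmonic template to a rate-place profile can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the Gaussian filter on a log-frequency axis, the saturating rate function, all parameter values, and the use of a least-squares grid search in place of the maximum-likelihood fit described above. It is not the peripheral model used in this study.

```python
import numpy as np


def template_rate(cf_hz, f0_hz, sigma_oct=0.1, n_harm=20,
                  r_min=10.0, r_max=200.0, half_power=0.5):
    """Hypothetical template: rate of fibers with the given CFs in response
    to a missing-fundamental complex with equal-amplitude harmonics."""
    cf = np.asarray(cf_hz, dtype=float)[:, None]
    harmonics = f0_hz * np.arange(2, n_harm + 1)[None, :]  # harmonics 2..n_harm
    # Gaussian band-pass gain on a log2 frequency axis (assumed filter shape)
    gain = np.exp(-0.5 * (np.log2(harmonics / cf) / sigma_oct) ** 2)
    power = np.sum(gain ** 2, axis=1)            # power passed by each filter
    drive = power / (power + half_power)         # assumed saturating nonlinearity
    return r_min + (r_max - r_min) * drive


def estimate_f0_rate_place(cf_hz, rates, candidates_hz):
    """Return the candidate F0 whose template best matches the rate-place
    profile in the least-squares sense (a stand-in for maximum likelihood)."""
    errors = [np.sum((rates - template_rate(cf_hz, f0)) ** 2)
              for f0 in candidates_hz]
    return float(candidates_hz[int(np.argmin(errors))])


# Noise-free demonstration over the CF range sampled in this study
cfs = np.geomspace(450.0, 9200.0, 40)
profile = template_rate(cfs, 600.0)
candidates = np.arange(300.0, 1200.5, 5.0)
print(estimate_f0_rate_place(cfs, profile, candidates))
```

Because a single template is fit to the whole profile, no intermediate step estimates the frequencies of individual harmonics. With noisy data, the squared-error criterion would be equivalent to maximum likelihood only under equal-variance Gaussian noise.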
Pitch estimation became increasingly less reliable for F0s below 400–500 Hz, with large estimation errors becoming increasingly common. Nevertheless, some reliable estimates could be obtained for F0s as low as 250 Hz. This result is consistent with the failure of previous studies to identify rate-place cues to pitch in AN responses to harmonic complex tones with F0s below 300 Hz (Hirahara et al. 1996; Sachs and Young 1979; Shamma 1985a,b), although Hirahara et al. did find a weak representation of the first two to three harmonics in rate-place profiles for vowels with an F0 at 350 Hz.
In interpreting these results, it is important to keep in mind that the precision of the rate-based pitch estimates depends on many factors such as the number of fibers sampled, the CF distribution of the fibers, pooling of data from two animals, the number of stimulus repetitions, and the particular method for fitting harmonic templates. For example, since the lowest CF sampled was 450 Hz, the second harmonic and, in some cases, the third could not be represented in the rate-place profiles for F0s <220 Hz, possibly explaining why we never obtained a reliable pitch estimate in that range. In fact, because our stimuli had missing fundamentals, we cannot rule out that the fundamental might always be resolved when it is present.
In one respect, our method may somewhat overestimate the accuracy of the rate-based pitch estimates because we only included data from measurements for which the rate response as a function of F0 oscillated sufficiently to be able to reliably fit a single-fiber model. This constraint was necessary because, for responses that do not oscillate, we could not reliably estimate the minimum and maximum discharge rates that are essential in fitting harmonic templates to the rate-place data. Thirty-five of 122 responses were thus excluded. Because our design minimizes rate saturation, and because 23 of these 35 excluded responses were from fibers with CFs <2 kHz, we infer that insufficient frequency selectivity for resolving harmonics rather than rate saturation was the primary reason for the lack of F0-related oscillations in these measurements.
A factor whose effect on pitch estimation performance is hard to evaluate is that the rate-place profiles included responses to stimuli presented at different sound levels. At first sight, pooling data across levels might seem to increase response variability and therefore decrease estimation performance. However, because the stimulus level was usually selected to be 15–20 dB above the threshold of each fiber so that responses would be robust without being saturated, our procedure might actually have reduced the variability due to threshold differences among fibers. The rationale for this procedure is that an optimal central processor would focus on unsaturated fibers because these fibers are the most informative. Because level relative to threshold rather than absolute level is the primary determinant of rate responses, we are effectively invoking a form of the "selective listening hypothesis" (Delgutte 1982, 1987; Lai et al. 1994), according to which the central processor attends to low-threshold, high-spontaneous rate fibers at low levels and to high-threshold, low-spontaneous rate fibers at high levels.
Our harmonic template differs from those typically used in pattern recognition models of pitch in that it has very broad peaks at the harmonic frequencies. Most pattern recognition models (Duifhuis et al. 1982; Goldstein 1973; Terhardt 1974) use very narrow templates or "sieves," typically a few percent of each harmonic's frequency. One exception is the model of Wightman (1973), which effectively uses broad cosinusoidal templates by performing a Fourier transform operation on the spectrum. Our method also resembles the Wightman model and differs from the other models in that it avoids an intermediate, error-prone stage that estimates the frequencies of the individual resolved harmonics; rather, a global template is fit to the entire rate-place profile. Broad templates are well adapted to the measured rate-place profiles because the dips between the harmonics are often sharper than the peaks at the harmonic frequencies (Figs. 6 and 8). On the other hand, the templates are the response of the peripheral model to complex tones with equal-amplitude harmonics, which exactly match the stimuli that were presented. It remains to be seen how well such templates would work when the spectral envelope of the stimulus is unknown or when the amplitudes of the individual harmonics are roved from trial to trial, conditions that cause little degradation in psychophysical performance (Bernstein and Oxenham 2003a; Houtsma and Smurzynski 1990).
Given the uncertainties about how the various factors discussed above may affect our pitch estimation procedure, a comparison of the pitch estimation performance with psychophysical data should focus on robust overall trends as a function of stimulus parameters rather than on absolute measures of performance. Both the precision of the pitch estimates (Fig. 7A) and their salience (as measured by the Fisher information; Fig. 7B) improve with increasing F0 as the harmonics of the complex become increasingly resolved. This result is in agreement with psychophysical observations that both pitch strength and pitch discrimination performance improve as the degree of harmonic resolvability increases (Bernstein and Oxenham 2003b; Carlyon and Shackleton 1994; Houtsma and Smurzynski 1990; Plomp 1967; Ritsma 1967). However, the continued increase in Fisher information with F0 beyond 1,000 Hz conflicts with the existence of an upper limit to the pitch of missing-fundamental stimuli, which occurs at about 1,400 Hz in humans (Moore 1973b). This discrepancy between the rapid degradation in pitch discrimination at high frequencies and the lack of a concomitant degradation in cochlear frequency selectivity is a general problem for place models of pitch perception and frequency discrimination (Moore 1973a).
We also found that the relative phases of the resolved harmonics of a complex tone do not greatly influence rate-based estimates of pitch (Fig. 8). This result is consistent with expectations for a purely place representation of pitch, as well as with psychophysical results for stimuli containing resolved harmonics (Houtsma and Smurzynski 1990; Shackleton and Carlyon 1994; Wightman 1973).
The restriction of our data to low and moderate stimulus levels raises the question of whether the rate-place representation of pitch would remain robust at the higher stimulus levels typically used in speech communication or when listening to music. Previous studies have used signal detection theory to quantitatively assess the ability of rate-place information in the AN to account for behavioral performance in tasks such as intensity discrimination (Colburn et al. 2003; Delgutte 1987; Viemeister 1988; Winslow and Sachs 1988; Winter and Palmer 1991) and formant-frequency discrimination for vowels (Conley and Keilson 1995; May et al. 1996). These studies give a mixed message. On the one hand, the rate-place representation generally contains sufficient information to account for behavioral performance up to the highest sound levels tested. On the other hand, because the fraction of high-threshold fibers is small compared to low-threshold fibers, predicted performance of optimal processor models degrades markedly with increasing level, whereas psychophysical performance remains stable. Thus, while a rate-place representation cannot be ruled out, it fails to account for a major trend in the psychophysical data. Extending this type of analysis to pitch discrimination for harmonic complex tones is beyond the scope of this paper. Given the failure of the rate-place representation to account for the level dependence of performance in the other tasks, a more productive approach may be to explore alternative spatio-temporal representations that would rely on harmonic resolvability like the rate-place representation, but would be more robust with respect to level variations by exploiting phase locking (Heinz et al. 2001; Shamma 1985a). Preliminary tests of one such spatio-temporal representation are encouraging (Cedolin and Delgutte 2005b).
Interspike-interval representation of pitch
Our results confirm previous findings (Cariani and Delgutte 1996a,b; Palmer 1990; Palmer and Winter 1993) that fundamental frequencies of harmonic complex tones are precisely represented in pooled all-order interspike-interval distributions of the AN. These interval distributions have prominent modes at the period of F0 and its integer multiples (Fig. 9, A and B). Pitch estimates derived using periodic templates that select intervals at a given period and its multiples were highly accurate (often within 1%) for F0s up to 1,300 Hz (Fig. 10). The determination of this upper limit to the interval-based representation of pitch is a new finding. Moreover, the use of periodic templates for pitch estimation improves on the traditional method of picking the largest mode in the interval distribution by greatly reducing suboctave errors.
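A periodic template applied to an all-order interval histogram can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study: the bin width, maximum interval, contrast definition (mean count in template bins divided by mean count over all bins), and the synthetic spike train are all hypothetical. The candidate range is deliberately kept narrower than one octave on either side of the true F0, so the sketch does not address octave ambiguities.

```python
import numpy as np

BIN_US = 100          # histogram bin width in microseconds (assumed)
MAX_LAG_US = 30_000   # longest interval retained, 30 ms (assumed)


def all_order_interval_histogram(spike_times_us):
    """Histogram of intervals between all pairs of spikes (not only
    consecutive ones), up to MAX_LAG_US. Pooling across fibers would
    amount to summing such histograms."""
    t = np.asarray(spike_times_us, dtype=np.int64)
    diffs = t[None, :] - t[:, None]
    diffs = diffs[(diffs > 0) & (diffs <= MAX_LAG_US)]
    return np.bincount(diffs // BIN_US, minlength=MAX_LAG_US // BIN_US + 1)


def template_contrast(hist, f0_hz):
    """Mean count in bins at the candidate period and its multiples,
    relative to the mean count over all bins."""
    period_us = 1e6 / f0_hz
    k = np.arange(1, int(MAX_LAG_US // period_us) + 1)
    bins = np.unique(np.floor(k * period_us / BIN_US).astype(int))
    bins = bins[bins < hist.size]
    return hist[bins].mean() / hist.mean()


def estimate_f0_intervals(hist, candidates_hz):
    """Return the candidate F0 whose periodic template has the highest contrast."""
    contrasts = [template_contrast(hist, f0) for f0 in candidates_hz]
    return float(candidates_hz[int(np.argmax(contrasts))])


# Synthetic, jitter-free spike train locked to a 400-Hz F0 (period 2,500 us),
# with every third cycle skipped to mimic a fiber that does not fire each cycle
spikes_us = np.array([k * 2500 for k in range(200) if k % 3 != 0])
hist = all_order_interval_histogram(spikes_us)
candidates = np.arange(300.0, 551.0, 10.0)
print(estimate_f0_intervals(hist, candidates))
```

Because the template averages over all multiples of the candidate period rather than relying on a single histogram mode, a chance excess of intervals at one particular multiple has less influence on the estimate.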
While the existence of an upper limit to the representation of pitch in interspike intervals is expected from the degradation in phase locking at high frequencies, the location of this limit at 1,300 Hz is low compared with the usually quoted 4- to 5-kHz limit of phase locking in the AN (Johnson 1980; Rose et al. 1967). Of course, both the limit of pitch representation and the limit of phase locking depend to some extent on the signal-to-noise ratio of the data, which in turn depends on the duration of the stimulus, the number of stimulus repetitions and, for pooled interval distributions, the number of sampled fibers. However, the discrepancy between the two limits appears too large to be entirely accounted for by differences in signal-to-noise ratio. Fortunately, the discrepancy can be largely reconciled by taking into account harmonic resolvability and the properties of our stimuli. For F0s near 1,300 Hz, all the harmonics within the CF range of our data (450–9,200 Hz) are well resolved (Fig. 4), so that information about pitch in pooled interval distributions must depend on phase locking to individu