The Journal of Neurophysiology Vol. 88 No. 1 July 2002, pp. 438-454
Copyright ©2002 by the American Physiological Society
¹Centre for Neuroscience Studies, Department of Physiology, Queen's University, Kingston, Ontario K7L 3N6, Canada; and ²Department of Biophysics, University of Nijmegen, Geert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands
ABSTRACT
Corneil, B. D., M. Van Wanrooij, D. P. Munoz, and A. J. Van Opstal. Auditory-Visual Interactions Subserving Goal-Directed Saccades in a Complex Scene. J. Neurophysiol. 88: 438-454, 2002. This study addresses the integration of auditory and visual stimuli subserving the generation of saccades in a complex scene. Previous studies have shown that saccadic reaction times (SRTs) to combined auditory-visual stimuli are reduced when compared with SRTs to either stimulus alone. However, these results have been typically obtained with high-intensity stimuli distributed over a limited number of positions in the horizontal plane. It is less clear how auditory-visual interactions influence saccades under more complex but arguably more natural conditions, when low-intensity stimuli are embedded in complex backgrounds and distributed throughout two-dimensional (2-D) space. To study this problem, human subjects made saccades to visual-only (V-saccades), auditory-only (A-saccades), or spatially coincident auditory-visual (AV-saccades) targets. In each trial, the low-intensity target was embedded within a complex auditory-visual background, and subjects were allowed over 3 s to search for and foveate the target at 1 of 24 possible locations within the 2-D oculomotor range. We varied systematically the onset times of the targets and the intensity of the auditory target relative to background [i.e., the signal-to-noise (S/N) ratio] to examine their effects on both SRT and saccadic accuracy. Subjects were often able to localize the target within one or two saccades, but in about 15% of the trials they generated scanning patterns that consisted of many saccades. The present study reports only the SRT and accuracy of the first saccade in each trial. In all subjects, A-saccades had shorter SRTs than V-saccades, but were more inaccurate than V-saccades when generated to auditory targets presented at low S/N ratios. 
AV-saccades were at least as accurate as V-saccades but were generated at SRTs typical of A-saccades. The properties of AV-saccades depended systematically on both stimulus timing and S/N ratio of the auditory target. Compared with unimodal A- and V-saccades, the improvements in SRT and accuracy of AV-saccades were greatest when the visual target was synchronous with or leading the auditory target, and when the S/N ratio of the auditory target was lowest. Further, the improvements in saccade accuracy were greater in elevation than in azimuth. A control experiment demonstrated that a portion of the improvements in SRT could be attributable to a warning-cue mechanism, but that the improvements in saccade accuracy depended on the spatial register of the stimuli. These results agree well with earlier electrophysiological results obtained from the midbrain superior colliculus (SC) of anesthetized preparations, and we argue that they demonstrate multisensory integration of auditory and visual signals in a complex, quasi-natural environment. A conceptual model incorporating the SC is presented to explain the observed data.
INTRODUCTION
Saccadic eye movements reorient gaze swiftly to a new target of
interest. Much has been learned about the neural processes underlying
the initiation of visually guided saccades (see Findlay and Walker 1999; Munoz et al. 2000 for review).
Under natural conditions, the saccadic system is typically challenged
by myriad possible targets to which gaze could be directed. Often,
these potential targets emit multisensory signals that may provide
different combinations of visual, auditory, and tactile inputs. The
integration of multisensory signals from a single event into an
orienting response is far from trivial as different sensory modalities
are transduced uniquely and encoded initially in different frames of
reference (see Sparks and Mays 1990 for review). The
oculocentric frame of reference in which saccades are represented must
be derived from retinotopic signals for visually guided saccades, and
from head-centered space for aurally guided saccades. This latter
transformation is particularly complex because the CNS constructs the
head-centered space from different acoustic cues: sound azimuth is
extracted from interaural timing and intensity disparities, and sound
elevation from monaural spectral shape cues induced by the pinnae (see Blauert 1997; Irvine 1986 for review).
There is ample experimental evidence that a combined presentation of auditory and visual stimuli reduces saccadic reaction times (SRTs) (see Colonius and Arndt 2001 for a recent review). These reductions generally exceed the predictions of the so-called "race model," which holds that combined auditory and visual stimuli are processed independently but produce shorter SRTs so long as the unimodal distributions overlap, since subjects can react to either stimulus (Raab 1962). Exceeding the race model implies that the bimodal stimuli are neurally integrated prior to saccade initiation (Hughes et al. 1994; Nozawa et al. 1994). Observed SRT reductions usually range between 10 and 50 ms and diminish as the spatial and temporal separation of the stimuli increases (Colonius and Arndt 2001; Corneil and Munoz 1996; Frens et al. 1995; Harrington and Peck 1998; Hughes et al. 1998).
The neural correlates of multisensory integration have been studied extensively in anesthetized preparations and also depend on the spatial and temporal register of the stimuli (see Stein and Meredith 1993 for review). Another important property of neurons that display multisensory integration is that of "inverse effectiveness" (Meredith and Stein 1986), whereby the smaller unimodal responses evoked by near-threshold stimulus intensities are associated with proportionally stronger multisensory integration. If similar mechanisms operate in awake preparations, then the behavioral benefits afforded by multisensory integration should also be greatest with low-intensity stimuli. Accordingly, improved orienting to low-intensity multisensory stimuli has been demonstrated in cats (Stein et al. 1989). So far, human studies using low-intensity stimuli have not demonstrated the dramatic behavioral benefits expected from inverse effectiveness: the SRT reductions afforded by pairing low-intensity stimuli usually approximate the SRT reductions afforded by pairing high-intensity stimuli (Frens et al. 1995; Hughes et al. 1994). Perhaps in these studies, the low intensities were not close enough to threshold, or the limited number of potential target locations may have allowed subjects to constrain their responses prior to stimulus onset. In addition, the auditory stimulus in some of these experiments did not serve as a potential target, but acted as a distractor that could have been ignored by the subject.
The purpose of the present study is to evaluate multisensory integration in human saccades in a complex experimental environment in which both the auditory and visual stimuli serve as potential targets. To this end, low-intensity unimodal or bimodal targets were distributed over 24 possible target locations within the two-dimensional (2-D) oculomotor range and embedded within an auditory-visual background (Fig. 1). Both the signal-to-noise (S/N) ratio of the auditory target relative to background and the temporal register of the auditory and visual targets on bimodal trials were systematically varied. Attesting to the difficulty of this task, subjects generated saccade scan patterns that consisted of anywhere from 1 to more than 10 saccades before localizing the target. This report focuses exclusively on the SRT and accuracy of the first saccade in each trial as indexes of how well the subjects initially localize the target(s). Accurate saccades at short SRTs imply a well-localized target, whereas inaccurate saccades at longer SRTs imply the opposite. Our results demonstrate that the behavioral benefits of auditory-visual integration vary systematically with the S/N ratio as predicted by inverse effectiveness, and that such benefits were greater in the elevation than in the azimuth response component. Moreover, the observed effects depended in a systematic way on the relative timing of the auditory and visual stimuli. These behavioral data are in good agreement with the rules extracted from multisensory-evoked responses of cells in the mammalian superior colliculus (SC) (Stein and Meredith 1993).
Abstracts describing some of these data have been published (Corneil et al. 2001; Van Wanrooij et al. 2000).
METHODS
Subjects
Five male subjects (ages 23-43) participated in the experiments and provided their informed consent. Experimental procedures were approved by the local ethics committee of the University of Nijmegen. All subjects were experienced with eye-movement recording protocols. Subjects JO, BC, DM, and MW are authors of this paper, although the latter three had no prior experience with sound localization studies. Subject MZ was naive as to the purpose of the study. All subjects had normal hearing, as determined by audiograms of both ears that were obtained with a standard staircase procedure (10 tone pips, 0.5-octave separation, between 500 Hz and 11.3 kHz). With corrective glasses in the experimental setup (subjects BC, DM, and MZ), all subjects had normal binocular vision except for JO, who is amblyopic in his right (recorded) eye. The calibration procedure described below corrected for any nonlinearities from this subject.
Apparatus
Experiments were conducted in a completely dark and
sound-attenuated room in which the inner walls, ceiling, and floor, as well as every large object present, were covered with black
sound-absorbing acoustic foam that effectively eliminated echoes above
500 Hz. The overall background sound level within the room was
approximately 30 dB SPL (A-weighted). The subject was seated
comfortably on a chair with back and foot support, and the head was
aligned with the center of the room. A customized neck rest, rigidly
attached to the floor, prevented the head from moving. Eye movements
were recorded with the scleral search coil technique (Collewijn et al. 1975). Horizontal and vertical eye position signals were
demodulated by lock-in amplifiers (PAR 128A), amplified and low-pass
filtered (cutoff 150 Hz), and sampled at 500 Hz per channel (Metrabyte DAS16H) before being stored on hard disk.
Stimulus generation
VISUAL STIMULI.
Visual stimuli were generated by 85 light-emitting diodes (LEDs) that were mounted on a thin wireframe that formed a hemispheric surface 85 cm in front of the subject (the "LED sky"). LEDs were positioned at visual angles that corresponded in a 2-D polar coordinate system to seven radial eccentricities R ∈ [2; 5; 9; 14; 20; 27; 35] deg with respect to the center of the LED sky, and 12 directions Φ ∈ [0; 30; 60; ... ; 330] deg, respectively (where Φ = 0 deg is rightward, Φ = 90 deg is upward, etc.; Fig. 1A). All LEDs could be turned green or red. The visual background was formed by turning all 85 LEDs green. The initial fixation point (FP) was presented by turning the central LED at [R, Φ] = [0, 0] red. The visual target was lit by turning one of the other green LEDs red. LED intensities were kept low to ensure that localization was difficult in the presence of the background (green LEDs: 0.25 cd/m²; red LEDs: 0.18 cd/m²). The LED sky was backed by an acoustically transparent thin black cloth.
Paradigms
Every subject performed three types of experiments: a calibration experiment, the primary auditory-visual (AV) experiment, and an AV-control experiment. Every session began with one block of the calibration experiment without the AV-background, then two blocks of either the primary or control AV-experiment, with the AV-background.
CALIBRATION EXPERIMENT.
In all experimental sessions, the subjects first performed a calibration experiment without the AV-background. Subjects were instructed to look from a central red FP to a randomly selected peripheral red LED target that was illuminated as soon as the FP was extinguished (1 block consisted of 72 targets: 12 directions × 6 eccentricities, R ≥ 5°, each presented once), and to press a hand-held button when the target was fixated.
AV EXPERIMENT.
The spatial and temporal layout of the AV-experiment is depicted in
Fig. 1. At time 0 in each trial, the AV-background was turned on. After a randomly selected interval of 100, 225, or 350 ms,
the central LED changed color from green to red to serve as the FP, and the subject was
required to fixate it. At time 1,000 ms, the central FP turned from red
to green, and a peripheral target was presented 100-200 ms later (see
following text). The subject was instructed to acquire the
peripheral target as quickly and as accurately as possible. The
location of the peripheral target was selected at random from 1 of 24 different positions. All 12 directions on the LED sky were equally
likely, but for each direction only 2 of the following 3 eccentricities
were selected: R = 14, 20, or 27° (Fig. 1). Subjects
made saccades to red visual targets (V-trials), auditory targets
(A-trials), or to bimodal auditory-visual targets (AV-trials). The
auditory target was presented at one of four different signal-to-noise intensity ratios (S/N ratio) relative to the fixed-intensity background: −6, −12, −18, or −21 dB. For the unimodal V- or
A-trials, the target was presented 200 ms after the FP turned green
(i.e., at time 1,200 ms) and persisted for 3,300 ms, determining the
maximal search time. In AV-trials, the auditory and visual targets were
always spatially coincident. The red visual target was illuminated at
time 1,200 ms; the auditory target was presented randomly at time 1,100 ms, 1,200 ms, or 1,300 ms (i.e., either synchronous, or 100 ms before or after the visual target). Note that any time a visual target was
presented, it was presented 200 ms after the offset of the fixation
point (i.e., a 200-ms "gap"). This was done so that our data would
not have to be analyzed as a function of gap interval, given the known
effects of gap interval on SRT (see Findlay and Walker 1999 for review).
A₋₆-saccades denote data from unimodal A-trials in which the target intensity was set to −6 dB relative to the auditory background. A₋₁₂100V-saccades denote data from AV-trials in which the auditory target (−12 dB relative to auditory background) led the visual target by 100 ms. Twelve different AV-trials were possible
(3 temporal asynchronies ×4 S/N ratios). In total, 17 different trial
types were tested at each target position (1 V-trial, 4 A-trials, 12 AV-trials), making a total of 408 different trials (17 trial types ×24
target positions) for one complete series. All trial types were
randomly interleaved. Each experimental session contained 204 trials
run in two blocks of 102 trials each. A subsequent session was
typically run on another day and contained the remaining 204 trials to
complete the series. Each subject completed at least three full series of AV-multimodal experiments (DM, MZ, and MW: 3 series; BC: 4 series; and JO: 5 series), yielding
between 72 and 120 trials per trial type.
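The trial bookkeeping described in this paragraph can be checked by enumerating the design. The following Python sketch is illustrative only: the labels and variable names are hypothetical, and the S/N values are written as negative decibels on the assumption that the auditory target was presented below the background level.

```python
from itertools import product

# Assumed labels; none of these names come from the original analysis code.
SN_RATIOS_DB = (-6, -12, -18, -21)        # auditory target level re. background
ASYNCHRONIES = ("A100V", "AV", "V100A")   # auditory leads, synchronous, visual leads
N_POSITIONS = 24                          # 12 directions x 2 eccentricities each


def trial_types():
    """Return the trial-type labels: 1 V-trial, 4 A-trials, 12 AV-trials."""
    types = [("V", None)]
    types += [("A", sn) for sn in SN_RATIOS_DB]
    types += list(product(ASYNCHRONIES, SN_RATIOS_DB))
    return types


types = trial_types()
trials_per_series = len(types) * N_POSITIONS   # one complete series
trials_per_session = trials_per_series // 2    # a series spans two sessions
# Trials per trial type for subjects who completed 3, 4, or 5 series.
trials_per_type = {n: n * N_POSITIONS for n in (3, 4, 5)}

print(len(types), trials_per_series, trials_per_session, trials_per_type)
```

Each trial type occurs once per target position in a series, which is why the number of trials per type scales directly with the number of completed series.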
An oversight on our part replaced the A₋₆100V-trials with V100A₋₆ trials. Although unfortunate, our conclusions were not affected by the lack of data from A₋₆100V-saccades.
AV-CONTROL EXPERIMENT.
It is known that the onset or offset of an auditory target can lower
SRTs to visual targets, presumably by a warning effect that is
independent of the spatial congruity of the auditory and visual stimuli
(Ross and Ross 1980, 1981). To parse out the portions of the data set from the primary AV-experiment that were caused by this nonspecific warning effect, each subject was also tested in a separate control experiment. Three trial types from the primary AV-experiment (V-only, A₋₁₂, and A₋₁₂V)
were mixed with a new type of bimodal stimulus in which the auditory
target sound was generated by the nine background speakers. For this auditory control stimulus, the acoustic signal was a linear
superposition of the Gaussian broadband white noise and the periodic
buzzer stimulus. A pilot test indicated good audibility of this sound when its level was at
3 dB relative to background. The subjects perceived this control sound as emanating from a single point near
straight ahead, although the exact location of this percept varied
between subjects. Accordingly, when the auditory control stimulus was
presented, the spatial coincidence of the visual and auditory targets
was lost, and the subject's task became ambiguous because of a
conflict between the location of the visual target and the perceived
location of the auditory control stimulus. In this experiment, a total
of 192 trials was measured (each trial type presented twice at each
location, yielding 48 trials per stimulus type).
Data analysis
DATA CALIBRATION.
Off-line calibration of horizontal and vertical eye position was
achieved by training two three-layer neural networks with the
back-propagation algorithm on the 72 fixation positions from the
calibration experiment (when the button was pressed) and the target
coordinates (see Goossens and Van Opstal 1997 for details). The absolute accuracy of the calibration was within 3% over
the entire response range. The networks were subsequently applied to
the raw data from the calibration (1st saccades only), AV-multimodal, and AV-control experiments to map the measured induction voltages onto
the corresponding 2-D orientations of the eye. Target and response coordinates are expressed as azimuth (α) and elevation (ε) angles, determined by a double-pole coordinate system in which the origin coincides with the center of the head. In this reference frame, target azimuth, α_T, is defined as the angle between the target and the midsagittal plane. Target elevation, ε_T, is the angle between the target and the horizontal plane through the ears with the head in a straight-ahead orientation.
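The relation between the polar coordinates of the LED sky and the double-pole azimuth and elevation angles follows from the two definitions just given. The Python sketch below assumes a unit direction vector with x pointing rightward and y upward, so that azimuth is the arcsine of x and elevation the arcsine of y. The function and variable names are illustrative and are not taken from the original analysis.

```python
import math

ECCENTRICITIES_DEG = (2, 5, 9, 14, 20, 27, 35)   # radial eccentricities R
DIRECTIONS_DEG = tuple(range(0, 360, 30))        # 12 directions, 0 = rightward


def polar_to_double_pole(r_deg, phi_deg):
    """Convert polar (R, direction) in degrees to (azimuth, elevation) in degrees.

    Assumes the target lies on a unit sphere centered on the head, with
    x rightward and y upward (an assumption of this sketch).
    """
    r = math.radians(r_deg)
    phi = math.radians(phi_deg)
    x = math.sin(r) * math.cos(phi)   # component out of the midsagittal plane
    y = math.sin(r) * math.sin(phi)   # component out of the horizontal plane
    return math.degrees(math.asin(x)), math.degrees(math.asin(y))


# Central LED plus 7 eccentricities x 12 directions.
led_positions = [(0.0, 0.0)] + [
    polar_to_double_pole(r, phi)
    for r in ECCENTRICITIES_DEG
    for phi in DIRECTIONS_DEG
]
print(len(led_positions))
```

On the cardinal axes the two coordinate systems coincide (a target at R = 20 deg, rightward, has an azimuth of 20 deg and zero elevation); on the oblique directions the azimuth and elevation are each smaller than R.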
1/SRT) (see Carpenter and Williams 1995).
STATISTICS.
For saccade accuracy, the optimal linear fit of the stimulus-response
relation between saccade amplitude and target eccentricity was found by
minimizing the sum-squared deviation of
(1)
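A least-squares fit of the kind described here can be sketched in a few lines of Python. The sketch assumes a relation of the form response = offset + gain × target, fitted separately for each response component. The function names, the use of a mean absolute error as the error summary, and the example data are all illustrative assumptions, not the original analysis.

```python
def fit_line(target, response):
    """Least-squares fit of response = offset + gain * target.

    Returns (offset, gain). Minimizes the summed squared deviation between
    the measured responses and the fitted line.
    """
    n = len(target)
    mean_t = sum(target) / n
    mean_r = sum(response) / n
    sxx = sum((t - mean_t) ** 2 for t in target)
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(target, response))
    gain = sxy / sxx
    offset = mean_r - gain * mean_t
    return offset, gain


def mean_absolute_error(target, response):
    """Mean unsigned difference between response and target (same units)."""
    return sum(abs(r - t) for t, r in zip(target, response)) / len(target)


# Made-up example: responses undershoot the target by 10% with a 1-deg bias.
targets = [-20.0, -10.0, 0.0, 10.0, 20.0]
responses = [0.9 * t + 1.0 for t in targets]
offset, gain = fit_line(targets, responses)
print(offset, gain, mean_absolute_error(targets, responses))
```

A gain near 1 with an offset near 0 indicates accurate localization in that component, whereas a gain near 0 indicates that the response no longer varies with target position.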
CALCULATION OF THE RACE MODEL PREDICTION.
Previous studies of the SRT reduction afforded by presenting bimodal
stimuli have utilized the concept of a race model to provide a
prediction for the SRT distribution that would be expected if the
subject reacted simply to whichever stimulus was perceived first (Colonius and Arndt 2001; Corneil and Munoz 1996; Harrington and Peck 1998; Hughes et al. 1994, 1998). This concept, which operates like a logical OR-gate, was originally developed to model manual reaction times and is alternatively referred to as statistical facilitation or probability summation (Gielen et al. 1983; Miller 1982; Raab 1962).
The SRT distribution predicted by a race model, R(τ) (where τ is a given SRT), is derived from the normalized SRT distributions for saccades to the unimodal auditory or visual stimuli, A(τ) and V(τ), respectively, by the following equation

$$R(\tau) = A(\tau) + V(\tau) - A(\tau)\,V(\tau) \qquad (2)$$
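The race model prediction can be sketched numerically as follows. The Python example assumes the standard independent-race (probability summation) form, in which the predicted cumulative probability is A + V − A·V, with A and V taken as empirical cumulative SRT distributions. The function names, the optional shift arguments for asynchronous stimuli, and the sample SRTs are hypothetical.

```python
def ecdf(samples, t):
    """Empirical cumulative probability that an SRT is <= t."""
    return sum(1 for s in samples if s <= t) / len(samples)


def race_prediction(a_srts, v_srts, t, a_shift=0.0, v_shift=0.0):
    """Cumulative probability at time t predicted by an independent race.

    a_shift / v_shift (ms) delay a unimodal distribution to mimic a stimulus
    that is presented later in an asynchronous pairing (an assumption of
    this sketch).
    """
    a = ecdf([s + a_shift for s in a_srts], t)
    v = ecdf([s + v_shift for s in v_srts], t)
    # Probability that at least one of two independent processes has finished.
    return a + v - a * v


# Made-up unimodal SRTs in ms.
a_srts = [150, 200, 250, 300]
v_srts = [200, 250, 300, 350]
print(race_prediction(a_srts, v_srts, 250))               # synchronous
print(race_prediction(a_srts, v_srts, 250, a_shift=100))  # auditory delayed 100 ms
```

Observed bimodal cumulative distributions that lie above this prediction are faster than statistical facilitation alone can explain; distributions that lie below it are slower.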
RESULTS
Properties of unimodal V-saccades and A-saccades
The presence of the AV-background and the S/N ratio of the
acoustic environment impacted the SRT and accuracy of unimodal V-saccades and A-saccades. V-saccades had longer SRTs in the presence of the AV-background, as evidenced by comparing the results from the
AV-experiment (with the AV-background) to the results from the
calibration experiment (without the AV-background; Fig. 2A and Table 1; P < 10⁻⁸ for all subjects, 1-D KS test).¹ Response
accuracy, quantified by the parameters of the linear regression
analysis between saccade amplitude and target eccentricity (see
METHODS), demonstrated that V-saccade accuracy in both
azimuth (Fig. 2B) and elevation (Fig. 2C) was
also compromised in the presence of the AV-background. These accuracy
differences were significant across all subjects (P < 0.05 using the 1-D KS test; Table 1).
The SRT and accuracy of A-saccades depended on the S/N ratio of the auditory target relative to background. SRTs were systematically longer and more variable for lower S/N ratios, as shown for a representative subject in Fig. 3A (Table 2 for all subjects). Interestingly, the accuracy of A-saccades decreased for the lower S/N ratios, but in a manner that differed for the azimuth and elevation response components. Targets were well localized in both azimuth and elevation at the highest S/N ratio (i.e., A₋₆-saccades), although the residual error was greater in elevation (Fig. 3, B and C). At the high S/N ratios, the accuracy of A-saccades was in the same range as V-saccades (compare gain and error values in Tables 1 and 2). At the lowest S/N ratio (i.e., A₋₂₁-saccades), saccade accuracy in azimuth decreased only slightly when compared with A₋₆-saccades (Fig. 3B), yet was almost completely abolished in elevation (Fig. 3C). An analysis of the gain of the stimulus-response relationship (Fig. 4A) and the absolute response error (Fig. 4B) for the azimuth and elevation response components of A-saccades across S/N ratio confirmed the greater inaccuracy of aurally guided saccades in the elevation component at lower S/N ratios (i.e., for A₋₁₈- and A₋₂₁-saccades) than in the azimuth component, which was only slightly compromised for A₋₂₁-saccades (see also Table 2). These results confirm and extend earlier findings of auditory localization (Good and Gilkey 1996; Zwiers et al. 2001).
In summary, the increased SRT and decreased accuracy of V- and A-saccades (particularly at low S/N ratios) confirmed that the presence of the AV-background made the task much more difficult, although not impossible. A similar analysis on saccades generated to the different types of bimodal stimuli is now presented.
Properties of AV-saccades (no temporal asynchronies)
A representative example of the properties of AV-saccades is shown in Fig. 5, in which V-, A₋₁₈-, and A₋₁₈V-saccades are compared. Note that the SRT distributions for A₋₁₈- and A₋₁₈V-saccades are nearly superimposed (Fig. 5A), equaling but not exceeding the race model prediction based on the unimodal SRT distributions (black, solid line). The relationships of AV-saccades to the race model prediction are studied more thoroughly below. A comparison of saccade accuracy demonstrated that the residual errors of A₋₁₈V-saccades were smaller than the residual errors for both V- and A₋₁₈-saccades in the elevation (Fig. 5C), but not in the azimuth (Fig. 5B), component.
The SRT and accuracy of AV-saccades are contrasted more directly with unimodal saccades using a 2-D comparison of absolute localization error (combining both azimuth and elevation) versus SRT (see Fig. 6 for 1 subject). Each point in Fig. 6 stems from an individual saccade, and the ellipses circumscribe the mean values within one SD. Note that V-saccades were generated at longer SRTs but were more accurate than A₋₁₈-saccades. However, the 2-D distribution of A₋₁₈V-saccades is clearly distinct from either unimodal distribution, as the AV-saccades attained accuracies in the range of V-saccades, but at SRTs in the range of A₋₁₈-saccades. The 2-D KS pair-wise statistic comparing the three distributions showed that all were significantly different (P < 10⁻⁵ for all 3 comparisons). The results of this three-way statistical comparison across all subjects and at all S/N ratios are shown in Table 3. When the S/N ratio was low (−18 or −21 dB), the 2-D distributions for AV-saccades differed significantly from both unimodal A-saccades and V-saccades. For the higher S/N ratios (−12 or −6 dB), the distributions for AV-saccades were often similar to the A-saccade distributions, but were always significantly different from V-saccades.
Another interesting observation from Fig. 6 is that the AV-saccades appeared to be distributed over a narrower accuracy-SRT range than A- and V-saccade distributions (compare the horizontal and vertical spans of the ellipses in Fig. 6). To quantify this point across all S/N ratios and subjects, we made two comparisons. First, we compared the SRT variance of AV-saccades to A-saccades (Fig. 7A) and demonstrated that the SRT variance for AV-saccades was similar to A-saccades at high S/N ratios, but had consistently lower variances at lower S/N ratios. Second, a comparison of the accuracy variance between AV-saccades and V-saccades showed that the accuracy variances for AV-saccades were consistently narrower than those for V-saccades, as the majority of data points lay below the diagonal in Fig. 7B. Thus, although auditory targets were barely detectable in elevation at low S/N ratios (Fig. 4 and Table 2), they were integrated effectively with the visual target to reduce both the mean and the variance of AV-saccade SRT and accuracy.
Taken together, the data suggest that the magnitude of multisensory interactions depended systematically on the S/N ratio of the auditory target relative to the background. At low S/N ratios, A-saccades were characterized by decreasing accuracy and longer SRTs, whereas V-saccades were more accurate but were generated at much longer SRTs. When these two weak stimuli were combined, AV-saccades benefited from the "best of both worlds" in that they were as accurate as V-saccades and initiated at SRTs typical of A-saccades. Moreover, the variability of SRT and accuracy for AV-saccades was decreased compared with A-saccades and V-saccades, respectively, indicating more consistent responses.
AV interactions as a function of stimulus timing
In this section, we compare the properties of saccade responses to synchronous stimuli (i.e., AV-saccades) to those from asynchronous stimuli (i.e., A100V- or V100A-saccades). First, the SRT distributions for each stimulus asynchrony and S/N ratio combination were compared with the distributions predicted by the race model (see METHODS). To that end, the observed cumulative response distributions for bimodal stimuli were plotted as a function of the predicted cumulative race distributions (Fig. 8A for a representative subject). Such plots compare the relative differences between the observed and predicted cumulative distributions, regardless of absolute SRT. Note that the comparison plots for AV-saccades (solid lines in Fig. 8A) lay close to the unity line, implying that the observed SRT distributions were approximately equal to those predicted by the race model for all four S/N ratios, consistent with Fig. 5A. However, the comparison plots for V100A-saccades (dashed lines in Fig. 8A) lay well above the unity line, indicating that the observed SRT distributions were considerably shorter than those predicted by the race model. Conversely, the comparison plots for A100V-saccades (dashed-dotted lines in Fig. 8A) lay well below the unity line, meaning that the observed SRTs were much longer than predicted by the race model. This latter finding is quite striking since it implies that the delayed visual stimulus inhibits the SRTs for A100V-saccades compared with the SRTs for A-saccades.
It is not trivial to appreciate how these relationships with the race model change with the S/N ratio of the acoustic environment. To quantify this, we determined the area of the difference curve between the observed and predicted cumulative SRT distributions for those SRTs where the cumulative probabilities fell between 0.1 and 0.9. These calculated areas express the amount by which the observed data exceeded (positive values) or fell short (negative values) of the race model prediction regardless of the absolute SRTs, and are plotted for the same subject in Fig. 8B. Presented this way, it is clear that no systematic relationship emerged with the S/N ratio, hence the extracted areas were averaged across all S/N ratios (gray bars in Fig. 8B). AV-saccades did not deviate significantly from the race model (P > 0.05), whereas A100V-saccades had significantly longer SRTs by about 20% (P < 0.02) and V100A-saccades had significantly shorter SRTs by about 15% (P < 0.02) than those predicted by the race model. This pattern of SRT responses was found for three of five subjects. In the other two subjects, there was no significant difference between the observed and predicted SRT distributions for both AV-saccades and V100A-saccades. In these two subjects, the unimodal (shifted) SRT distributions did not overlap sufficiently, so that the race model prediction equaled the shorter unimodal SRT distribution (in this case for A-saccades). Regardless, averaging across all subjects revealed that the overall patterns were consistent (Fig. 8C; P < 0.001 for the A100V-saccades, P > 0.05 for AV-saccades, and P < 0.05 for V100A-saccades). Thus the relationships of the observed SRTs to those predicted by the race model depended on stimulus asynchrony.
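The area measure described in this paragraph can be illustrated with a short Python sketch. It adopts one possible reading of the procedure: both cumulative distributions are sampled at the same SRTs, only samples whose predicted cumulative probability lies between 0.1 and 0.9 are kept, and the difference (observed minus predicted) is integrated over the predicted probability with the trapezoid rule. The function name, the choice of integration variable, and the example values are assumptions of this sketch.

```python
def difference_area(observed, predicted, lo=0.1, hi=0.9):
    """Signed area between observed and race-predicted cumulative curves.

    observed, predicted: cumulative probabilities sampled at the same SRTs,
    in order of increasing SRT. Positive values mean the observed
    distribution leads (is faster than) the prediction.
    """
    # Keep only samples inside the probability window of the prediction.
    points = [(p, o - p) for o, p in zip(observed, predicted) if lo <= p <= hi]
    area = 0.0
    for (p0, d0), (p1, d1) in zip(points, points[1:]):
        area += 0.5 * (d0 + d1) * (p1 - p0)   # trapezoid rule
    return area


# Made-up example: observed curve sits 0.05 above the prediction throughout.
predicted = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
faster = [min(1.0, p + 0.05) for p in predicted]
slower = [max(0.0, p - 0.05) for p in predicted]
print(difference_area(faster, predicted))
print(difference_area(slower, predicted))
```

Because the measure is computed in the probability domain, it compares the shapes of the two distributions independently of the absolute SRTs, in the same spirit as the comparison plots against the unity line.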
To quantify the accuracy of bimodal saccades across stimulus asynchrony and S/N ratio, we first plotted the absolute azimuth and elevation localization errors as a function of S/N ratio for the different temporal asynchronies. Figure 9 shows data from one representative subject. Note that the accuracy of bimodal saccades almost always surpassed that of A-saccades, regardless of asynchrony or S/N ratio. In most cases, the accuracy of AV-saccades was also better than that of V-saccades. V100A-saccades tended to be among the most accurate, surpassing both AV- and A100V-saccades, particularly in elevation at low S/N ratios (Fig. 9B). Statistical analysis across all subjects confirmed that the elevation gain of bimodal saccades differed more from the gains obtained from V-saccades than A-saccades did at the lower S/N ratios (Table 4). However, this trend was not observed in the azimuth response component (Table 4).
A summary of the combined SRT-accuracy results for all bimodal stimulus
conditions is shown in Fig. 10. These
data were obtained by first normalizing the results for each stimulus
condition with respect to the accuracy and SRT of V-saccades within
each subject, and then averaging the normalized results for each
condition across all subjects (note that data for A₋₆100V-saccades are absent; see METHODS). All bimodal data in this accuracy-SRT plane are clearly distinct from the unimodal saccades, and there were obvious patterns depending on both the asynchrony and the S/N ratio. First, the
normalized SRT and absolute localization error of A-saccades and
bimodal saccades progressively increased with decreasing S/N ratios.
Second, the position of the bimodal data in the accuracy-SRT plane
depended strongly on the stimulus asynchrony. Relative to AV-saccades,
A100V-saccades were more inaccurate and had longer SRTs at the lower
S/N ratios. This latter point is in agreement with our earlier analysis
on the SRTs compared with the race model (Fig. 8) and again shows that
the delayed visual stimulus slowed the SRT of A100V-saccades compared
with unimodal A-saccades. In contrast, V100A-saccades were more
accurate than AV-saccades, but had longer SRTs. Yet, V100A-saccades
clearly surpassed the predictions of the race model (Fig. 8; recall
that the unimodal distributions had to be shifted by 100 ms to
determine the race model predictions). Overall, the best performances,
indexed by the relative position of the bimodal saccades compared with
the unimodal counterpart, were observed for AV- and V100A-saccades at
the lowest S/N ratios.
AV-control experiment
We conducted a control experiment to test for the presence and
influence of a generalized warning effect of the auditory target on
both SRT and accuracy. Figure 11 shows
the data pooled for all subjects from the AV-control experiment, which
used an additional bimodal stimulus consisting of a visual target with
a control auditory stimulus set up by the background speakers (recall
that subjects perceived this sound as emanating from a fixed location near center). We emphasize two main points from this experiment. First,
although the control auditory stimulus provided some warning cue
information to shorten SRTs of AV-control saccades compared with
V-saccades, the SRTs for spatially coincident A12V-saccades were still
shorter (Fig. 11A). Thus, although one component (around 60 ms) of the
shorter SRTs for A12V-saccades could be attributed to a warning effect,
another component (accounting for an additional 65 ms) depends on the
spatial alignment of the stimuli. Second, note that AV-control saccades
were much more inaccurate than spatially coincident A12V-saccades
(Fig. 11, B and C) or either V- or A-saccades (Table 5). Thus, although
the nonlocalizable
auditory target conferred a beneficial warning effect on SRTs, it
degraded saccade accuracy. These results were consistent across all
subjects (Table 5), from which it was concluded that the combined
benefits conferred by auditory-visual integration across SRT and
accuracy depended on the spatial alignment of the stimuli.
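The two SRT components quoted above can be written out as a simple decomposition. This assumes, as the wording suggests but does not state explicitly, that both components are measured relative to V-saccades and add linearly; the symbols are introduced here for illustration only.

```latex
\begin{aligned}
\Delta_{\text{warning}} &= \mathrm{SRT}_{\mathrm{V}} - \mathrm{SRT}_{\mathrm{AV\text{-}control}} \approx 60~\text{ms},\\
\Delta_{\text{spatial}} &= \mathrm{SRT}_{\mathrm{AV\text{-}control}} - \mathrm{SRT}_{\mathrm{A12V}} \approx 65~\text{ms},\\
\Delta_{\text{total}} &= \Delta_{\text{warning}} + \Delta_{\text{spatial}} \approx 125~\text{ms}.
\end{aligned}
```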
DISCUSSION
This study investigated the first-saccade responses to visual, auditory, and bimodal stimuli distributed throughout the 2-D oculomotor range and embedded within a complex AV-background. We believe the timing and metrics of the first saccade provide a measure for the speed and precision with which the oculomotor system can localize and orient to the stimuli. The properties of saccades to unimodal stimuli testify to the complexity of the task: the SRT and error of V-saccades increased greatly in the presence of the AV-background (Fig. 2, Table 1), and the SRT and error of A-saccades depended systematically on the S/N ratio of the acoustic scene, becoming prolonged and inaccurate, particularly in the elevation component, at lower S/N ratios (Figs. 3 and 4; Table 2). The properties of unimodal saccades provided wide ranges over which the benefits afforded by multisensory integration were realized. Specifically, saccades to bimodal stimuli were generated at SRTs typical of A-saccades, but at accuracies typical of V-saccades. These results depended critically on the temporal register of the stimuli and on the S/N ratio of the acoustic environment (Fig. 10). The control experiment demonstrated that the spatial register of the stimuli is also important (Fig. 11; Table 5), although this variable was not systematically manipulated. In this discussion, we argue that mechanisms other than neural integration of the auditory and visual signals cannot explain all aspects of our data. Our results are then related to behavioral and neurophysiological studies. Last, we propose a conceptual neural framework.
Consideration of mechanisms other than neural integration
We consider three mechanisms that could underlie the observed
properties of bimodal saccades: race models, aurally assisted visual
search, and auditory warning-cue effects. Each predicts specific
patterns of SRTs and accuracy that differ substantially from those we
observed. For example, race models state that subjects respond to
whichever stimulus is perceived first, and derive SRT distributions
from the unimodal data (Eq. 2) (Colonius and Arndt 2001; Corneil and
Munoz 1996; Gielen et al. 1983; Harrington and Peck 1998; Hughes et al.
1994, 1998; Nozawa et al. 1994). Since the SRTs for A-saccades were much shorter
than for V-saccades, race models predict that most saccades in bimodal
trials would be initiated in response to the auditory target. However,
if the subjects only reacted to the auditory target on bimodal trials, then the accuracy of bimodal saccades should equal the accuracy of
A-saccades. This was never observed; bimodal saccades were always more
accurate than A-saccades (Fig. 10). Even a trial-by-trial comparison of
SRT and accuracy shows that individual AV-saccades combine properties
typical of both A-saccades and V-saccades (Fig. 6).
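To make the race model prediction concrete, the sketch below computes a predicted cumulative SRT distribution from two unimodal samples. It assumes an independent race (the response is triggered by whichever process finishes first), which may differ in detail from Eq. 2; the function names and sample values are hypothetical. The optional delays mimic the 100-ms shift of the unimodal distributions used for asynchronous stimuli.

```python
def ecdf(samples, t):
    """Empirical cumulative distribution: fraction of samples at or below t."""
    return sum(s <= t for s in samples) / len(samples)

def race_cdf(a_srts, v_srts, t, a_delay=0.0, v_delay=0.0):
    """Predicted P(SRT <= t) for a bimodal trial under an independent race.

    a_delay and v_delay shift the unimodal distributions to model a delayed
    auditory or visual onset (e.g., a_delay=100 for a V100A stimulus).
    """
    fa = ecdf(a_srts, t - a_delay)
    fv = ecdf(v_srts, t - v_delay)
    # P(min(A, V) <= t) = F_A + F_V - F_A * F_V for independent processes.
    return fa + fv - fa * fv

# Hypothetical unimodal SRT samples in ms, for illustration only.
a_srts = [200, 220, 240, 260]
v_srts = [300, 320, 340, 360]

synchronous = race_cdf(a_srts, v_srts, t=250)           # AV stimulus
v_leads = race_cdf(a_srts, v_srts, t=310, a_delay=100)  # V100A stimulus
```

Because the auditory SRTs are much shorter than the visual ones in these made-up samples, the synchronous prediction is dominated by the auditory distribution, which is the situation described in the text.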
Whereas the observed SRTs for AV-saccades agree nicely with those
predicted by the race models, the observed SRTs for A100V-saccades were longer
and the SRTs for V100A-saccades were shorter than the race model
predictions (Fig. 8), testifying to another inadequacy of a race model
mechanism. At first, it might seem surprising that the observed SRTs
for AV-saccades did not exceed the predicted SRTs, given the many
examples of race model violations in the literature (Colonius and Arndt
2001; Harrington and Peck 1998; Hughes et al. 1994, 1998). However, many
of these race model violations stem from simple experiments in which
subjects orient to the target(s) without the presence of distracting
stimuli. Complicating the experiments by employing distracting stimuli,
or by instructing the subjects to orient to the auditory instead of the
visual target, leads to observed SRTs to bimodal stimuli that do not
meet, let alone exceed, the SRTs predicted by the race model
(Corneil and Munoz 1996; Hughes et al. 1994). More complex experimental paradigms, such as the one
described here, presumably engage processes related to target selection
and/or discrimination that prolong SRTs and demonstrate the
insufficiency of a simple race model mechanism in accounting for the
observed data. Below, we speculate on neural mechanisms that could
account for the shorter SRT and improved accuracy of V100A-, but not
A100V-saccades.
An "aurally assisted visual search" mechanism (Perrott et al. 1990,
1991) also cannot explain the combined
patterns of SRT and accuracy. This mechanism proposes that the role of
the auditory localization system is to bring the fovea into line with
an auditory stimulus, constraining the area over which the visual
system searches for a visual target, thereby expediting the time to
locate and identify a visual target without necessitating
auditory-visual integration. Importantly, while this mechanism
considers processes beyond the generation of the first saccade and
could explain the evolution of the scanning pattern, it holds that the
first saccade to a bimodal stimulus is aurally guided. This mechanism
therefore predicts that both the SRT and accuracy of AV-saccades should equal those of A-saccades, which differs from the observed data. As with the
race model, the aurally assisted visual search mechanism cannot explain
the improved accuracy of AV-saccades beyond the level typical of
A-saccades. Another prediction of this mechanism is that A100V-saccades
should be the most accurate and V100A-saccades the most inaccurate.
This also differed drastically from the observed data (Fig. 10).
A third explanation of the observed data could be that the auditory
system provides a nonlocalized "warning cue" to the subject to
initiate the saccade, irrespective of the spatial register of the
stimuli (Kingstone and Klein 1993; Ross and Ross 1980, 1981). While this mechanism could explain
a partial reduction of SRTs of bimodal saccades (Fig. 11; Table 5), the
AV-control experiment demonstrated that spatial alignment of the visual
and auditory stimuli was crucial in mediating further improvements in
SRT and accuracy, counter to a warning-cue mechanism (Fig. 11; Table
5). It is also hard to imagine how a warning cue mechanism could
explain the larger improvements in accuracy at lower S/N ratios (Fig.
9) or why saccade accuracy and SRT varied systematically with the
different temporal asynchronies (Fig. 10). Although we would have liked to
alter the spatial congruity between the AV-stimuli systematically in
this study, doing so is a major undertaking and is the focus of a
separate, ongoing series of experiments.
In conclusion, all three mechanisms assume that bimodal saccades are driven in response to one modality, and therefore predict that their timing and metrics should be identical to either V- or A-saccades. Yet, in none of the 11 different stimulus configurations tested did bimodal saccades have an accuracy-SRT profile identical to V- or A-saccades (Fig. 10). The observed properties of bimodal saccades combine aspects of both A-saccades and V-saccades to achieve the "best-of-both-worlds," and accordingly the most parsimonious explanation is that auditory and visual stimuli are integrated in a way that depends on their spatial and temporal register.
Rules for multisensory integration of bimodal signals and comparison to previous work
In the intermediate layers of the mammalian SC, many neurons
respond to multimodal stimuli (Stein and Meredith 1993).
Studies in anesthetized preparations show that the form and magnitude of multisensory interactions in these neurons depend on the temporal and spatial alignment of the stimuli. Further, SC neurons obey the
principle of inverse effectiveness, whereby the magnitude of
multisensory interactions is largest when the multisensory stimuli are
presented at near-threshold intensities (Meredith and Stein 1986).
Studies of SC activity in awake preparations have confirmed these basic rules (Bell et al. 2001; Frens and Van Opstal 1998; Peck 1996; Peck et al. 1995; Wallace et al. 1998). However, linking these rules to behavior is not always straightforward. For example, mean SRTs in humans to high-intensity audio-visual stimuli
are typically 10-50 ms shorter than SRTs to unimodal stimuli (Colonius and Arndt 2001; Engelken and Stevens 1989; Frens et al. 1995; Goldring et al. 1996; Harrington and Peck 1998; Hughes et al. 1994, 1998; Munoz and Corneil 1995; Nozawa et al. 1994). Studies with lower
intensity stimuli have found that the SRT reductions to low-intensity
bimodal stimuli are in the same range (Frens et al. 1995; Hughes et al.
1994; Nozawa et al. 1994), contrary to what would have been predicted
given inverse effectiveness (Meredith and Stein 1986). Is it possible
that inverse effectiveness is masked by other neural processes
operating only in behavioral experiments? If so, could these processes
also confound SRT and accuracy of bimodal saccades?
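Inverse effectiveness is easiest to see with a proportional enhancement index. The sketch below uses the index commonly applied to SC recordings (the bimodal response relative to the best unimodal response, in percent); it is included here as an assumption for illustration, is not taken from this section, and the response values are hypothetical.

```python
def multisensory_enhancement(bimodal, best_unimodal):
    """Percent change of the bimodal response relative to the best unimodal one."""
    return 100.0 * (bimodal - best_unimodal) / best_unimodal

# Hypothetical mean responses (e.g., spikes per trial), for illustration only.
weak_stimuli = multisensory_enhancement(bimodal=6.0, best_unimodal=2.0)
strong_stimuli = multisensory_enhancement(bimodal=26.0, best_unimodal=20.0)
```

In this made-up example the absolute gain is similar for weak and strong stimuli, but the proportional gain is far larger for the weak ones, which is the pattern that inverse effectiveness describes.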
In light of these questions, we highlight several limitations or
confounds of previous behavioral studies. First, auditory stimuli have
been typically constrained to the horizontal plane, meaning that only
sound-source azimuth needed to be extracted. Our extension to the 2-D
oculomotor range, as well as the manipulation of the acoustic S/N
ratio, provided the opportunity to observe major differences in the
sensitivity of azimuth and elevation perception. The ability of the
auditory system to extract stimulus elevation degraded at higher S/N
ratios than stimulus azimuth (Fig. 4), consistent with recent studies
(Good and Gilkey 1996; Zwiers et al. 2001). This effect relates presumably to the different mechanisms the CNS uses to extract sound-source azimuth and elevation from the acoustic cues (see Blauert 1997 for review).
Consequently, the accuracy improvements afforded by presenting bimodal
targets at low S/N ratios were greater in elevation than in azimuth
(Fig. 9, Table 4). Previous studies may have underestimated the
contributions of multisensory integration to saccade accuracy by
constraining targets to the horizontal plane.
A second limitation of previous studies is the use of a limited number of potential target positions. This could allow subjects to use prior knowledge about potential target positions to prepare movements before target presentation, which, if left unaccounted for, would also lead to underestimations of the contributions of multisensory integration to saccade accuracy. The present setup used 24 potential target locations, making such a strategy highly unlikely.
Third, subject instructions and experimental context affect the
temporal expression of multisensory integration (i.e., SRT). For
example, requiring a subject to orient specifically to one modality
while ignoring the other yields SRT distributions that violate the race
model when the instructed target is visual, but not when the instructed
target is auditory (Corneil and Munoz 1996; Hughes et al. 1994). In
general, requiring subjects to
discriminate between modalities prolongs SRTs (Corneil and Munoz 1996)
and could confound the estimation of the contributions of
multisensory integration to SRT. In the present experiments, subjects
could orient to both the auditory and visual stimulus, so this was not a concern. Overall, the setup employed in our experiments allows for a
behavioral assessment of the consequences of multisensory integration
over both spatial and temporal domains, while avoiding
confounds, such as the three discussed here, that affected the
interpretation of previous studies.
A few behavioral studies have manipulated the temporal alignment
between auditory and visual stimuli to address the temporal window over
which stimuli may interact (Colonius and Arndt 2001; Corneil and Munoz
1996; Engelken and Stevens 1989; Frens et al. 1995). For the
saccadic system, the temporal window is about ±100 ms, presumably
allowing AV-integration in spite of differences in retinal versus
cochlear transduction times (~50 ms and 2-10 ms, respectively)
(Gouras 1967; Kraus and McGee 1992) and
the speed of sound versus light over a large range of stimulus
distances. In the SC of awake, behaving primates, auditory response
latencies usually range around 30 ms (Bell et al. 2001; Jay and Sparks
1987) and visual response latencies
around 60 ms (see Munoz et al. 2000 for review),
suggesting that the more complex transformation of auditory responses
into oculocentric coordinates does not greatly affect the relative
arrival times at the SC. As shown in Fig. 10, the combination of SRT
and accuracy of AV-saccades surpassed that for either unimodal A- or
V-saccades, and we argued above that a race model could not account for
these data. However, Fig. 10 also shows that the temporal window
permitting excitatory interactions is not symmetrical around
synchronously presented stimuli. V100A-saccades were initiated at SRTs
that surpassed the race model prediction and were more accurate than
any other saccade type. Conversely, A100V-saccades were initiated at
SRTs that fell well short of the race model prediction (and were even
slower than A-saccades) and were more inaccurate than AV-saccades and
V-saccades at low S/N ratios (but were still more accurate than
A-saccades). These findings suggest a nonlinearity in the interactions
of delayed visual or auditory signals. Apparently, a delayed auditory
signal facilitates saccade initiation and sharpens the accuracy of a developing visually guided saccade, but a delayed visual signal inhibits saccade initiation and worsens the accuracy of a developing aurally guided saccade. Although surprising, this nonlinearity bears
some resemblance to multisensory recordings in the SC of anesthetized
cats, wherein response enhancements are observed if the visual stimulus
leads the auditory stimulus but response depressions are observed if
the auditory stimulus leads the visual stimulus (Meredith et al.
1987). Understanding the neural mechanism(s) responsible for
this nonlinearity requires neuronal recordings from awake, behaving preparations.
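The asymmetry can be illustrated by working out when each signal would reach the SC under the three stimulus timings. The sketch below uses the approximate latencies quoted above (about 30 ms auditory, about 60 ms visual); treating them as fixed values that also apply to the low-intensity stimuli used here is an assumption, and the function name is hypothetical.

```python
AUDITORY_LATENCY_MS = 30  # approximate SC auditory response latency from the text
VISUAL_LATENCY_MS = 60    # approximate SC visual response latency from the text

def sc_arrival_gap(auditory_onset_ms, visual_onset_ms):
    """Return which signal reaches the SC first and the gap (ms) to the second."""
    auditory_arrival = auditory_onset_ms + AUDITORY_LATENCY_MS
    visual_arrival = visual_onset_ms + VISUAL_LATENCY_MS
    first = "auditory" if auditory_arrival <= visual_arrival else "visual"
    return first, abs(visual_arrival - auditory_arrival)

# (auditory onset, visual onset) in ms for each stimulus condition.
conditions = {"AV": (0, 0), "V100A": (100, 0), "A100V": (0, 100)}
gaps = {name: sc_arrival_gap(a, v) for name, (a, v) in conditions.items()}
```

Under these assumptions the second signal trails the first by 30 ms for AV, 70 ms for V100A, and 130 ms for A100V, so the same 100-ms stimulus asynchrony produces quite different neural asynchronies depending on which modality leads.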
Conceptual model of auditory-visual interactions in a complex scene
Figure 12 presents a conceptual
model to explain how activity within the SC might evolve prior to A-,
V-, and AV-saccades. We assume that visual and auditory signals are
initially processed separately and converge on the SC, inducing
modality-specific profiles of SC activity. At high intensities, aurally
induced profiles arrive earlier than visually induced profiles, but
with lower firing rates and a broader tuning (dashed lines and empty profiles in Fig. 12, B and C) (Bell et al. 2001; Frens and Van Opstal 1998; Jay and Sparks 1987; Peck et al. 1995; Wallace et al. 1996, 1998