Visual object recognition is computationally difficult because changes in an object's position, distance, pose, or setting may cause it to produce a different retinal image on each encounter. To robustly recognize objects, the primate brain must have mechanisms to compensate for these variations. Although these mechanisms are poorly understood, it is thought that they elaborate neuronal representations in the inferotemporal cortex that are sensitive to object form but substantially invariant to other image variations. This study examines this hypothesis for image variation resulting from changes in object position. We studied the effect of small differences (±1.5°) in the retinal position of small (0.6° wide) visual forms on both the behavior of monkeys trained to identify those forms and the responses of 146 anterior IT (AIT) neurons collected during that behavior. Behavioral accuracy and speed were largely unaffected by these small changes in position. Consistent with previous studies, many AIT responses were highly selective for the forms. However, AIT responses showed far greater sensitivity to retinal position than predicted from their reported receptive field (RF) sizes. The median AIT neuron showed a ∼60% response decrease between positions within ±1.5° of the center of gaze, and 52% of neurons were unresponsive to one or more of these positions. Consistent with previous studies, each neuron's rank order of target preferences was largely unaffected across position changes. Although we have not yet determined the conditions necessary to observe this marked position sensitivity in AIT responses, we rule out effects of spatial-frequency content, eye movements, and failures to include the RF center. 
To reconcile this observation with previous studies, we hypothesize that either AIT position sensitivity strongly depends on object size or that position sensitivity is sharpened by extensive visual experience at fixed retinal positions or by the presence of flanking distractors.
Although we effortlessly perform object recognition thousands of times per day, it is a remarkably difficult computational task (Edelman 1999; Ullman 1996). The key computational problem the brain must solve is that the same object can produce a wide variety of sensory images (Edelman 1999; Riesenhuber and Poggio 2000; Ullman 1996). In the visual domain, retinal image variations arise from changes in object position, scale (e.g., viewing distance), orientation, pose, and illumination as well as the presence of other objects in the visual scene. How does the brain tolerate this tremendous variability to identify the object? In this report, we present data aimed at understanding how behaving animals tolerate one type of image variability—that due to changes in object position relative to the center of gaze.
Object position changes are a common source of image variation because they occur frequently when environments are explored with eye, head, or body movements. Yet even in the face of such position variation, we easily carry out behaviors that depend on recognition. Indeed, some studies suggest that recognition can tolerate changes of ≥5° (Biederman and Cooper 1991; Ellis et al. 1989). However, others indicate that the position tolerance of recognition depends on visual experience and the similarity of the objects to be distinguished (Dill and Edelman 2001; Dill and Fahle 1997, 1998; Foster and Kahn 1985; Nazir and O'Regan 1990).
Any theory that can explain some range of position tolerance in recognition behavior must include mechanisms that transform retinal images to neuronal signals that are sensitive to object form but are largely insensitive to object position over that range. That is, neuronal signals that are at least as position tolerant as the behavior must exist somewhere in the brain because the behavior dictates their presence at the level of motor neurons. Such neurons could be described as having large receptive fields (RFs) in that they respond selectively to objects over all retinal positions at which recognition occurs. However, because it would be inappropriate to describe motor neurons as having large RFs, we use the term position sensitivity because it can be applied without confusion to the neuronal responses along the entire stimulus-motor chain of processing.
Although many mechanisms have been proposed to create object-selective, position-tolerant signals in the brain (e.g., Biederman 1987; Mel 1997; Olshausen et al. 1993; Riesenhuber and Poggio 1999; Salinas and Abbott 1997; Ullman 1996), the actual mechanisms are unknown, and the brain regions thought to contain these signals are poorly understood. The dominant hypothesis is that these mechanisms operate in the ventral visual processing stream of the cerebral cortex and produce position-tolerant patterns of neuronal activity at the highest level of that stream—the anterior inferotemporal cortex (AIT) (Gross 1973; Logothetis and Sheinberg 1996; Tanaka 1996; Ungerleider and Mishkin 1982). Indeed, inferotemporal cortex (IT) likely plays a central role in object recognition because IT lesions (Dean 1982; Weiskrantz and Saunders 1984) or inactivation (Horel 1996) impair recognition, and IT neuronal responses are selective for complex stimulus forms (Logothetis and Sheinberg 1996; Miyashita 1993; Tanaka 1996), such as faces (Desimone et al. 1984; Perrett et al. 1982).
The strongest statement of the IT position-tolerance hypothesis predicts that IT responses should be highly sensitive to stimulus form (i.e., identity) and completely insensitive to stimulus position (within the visual field). It is already well known that this strict interpretation is not true because previous studies show that IT neurons have finite RFs and that IT responses often decrease with changes in stimulus position away from the RF center (Boussaoud et al. 1991; Desimone et al. 1984; Gross et al. 1969, 1972; Ito et al. 1995; Kobatake and Tanaka 1994; Leuschow et al. 1994; Logothetis et al. 1995; Missal et al. 1999; Op de Beeck and Vogels 2000; Richmond et al. 1983; Sary et al. 1993; Schwartz et al. 1983; Tovée et al. 1994). Furthermore, IT neurons are often described as having only a relative form of position tolerance in which the neuron's overall responsiveness decreases with changes in position but its rank order of target preferences remains the same (e.g., Logothetis and Sheinberg 1996). We do not yet know if or how this relative position tolerance supports nonrelative behavioral position tolerance. Nevertheless, IT neurons have been shown to maintain this relative position tolerance over visual regions ≥10° in diameter (Ito et al. 1995; but see Logothetis et al. 1995 and discussion; Sary et al. 1993; Schwartz et al. 1983; Tovée et al. 1994). Thus all of these studies suggest that IT neurons maintain responsivity over large regions of visual space; that is, they have large RFs. Indeed, standard RF mapping methods indicate that AIT neurons have very large RFs (10-30° in diameter) (Boussaoud et al. 1991; Desimone et al. 1984; Gross et al. 1969, 1972; Kobatake and Tanaka 1994; Op de Beeck and Vogels 2000; Richmond et al. 1983).
Although previous studies indicate that AIT neurons maintain relative form selectivity over large RFs, it is not known if or how these neuronal responses compare with the position tolerance of the recognition behavior they are thought to support. We therefore sought to understand the neuronal responses to one or more recognition targets placed within the large RFs of form-selective AIT neurons while animals performed form-recognition tasks. To this end, we trained animals to recognize and report the identity of familiar objects and developed a technique that allowed presentation of visual stimuli to arbitrary retinal positions with an accuracy of ∼0.1°, even in free-viewing animals (DiCarlo and Maunsell 2000). We first sought to confirm the large RF property of AIT neurons by presenting stimuli at three closely spaced retinal positions (-1.5, 0, and +1.5°). Based on the studies described in the preceding text, these positions should have all been well within the RFs of essentially all AIT neurons. Unexpectedly, most AIT neurons were highly sensitive to these small changes in stimulus position.
Animals and surgery
Experiments were performed on two male rhesus monkeys (Macaca mulatta) weighing 4.5 and 4.7 kg. Before behavioral training, aseptic surgery was performed to attach a head post to the skull and to implant a scleral search coil in the right eye. After 2-3 mo of behavioral training (following text), a second surgery was performed to place a recording chamber (18 mm diam) to reach the anterior half of the left temporal lobe (chamber Horsley-Clark center = 15 mm A). All animal procedures were performed in compliance with the standards of the Baylor College of Medicine Animal Research Committee and the American Physiological Society.
Horizontal and vertical eye positions were monitored using the scleral search coil (Robinson 1963). Each channel was low-pass filtered at a corner frequency of 400 Hz and was digitally sampled at 1 kHz with a resolution of ∼0.003°. The instrumentation time lag was <1.5 ms, the RMS noise in each channel was 0.025°, and accuracy was ∼0.1°. Saccades greater than ∼0.2° were reliably detected in real time using speed criteria (saccade start: speed >24°/s; saccade end: speed <16°/s). The methods for detecting saccades and calibrating retinal locations with monitor locations are described in detail elsewhere (DiCarlo and Maunsell 2000).
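The speed criteria above amount to a two-threshold (hysteresis) detector. The following Python sketch illustrates that logic on eye-position samples. It is not the authors' implementation, which is described in DiCarlo and Maunsell (2000). The function name `detect_saccades`, the sample-to-sample speed estimate, and the synthetic trace are assumptions made for illustration.

```python
import math

def detect_saccades(h, v, fs=1000.0, start_speed=24.0, end_speed=16.0):
    """Return (onset, offset) sample indices of candidate saccades.

    h, v: horizontal and vertical eye position in degrees, sampled at fs Hz.
    A saccade starts when speed exceeds start_speed (deg/s) and ends when
    speed falls below end_speed (deg/s).
    """
    saccades = []
    onset = None
    for i in range(len(h) - 1):
        # Speed from the displacement between consecutive samples
        # (assumed estimator; a real system would likely smooth first).
        speed = math.hypot(h[i + 1] - h[i], v[i + 1] - v[i]) * fs
        if onset is None and speed > start_speed:
            onset = i
        elif onset is not None and speed < end_speed:
            saccades.append((onset, i))
            onset = None
    return saccades

# Synthetic trace: fixation, a 1 deg rightward movement over 20 ms, fixation.
h = [0.0] * 50 + [0.05 * k for k in range(1, 21)] + [1.0] * 50
v = [0.0] * len(h)
print(detect_saccades(h, v))
```

Using a lower threshold for the end than for the start keeps a single saccade from being split when its speed dips briefly near the start criterion.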
Stimuli were presented on a video monitor (37.5 × 28.1 cm, 75 Hz frame rate, 1,600 × 1,200 pixels) positioned 62 cm from the monkey so that the display subtended ±17° (h) and ±13° (v) of visual angle. The background luminance of the monitor was 22 cd/m²; it was the only light source in the room. Both animals worked with the same fixed set of five achromatic forms (Fig. 1A). Each form was constructed by connecting line segments (0.02° width) to form the stimulus outline. This outline shape was then convolved with a difference-of-Gaussians spatial filter (0.01° SD positive, 0.02° SD negative) so that the average luminance over each form was the same as the monitor background (Fig. 10A). The peak luminance was set to the monitor maximal white (46 cd/m²). The size and spatial frequency content of the forms were tuned to allow us to study both the effects of free viewing (DiCarlo and Maunsell 2000) and of stimulus position (current study). Specifically, based on the animal's performance with stimuli placed at a range of eccentricities, we chose the stimulus size so that recognition accuracy was good for stimuli at 1.5° eccentricity (Fig. 2) but was approaching chance levels for stimuli at ∼5° eccentricity (monkey 1 = 0.52° width, monkey 2 = 0.68° width). Although acuity limits depend on the forms to be distinguished, at 1.5° eccentricity acuity is reduced to 40-60% of that observed at the center of gaze (Ludvigh 1941; Merigan and Katz 1990), and retinal cone density is ∼40% of maximal (Curcio et al. 1987; Perry and Cowey 1985).
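The display geometry quoted above can be checked with a few lines of trigonometry. This Python sketch assumes a flat screen viewed on-axis from 62 cm. The helper names `half_angle_deg` and `deg_per_pixel` are illustrative and do not come from the original methods.

```python
import math

def half_angle_deg(extent_cm, distance_cm):
    """Visual angle from display center to edge for a flat screen viewed on-axis."""
    return math.degrees(math.atan((extent_cm / 2.0) / distance_cm))

def deg_per_pixel(extent_cm, n_pixels, distance_cm):
    """Approximate angle subtended by one pixel near the display center."""
    return math.degrees(math.atan((extent_cm / n_pixels) / distance_cm))

h_half = half_angle_deg(37.5, 62.0)      # horizontal half-extent, deg
v_half = half_angle_deg(28.1, 62.0)      # vertical half-extent, deg
px = deg_per_pixel(37.5, 1600, 62.0)     # deg per pixel, horizontal

print(round(h_half, 1), round(v_half, 1), round(px, 3))
```

The half-extents round to the ±17° and ±13° given in the text. The per-pixel angle of roughly 0.02° suggests that the 0.02° outline width is close to a single pixel on this display.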
Some neurons in monkey 1 were also studied with a second set of target objects that had the same shapes as the original stimuli but substantially different elemental spatial frequency content (Fig. 10, see RESULTS). These were constructed with the same outline shapes, except that the outlines were 0.04° wide and were not filtered with the difference-of-Gaussians spatial filter. Instead, to keep the average luminance over the stimulus near the background luminance, each of these stimulus shapes was added to a negative (i.e., below the average luminance), circularly symmetric Gaussian (0.3° SD). The amplitude of this Gaussian was set so that the average luminance over a 2° square window centered on the stimulus was the same as the background luminance.
Basic form recognition task
Both animals performed a form recognition task. Four of the five stimulus forms were designated as targets; the remaining form was the distractor (Fig. 1A). Four response locations near the corners of the monitor (16.8° from the display center) were at all times indicated by identical white squares (0.6 × 0.6°, 46 cd/m2; Fig. 1B). For each animal, each target form was assigned a different response location, and this mapping never changed. When a target was presented, the animal was required to signal the target form by making a saccade to the appropriate response location. Saccades that ended within a window [±11.9° (h) and ±4° (v)] around any response location were scored as a response. The horizontal width of these windows was chosen to ensure that the animal would register a response if it produced the same saccade vector from a broad range of absolute horizontal eye positions where targets could be encountered during free viewing studies, described elsewhere (DiCarlo and Maunsell 2000). Correct responses produced a juice reward and a brief tone. Reaction time was defined as the duration between target onset and the start of the response saccade.
Each trial began with the presentation of a small, white fixation point (0.1 × 0.1°) near the display center (Fig. 1B). The animal was required to bring and hold its gaze within ±0.5° of the point. The fixation point was extinguished 300 ms after acquisition, and one of the five forms was immediately presented in one of three positions: at the center of gaze, 1.5° to the left of the center of gaze (ipsilateral to the recorded hemisphere), or 1.5° to the right of the center of gaze (contralateral to the recorded hemisphere). Because we desired identical retinal stimulation for all trials within a condition and because position variability on the retina can produce neuronal response variability (Gur and Snodderly 1987), the three positions were always specified relative to the animal's center of gaze at the end of the fixation period. That is, the three positions were specified in retinal coordinates rather than monitor coordinates. Over all recording sessions, the mean center of gaze at the end of the fixation period was 0.01° (h) and 0.13° (v) (monkey 1) and -0.02° (h) and 0.14° (v) (monkey 2) from the fixation point center (h and v SD ∼0.14° in both monkeys). On each trial, the stimulus form (4 target forms) and the position of the form (3 possible positions) were each randomly chosen with equal likelihood and were presented only briefly (mean: ∼290 ms, see following text). Thus the animal could not bias spatial or featural attention differently on each trial because it could not predict the position or form of the target. These 12 trial types were presented in blocks such that a correctly completed trial type was not presented again until all trial types were correctly completed.
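One way to read the block structure described above is as a pool of 12 trial types from which a type is removed only after it has been completed correctly. The Python sketch below illustrates that reading. The scheduler `run_block`, the placeholder form names, the sign convention for position, and the 90% performance model are all assumptions for illustration, not details taken from the original task software.

```python
import random
from itertools import product

def run_block(trial_types, is_correct, rng):
    """Present trial types in random order until each has been completed correctly.

    A correctly completed type leaves the pool; an incorrect or failed
    trial keeps its type in the pool so it can be drawn again.
    """
    remaining = list(trial_types)
    presented = []
    while remaining:
        trial = rng.choice(remaining)   # equal likelihood among remaining types
        presented.append(trial)
        if is_correct(trial):
            remaining.remove(trial)
    return presented

forms = ["target1", "target2", "target3", "target4"]   # placeholder names
positions_deg = [-1.5, 0.0, 1.5]                       # negative = left of gaze (assumed sign)
trial_types = list(product(forms, positions_deg))      # 4 forms x 3 positions = 12 types

rng = random.Random(0)
# Hypothetical performance model: 90% correct on every trial.
order = run_block(trial_types, lambda trial: rng.random() < 0.9, rng)
print(len(trial_types), len(order))
```

Because a block ends only when every type has been completed correctly, each block contributes exactly one correct trial per condition, which equalizes the number of correct trials across the 12 conditions.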
After a target form was presented, the animal was allowed to respond as rapidly as it liked. If the animal made a saccade that ended >3° (h) or 1° (v) from the fixation point but did not reach one of the response windows, the trial was scored as a failed trial (∼4% of trials). Any eye movement that brought the center of gaze out of the fixation window (±0.5° around the initial fixation point) caused the stimulus to be immediately extinguished. Indeed, on ∼97% of trials in which a form was presented to the left or right of the center of gaze, the animal made a small “adjustment” saccade (mean amplitude = 1.1°; mean duration = 23 ms; latency mean and SD = 140 ± 33 ms) toward the form, and the form was extinguished during this saccade. The animals generated these adjustment saccades without training, and we did not attempt to modify this behavior. Extinguishing the target during the adjustment saccade ensured that the animal could not acquire information about target form from a retinal position other than the initially stimulated position (see Fig. 11). In these trials, the monitor phosphors that comprised the form were last excited 22.5 ms (mean; 95% range = 10-36 ms) before the saccade out of the fixation window was completed. Because the phosphors decayed exponentially with a time constant of <1 ms, the extinguished form could not have been visible at the end of the adjustment saccade (Michelson contrast <10⁻⁹ on average; 95% upper bound = 5 × 10⁻⁵). After the adjustment saccade, the animal's gaze typically remained at the new, now empty, position (i.e., near the original target position) for ∼150 ms before the animal began its response saccade (i.e., to 1 of the 4 response locations). This pattern of eye movements was observed in essentially all correct trials in both animals (monkey 1: 93% of central position trials, 95% of eccentric position trials; monkey 2: 88%, 99%; see Fig. 3, top).
In the remaining central position trials, the animals made a small saccade (typically <0.5°) before the response saccade. In the remaining eccentric position trials, no adjustment saccade was detected.
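The argument that the extinguished form was invisible by the end of the adjustment saccade rests on exponential phosphor decay. The Python sketch below works through that arithmetic. It assumes a time constant of exactly 1 ms (the upper limit given in the text) and treats the residual luminance fraction as an upper bound on the remaining contrast. The function name `residual_fraction` is illustrative.

```python
import math

def residual_fraction(elapsed_ms, tau_ms=1.0):
    """Fraction of phosphor luminance remaining after elapsed_ms of exponential decay."""
    return math.exp(-elapsed_ms / tau_ms)

mean_case = residual_fraction(22.5)    # mean interval since last excitation
short_case = residual_fraction(10.0)   # shortest interval in the 95% range

print(mean_case < 1e-9, short_case < 5e-5)
```

Both comparisons hold, in line with the contrast bounds quoted above (<10⁻⁹ on average and a 95% upper bound of 5 × 10⁻⁵). A faster phosphor would give even smaller residuals.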
Additional task conditions
We also recorded data while the animal performed the basic recognition task in the presence of visual clutter. For these trials, the single target form was embedded in a horizontal row of 20 identical distractor forms with a 1.5° center-to-center separation (see Fig. 13) (see also Fig. 1 of DiCarlo and Maunsell 2000). Trials run with clutter were run in separate blocks, and these blocks were interleaved with the primary behavioral task blocks.
Monkey 1 was also studied in a version of the basic recognition task in which target shapes were presented not just at the central three positions but also at more eccentric positions along the horizontal meridian (±4.5° in 1.5° increments). Initially, the animal's performance was better than chance for targets presented in these more eccentric positions (∼52% accuracy; chance is 25%), indicating that the animal had generalized the task (i.e., shape identification regardless of retinal position). After ∼2 wk of training, performance gradually improved but was still not as good as the central three positions (see RESULTS) and was very poor for some target shapes. Because of this, we did not force the animal to complete an equal number of correct trials for each target in each position but instead included neuronal response data from all trials in which the target was presented, regardless of the behavioral outcome (i.e., correct, wrong, or failed).
Recording and data collection
A guide tube (23 G) was used to reach AIT using a dorsal to ventral approach. Recordings were made using glass-coated Pt/Ir electrodes (0.5-1.5 MΩ at 1 kHz), and spikes from individual neurons were amplified, filtered, and isolated using conventional equipment. The superior temporal sulcus (STS) and the ventral surface were identified by comparing gray and white matter transitions and the depth of the skull base with atlas sections. Penetrations were made over a ∼10 × 10 mm area of the ventral STS and ventral surface (Horsley-Clark AP: 10-20 mm, ML: 14-24 mm) of the left hemisphere of each animal. In both animals, the penetrations were concentrated near the center of this region, where form-selective neurons were more reliably found. Using electrolytic lesions and fluorescent dye (DiI, Molecular Probes) to coat the electrode (DiCarlo et al. 1996), we confirmed that the bulk of the recordings from the first animal were on the ventral surface, centered ∼10.5 mm posterior of the temporal pole, lateral of the anterior middle temporal sulcus (AMTS). Based on the anterior-posterior coordinates and the sulci, this region is approximately the anterior third of IT and is contained in area TE (Felleman and Van Essen 1991; Logothetis and Pauls 1995; Logothetis and Sheinberg 1996). We refer to this region as AIT (Felleman and Van Essen 1991).
The animal cycled through behavioral blocks as the electrode was advanced into AIT. Responses from every isolated neuron were assessed with an audio monitor and on-line histograms, and data were collected from even marginally responsive cells under the assumption that longer periods of observation might reveal statistically detectable effects. Data from each recorded neuron were considered for further analysis if isolation was maintained for at least six presentations (mean = 8.5, maximum = 10) of each target form in each position during all task conditions (∼20-35 min of recording). The responses of 220 AIT neurons (monkey 1 = 119, monkey 2 = 101) were recorded. Among these, 74 (33%) were not considered for further analysis because they failed to produce a statistically significant response to any of the three tested retinal positions (described in the following text). The presence of these 74 unresponsive neurons in the recorded data set is consistent with our low threshold for selecting neurons during the recording sessions. Most of the neurons were located on the ventral surface (127 of 146; 87%); the rest were in the ventral bank of the STS. For brevity, the data from both animals were combined in some plots, and summary values for each animal are indicated in the text and figure legends.
Only neuronal responses collected during correctly completed behavioral trials were included in the analyses (88% of trials; except Fig. 8, see METHODS). We also excluded trials in which eye movements >0.3° occurred during the first 50 ms after target onset (<1% of all correct trials) or those in which the animal began its response saccade <100 ms after target onset (<<1% of all correct trials). We estimated the background firing rate of each neuron as the mean rate of firing over all trials in a 100-ms-duration window that directly preceded target onset. For the majority of the data (where only 3 positions were tested), we quantified the response of each neuron to each of the 12 stimulus conditions (4 forms × 3 positions) as the mean response in a 150-ms window that began 100 ms after target onset. One advantage of the behavioral task is that the choice of the temporal analysis window was constrained by both the start of the AIT responses (∼100 ms after stimulus onset, see Fig. 13) (see also Baylis et al. 1987; Vogels and Orban 1994) and by the animal's reaction times (∼300 ms after stimulus onset, see Fig. 2B). The results were largely unaffected by the details of the analysis time window (see RESULTS).
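The response measure defined above (mean rate in a 150-ms window starting 100 ms after target onset, relative to the rate in the 100 ms before onset) can be sketched as follows. Spike times are assumed to be in milliseconds relative to target onset. The function names and the toy spike trains are illustrative only.

```python
def rate_in_window(spike_times_ms, start_ms, duration_ms):
    """Firing rate (spikes/s) in the half-open window [start, start + duration)."""
    n = sum(1 for t in spike_times_ms if start_ms <= t < start_ms + duration_ms)
    return n / (duration_ms / 1000.0)

def mean_rate(trials, start_ms, duration_ms):
    """Mean firing rate across trials for one analysis window."""
    return sum(rate_in_window(s, start_ms, duration_ms) for s in trials) / len(trials)

def driven_response(condition_trials, all_trials):
    """Mean response above background for one stimulus condition.

    Background: 100-ms window directly preceding target onset, over all trials.
    Response: 150-ms window beginning 100 ms after target onset.
    """
    background = mean_rate(all_trials, -100.0, 100.0)
    response = mean_rate(condition_trials, 100.0, 150.0)
    return response - background, background

# Toy data: two trials of one condition (also used here as "all trials").
trials = [[-50, 120, 130, 200, 240, 400], [-80, -20, 110, 150, 249]]
driven, background = driven_response(trials, trials)
print(round(driven, 2), background)
```

In a full analysis the background would be pooled over every trial of the neuron, while the response window would be evaluated separately for each of the 12 conditions.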
The mean response above background for each of the 12 stimulus conditions (4 target forms × 3 positions) was used to determine the form and position preferences of each neuron. Eight neurons that showed decreases in firing rates in all 12 conditions were excluded from further analyses. We defined the neuron's best and worst target forms as those that produced the largest and smallest mean response over all three positions. Likewise, we defined the neuron's best and worst positions as those that produced the largest and smallest mean response over all four target forms. Responsive neurons (n = 146 of 220) were defined as those that showed a statistically significant increase in firing rate (relative to background rate) to their best target form presented in any of the three positions (3 t-tests, each run at P = 0.017). Because we selected the neuron's best target before running these tests, the overall false positive level exceeds the nominal value; Monte Carlo simulation shows that this procedure gives an overall false positive level of 0.075. The main result (Fig. 6) was unaffected when false positive levels of 0.05 (n = 140), 0.01 (n = 128), and 0.001 (n = 101) were applied.
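The inflation of the false positive level by selecting the best target before testing can be illustrated with a small Monte Carlo simulation. The Python sketch below is a simplified stand-in for the authors' simulation, whose details are not given in the text. It assumes that each condition's test statistic is an independent standard normal z-score under the null hypothesis (a z-test rather than a t-test, with the background rate treated as known), and that each test uses a two-sided criterion at P = 0.017 with only increases counted. It is therefore not expected to reproduce the reported value of 0.075 exactly.

```python
import random
from statistics import NormalDist

def false_positive_rate(n_sim=20000, n_forms=4, n_pos=3, alpha=0.017, seed=0):
    """Estimate how often an unresponsive neuron would be classified as responsive."""
    rng = random.Random(seed)
    # Two-sided criterion at alpha, counting only increases (assumption).
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(n_sim):
        # Null neuron: every condition's z-score (response vs. background) is N(0, 1).
        z = [[rng.gauss(0.0, 1.0) for _ in range(n_pos)] for _ in range(n_forms)]
        # "Best" form = largest mean (equivalently, largest sum) over positions.
        best = max(z, key=sum)
        # Responsive if the best form passes the test at any position.
        if any(zi > z_crit for zi in best):
            hits += 1
    return hits / n_sim

rate = false_positive_rate()
print(rate)
```

Without the selection step, three such tests would give a combined level of about 0.025 under these assumptions. Choosing the form with the largest mean biases its z-scores upward and raises the rate above that value.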
In Fig. 6, we used the RF data of Op de Beeck and Vogels (2000) to predict the expected neuronal sensitivity to our tested positions. That report is the most quantitative study of IT RFs currently available. It showed that Gaussian sensitivity profiles fit most of the measured IT RFs, and it provided the distribution of RF sizes and RF centers. Based on those data, we simulated the position sensitivity of 10,000 randomly selected (normally distributed), circularly symmetric Gaussian RFs using the following parameters: mean RF size (square root of RF area) = 10.3°; RF size SD = 5°; min RF size = 2°; mean RF center azimuth = 1.5° (contralateral); mean RF center elevation = 0.0°; RF center SD = 1.5° (azimuth and elevation).
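The simulation described above can be sketched in Python as follows. Several details are assumptions, because the text does not specify them: RF area is taken as the region above 50% of the peak response, sizes below the 2° minimum are clipped rather than redrawn, and positive azimuth is taken as contralateral. The function names are illustrative.

```python
import math
import random
import statistics

POSITIONS_DEG = (-1.5, 0.0, 1.5)   # tested azimuths; positive = contralateral (assumed sign)

def sigma_from_rf_size(size_deg):
    """Gaussian SD for an RF whose size is the square root of its area.

    The RF area is taken as the region above 50% of peak response (assumption).
    """
    radius = size_deg / math.sqrt(math.pi)          # area = pi * radius^2
    return radius / math.sqrt(2.0 * math.log(2.0))  # exp(-r^2 / (2 sigma^2)) = 0.5

def simulate_relative_responses(n=10000, seed=0):
    """Worst/best response ratio across the three tested positions for n random RFs."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n):
        size = max(2.0, rng.gauss(10.3, 5.0))   # clip at the minimum RF size (assumption)
        sigma = sigma_from_rf_size(size)
        cx = rng.gauss(1.5, 1.5)                # RF center azimuth, deg
        cy = rng.gauss(0.0, 1.5)                # RF center elevation, deg
        resp = [math.exp(-((x - cx) ** 2 + cy ** 2) / (2.0 * sigma ** 2))
                for x in POSITIONS_DEG]
        ratios.append(min(resp) / max(resp))
    return ratios

rel = simulate_relative_responses()
median_change = 1.0 - statistics.median(rel)
print(round(median_change, 2))
```

Because all three tested positions lie on the horizontal meridian, the elevation offset scales every response by the same factor and cancels in the ratio, so the predicted sensitivity depends only on RF size and center azimuth.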
Two monkeys were trained to identify four target forms by making a saccade to one of four fixed locations (Fig. 1). Each target form was presented to the fixating animal at one of three retinal positions on the horizontal meridian (center of gaze, 1.5° left of center, and 1.5° right of center). Both animals were highly accurate at this task (Fig. 2A). Accuracy was best at the central position (monkey 1 = 94% correct, monkey 2 = 88% correct) and only slightly reduced at the eccentric positions (monkey 1 = 3% decrease in accuracy; monkey 2 = 8%). Mean reaction times were short in both animals (monkey 1 = 285 ms; monkey 2 = 303 ms) and were little affected by position (Fig. 2B). Although these behavioral effects of position were small, most were statistically significant because of the large number of behavioral trials examined (∼4,500 trials for each animal in each position; accuracy: monkey 1: χ² = 3.0, P > 0.05; monkey 2: χ² = 22.2, P < 0.01, df = 2; reaction time: monkey 1: F = 53, P < 0.01; monkey 2: F = 287, P < 0.01). In sum, the behavior showed excellent position tolerance: both animals could rapidly and accurately identify each target form, regardless of its position, and without foreknowledge of precisely where it would appear.
If individual AIT neurons were underlying the animal's recognition, the behavioral observations suggested that these neuronal responses should be largely unaffected by these small position changes. Likewise, previous studies showing AIT RFs to be 10° or more in diameter (see INTRODUCTION) also predicted that the neuronal responses should be largely unaffected by our small position changes. To examine these predictions, we analyzed data from all 146 recorded neurons that were responsive in at least one position (72 from monkey 1, 74 from monkey 2; see METHODS). Consistent with previous studies (Logothetis and Sheinberg 1996; Miyashita 1993; Tanaka 1996), many of the recorded neurons were selective for stimulus form (n = 54 of 146, see later). However, the AIT neuronal responses in our animals were largely inconsistent with the large RFs previously reported in AIT (see INTRODUCTION). In particular, almost all neurons showed a stronger than expected sensitivity to small (1.5°) position changes, and some were exquisitely sensitive to these position changes. Responses from one such neuron are shown in Fig. 3. The middle panel shows that when targets were presented at the center of gaze, the neuron responded strongly to two of the target forms but gave little response to the other two. That is, this neuron was highly form selective at the center of gaze (ANOVA, P < 10⁻⁷). However, the neuron produced almost no response when the same target forms appeared either 1.5° ipsilateral or 1.5° contralateral to the center of gaze. Thus this neuron was selective for stimulus form but responded only over a very limited range of stimulus positions (assuming that positions more eccentric than the tested three would yield little or no response, see following text). It should be emphasized that all three tested retinal positions were within the fovea (±2°). One interpretation of these observations is that the neuron had a very small RF near the center of gaze (i.e., <2° in diameter).
However, because we did not perform full RF mapping for most neurons and because some neurons showed more than one hot spot in their RF (e.g., Fig. 4), we use the term position sensitivity to describe the effect of our tested position changes on the neuronal responses.
The neuron in Fig. 3 could contribute to form discrimination at the central fovea, but it is poorly suited for the eccentric positions just 1.5° away. However, the animals were highly accurate at identifying target forms at all three retinal positions. If AIT supported recognition at all three positions, one would expect to find neurons that showed form selectivity at eccentric positions. Indeed, we also encountered many neurons that preferred stimuli at one or both of the eccentric locations. For example, the response pattern of the neuron shown in Fig. 4 was complementary to that of the previous neuron in that it was most responsive to stimuli presented in the contralateral position, with some response in the ipsilateral position, and almost no response in the central position.
In light of previous studies, the observation that AIT neuronal responses change with stimulus position is not surprising. Indeed, any neuron must show some position sensitivity, at least at the edges of its RF. However, the neuronal position sensitivity was typically much larger than that previously reported or expected based on reported RF sizes in AIT. Indeed, many neuronal responses were so strongly affected by retinal position that they failed to respond at one or two of the three tested locations (all were within the fovea). Among the neurons that were responsive in at least one location, 77 (52%) gave no statistically significant response for one or both of the remaining positions (t-test), and 18 of these gave no statistically significant response to the central fovea (using the best target form for all tests). This was not due to the neurons being poorly responsive overall because the mean driven response rate at preferred positions was 24.3 spikes/s (n = 146), comparable to rates previously reported in AIT (20-40 spikes/s) (Leuschow et al. 1994; Missal et al. 1999; Op de Beeck and Vogels 2000). The examples in Fig. 5 illustrate the range of position and form sensitivities seen in the recorded population.
To summarize the position sensitivity of each neuron, we plotted its reduction in response when its best target form was presented in its worst position (relative to the response in its best position; Fig. 6). The median relative response was 0.41. In other words, the response of the typical AIT neuron in our sample could be reduced by ∼60% when the neuron's preferred stimulus form was moved within a region of only ±1.5° around the center of gaze. If we only consider neurons that prefer the center of gaze (i.e., where we clearly included the RF center), assume 2D Gaussian shaped RFs, and define RF cutoff at 50% (as in previous studies, see Op de Beeck and Vogels 2000), then this median decrease over a position change of 1.5° corresponds to a median RF diameter of 2.6°. This is not an artifact of noisy responses—the result was nearly identical when the data were split in half and one group was used to compute the best and worst targets and positions and the other group used to compute the position sensitivity.
Because form-selective neurons are most likely to underlie the recognition behavior, it is possible that they have less position sensitivity (because the behavior showed virtually no position sensitivity). However, examination of the 54 neurons (37%) that were selective for stimulus form (ANOVA, P < 0.05) revealed even greater position sensitivity (median = 0.27) than that seen in the entire responsive population (Fig. 6). Under the RF assumptions described above, this corresponds to a median RF diameter of 2.2°.
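The conversion from a relative response to an equivalent RF diameter follows directly from the Gaussian RF assumption stated above. If the response falls to a fraction r of its peak at a distance d from the RF center, then r = exp(-d² / 2σ²), and the diameter at the 50% cutoff is 2σ√(2 ln 2). The Python sketch below applies this to the two medians reported in the text. The function name `rf_diameter_deg` is illustrative.

```python
import math

def rf_diameter_deg(relative_response, shift_deg=1.5, cutoff=0.5):
    """Equivalent RF diameter for a 2D Gaussian RF centered on the best position.

    relative_response: response at shift_deg from the center, as a fraction of peak.
    cutoff: fraction of peak response that defines the RF border.
    """
    # Solve relative_response = exp(-shift^2 / (2 sigma^2)) for sigma.
    sigma = shift_deg / math.sqrt(-2.0 * math.log(relative_response))
    # Radius at which the Gaussian falls to the cutoff fraction of its peak.
    radius = sigma * math.sqrt(-2.0 * math.log(cutoff))
    return 2.0 * radius

all_responsive = rf_diameter_deg(0.41)   # median of all responsive neurons
form_selective = rf_diameter_deg(0.27)   # median of form-selective neurons
print(round(all_responsive, 1), round(form_selective, 1))
```

The two results round to 2.6° and 2.2°, matching the diameters quoted in the text.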
To compare the distribution of position sensitivities of the recorded population (Fig. 6) with that predicted from previous studies, we estimated the expected AIT position sensitivity using the RF data from a recent, thorough study of AIT RFs (Op de Beeck and Vogels 2000) (see METHODS). Those data predict that the median AIT neuron should have shown only an 18% maximal response change across our three tested positions, roughly threefold less than the ∼60% we observed.
The stronger than expected position sensitivity could be due to changes in overall responsivity at some retinal positions (e.g., due to small RFs), changes in form preference at each retinal position, or both. The example neuronal data (Figs. 3, 4, 5) suggest the former hypothesis. This hypothesis also seemed most likely because previous studies have reported that the rank order of form preference is largely unaltered by changes in position (e.g., Desimone et al. 1984; Ito et al. 1995; Sary et al. 1993; Schwartz et al. 1983). However, because we found much greater position sensitivity than previous studies, we sought to confirm that it acted across all stimulus forms. Because the position sensitivity of the neuronal responses was so strong, we could not test this hypothesis for about half the neurons because the 1.5° position shifts eliminated the response (e.g., Fig. 3). Even when responses remained at non-preferred positions, they were so weak that most neurons were no longer significantly form selective at those positions. Specifically, 54 of the 146 responsive neurons (37%) were significantly form selective at their best position but less than half of these (25 of 54) were still significantly form selective at their second best position. Nevertheless, 24 of these 25 neurons maintained the rank order of their best and worst forms at their second best position.
To summarize the average effect of position changes on form selectivity, we split the 54 form-selective neurons into three groups, where each group preferred one of the three tested positions (n = 2, n = 35, n = 17 for the ipsi, central, and contra positions). We then rank-ordered the target forms for each neuron and averaged the normalized (to best response) responses of all neurons in the group for each rank-ordered form in each position (Fig. 7). This analysis showed that, on average, neurons that preferred the central position (Fig. 7, left) maintained their rank order of form preferences at the eccentric positions and showed a strong response reduction in each side position that operated largely as a decrease in response gain over all four target forms (gain of ∼0.4 across the 1.5° position changes). Results were similar for neurons that preferred the contralateral position, but the decrease in response gain was slightly weaker (Fig. 7, right). In summary, although we found much greater position sensitivity than most previous studies, the results were consistent with other studies in that, when it could be measured, the rank order of target form preference was largely unaffected by position. Thus the strong position sensitivity observed in this study is most consistent with the hypothesis that the neurons have small RFs (∼2.5° diam), or that those RFs contain unresponsive locations (e.g., Fig. 4).
We could not fully characterize the spatial RFs of the neurons because we tested only three positions. Because the animal's task was to identify forms at these positions, our logic was that the position sensitivity of AIT neurons responding to any of these positions would provide the most appropriate measurement of the position sensitivity of AIT neurons that might support the behavior. Exploration of additional retinal positions could only show that we had underestimated the neuronal position sensitivity. However, we wondered if our measurements were on the edge of some RFs or if they always included the RF center (i.e., maximal response position). Although a thorough exploration of these RF issues is the focus of future studies, we have collected preliminary data from 17 responsive neurons in one animal (monkey 1). For these neurons we extended our measure of position sensitivity along the horizontal meridian by placing stimuli at four additional positions eccentric to those tested for the larger neuronal population. In particular, we tested horizontal eccentricities of -4.5 to +4.5° in 1.5° increments (Fig. 8). Although the animal performed well above chance the first day it saw these new positions, the animal received additional training to better acclimate it to the occurrence of targets at these new positions (see METHODS). After training, the animal's performance at these positions was reduced relative to the more central positions, but was well above chance (70 and 62% correct at 3.0 and 4.5° eccentricity, respectively). Each neuron's preferred target form was determined from the central three positions as before, and the response to that target plotted as a function of position. Of the 17 neurons tested, no neuron gave a significantly larger mean response to any of the more eccentric positions than it did to the best of the original, central three positions (t-test, P = 0.05). Data from four representative neurons are shown in Fig. 8. 
Thus although the RF shape varied from neuron to neuron, the extended field mapping suggests that the RF centers of the tested neurons were within the original three positions.
Time course of position sensitivity
We next sought to determine if the position sensitivity was present in the earliest part of the responses or if it developed over time. For example, perhaps the AIT neurons had different response latencies for different positions. Inspection of the data revealed little evidence of large differences in latency across stimulus position (e.g., Fig. 4), but we examined the time course for subtle effects. As a first step, we re-analyzed the entire data set using two other analysis windows (100–200 and 150–250 ms after stimulus onset) with little effect on any of the results. The median position sensitivity ratios using these time windows were similar (0.36 and 0.38, respectively; cf. Fig. 6). An ideal analysis would estimate each neuron's response latency for each position, but this is problematic because of the limited number of trials and because many neurons did not respond to nonpreferred positions. Instead we estimated the population time course of the position sensitivity by computing the population average response to each neuron's best target form presented in the neuron's best and worst positions (Fig. 9). For the best position, AIT neurons began to respond ∼100 ms after stimulus onset. For the worst position, the average response began slightly later, rose more slowly, and reached a lower peak. The plot suggests that latency differences across stimulus position account for only a small amount of the position sensitivity reported above. To quantify this, we found the temporal shift and scale factor that could be applied to the average response in the worst position to best match the average response in the best position (RMS error function). The fit was good (correlation coef = 0.976, 0–300 ms after stimulus onset; dashed line in Fig. 9), and it required a temporal shift of 19 ms and a vertical scale factor of 2.7.
The scale factor is an estimate of the amount of position sensitivity not due to latency differences, and it shows that mean position sensitivity (worst/best position) was 0.37 (i.e., 1/2.7), which is comparable to the median effect of 0.41 already described. In summary, changes in response gain with position underlie almost all of the position sensitivity reported in this study.
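The shift-and-scale fit described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function name `fit_shift_and_scale`, the 1-ms sampling, the brute-force search over integer shifts, and the synthetic response shape are all assumptions made for illustration. For each candidate shift, the least-squares scale factor is computed in closed form, and the shift with the lowest RMS error is kept.

```python
import math

def fit_shift_and_scale(best, worst, max_shift):
    """Find (shift, scale, rms) so that scale * worst[t + shift] best matches best[t].

    best, worst: average responses sampled at equal time steps (hypothetical input).
    max_shift: largest shift, in samples, to try.
    """
    best_fit = None
    for shift in range(max_shift + 1):
        w = worst[shift:]            # worst-position response moved earlier by `shift`
        b = best[:len(w)]            # overlapping part of the best-position response
        denom = sum(x * x for x in w)
        if denom == 0:
            continue
        # Closed-form least-squares scale factor for this shift.
        scale = sum(x * y for x, y in zip(w, b)) / denom
        rms = math.sqrt(sum((y - scale * x) ** 2 for x, y in zip(w, b)) / len(w))
        if best_fit is None or rms < best_fit[2]:
            best_fit = (shift, scale, rms)
    return best_fit

def bump(t, onset):
    """Synthetic response shape (an assumption, not the recorded data)."""
    return 0.0 if t < onset else (t - onset) * math.exp(-(t - onset) / 40.0)

# Synthetic example: the worst-position response is the best-position response
# delayed by 19 samples and reduced by a factor of 2.7.
best = [bump(t, 100) for t in range(300)]
worst = [bump(t, 119) / 2.7 for t in range(300)]

shift, scale, rms = fit_shift_and_scale(best, worst, max_shift=50)
sensitivity = 1.0 / scale  # worst/best ratio not attributable to latency
```

With these synthetic inputs the search recovers the shift and scale that were built in, and the reciprocal of the scale factor gives the worst/best ratio in the same way that 1/2.7 gives 0.37 in the text.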
Because we found much greater position sensitivity than almost all previous studies of AIT (but see DISCUSSION), we considered factors that might explain this finding. The most intriguing possibilities require further systematic study (see DISCUSSION). However, here we report our examination of three possible artifacts that might have contributed to our findings: stimulus spatial frequency content, differences in eye movements across position, and differences in stimulus duration across position.
The first factor we considered was the spatial frequency composition of the target forms. The target forms were made of line segments with a high spatial frequency content (∼25 cycles/°, see METHODS). Because stimulus form (identity) depended on the spatial arrangement of these line segments, the spatial frequencies that supported the animal's differentiation of the forms were much lower (∼5 cycles/°)—near the maximal contrast sensitivity for primates (Merigan and Maunsell 1993). Indeed, the stimuli had spatial frequency content similar to that of individual letters during normal reading. Nevertheless, we considered the possibility that the spatial frequency content of the stimulus elements was responsible for the strong position sensitivity. We created a set of four new targets that had the same size and spatial layout as the original four targets, but whose line segments contained lower spatial frequencies (Fig. 10). One of the animals (monkey 1) was retrained to respond to these four modified targets using the same form-response mapping as the four original targets even when both target types were randomly interleaved across trials (∼1 wk of training). We recorded the responses of an additional 15 AIT neurons to each of the eight targets in each of the three original positions. We measured position sensitivity for each spatial-frequency condition exactly as before with the exception that each neuron's best target and best and worst positions were chosen after averaging the data from the two spatial-frequency conditions (results were nearly identical when each condition was considered separately). The analysis showed that some neurons were less sensitive to the position of the modified stimuli (Fig. 10C) but that other neurons were equally (Fig. 10D) or more position sensitive (Fig. 10E). 
Over the population (n = 15), the median position sensitivity for the original stimuli was nearly identical to that measured in the larger group of neurons (0.37) and was not significantly different from the population position sensitivity measured with the modified stimuli (median = 0.33; t-test, P = 0.60). Thus these data suggest that the strong position sensitivity cannot be simply explained by the spatial-frequency content of the stimulus elements per se (but see DISCUSSION).
The second and third potential artifacts we considered were differences in eye movements and differences in stimulus duration across target position. As described in METHODS, we did not place strong constraints on the animal's eye movements but ensured that the target was only presented at the intended retinal position. Because of this, the animal's pattern of eye movement and the stimulus duration were both confounded with the primary variable of retinal position. These confounds are illustrated in Fig. 11, A-C. We admitted these confounds in our design because we wanted the task to remain as natural as possible while still varying the retinal position of the target forms. As a result, it is possible that the shorter stimulus exposure durations used for eccentric stimuli (∼150 ms) relative to the central stimuli (∼300 ms) could affect response rate and cause apparent strong position sensitivity. This seemed unlikely because rapid presentation of stimuli indicates little peak response reduction for stimulus exposure durations greater than ∼50 ms (Keysers et al. 2001) and because the latency of AIT neurons to stimulus onset is ∼100 ms (Fig. 9) (Baylis et al. 1987; DiCarlo and Maunsell 2000; Vogels and Orban 1994). If stimulus offset requires the same latency as stimulus onset to alter AIT firing rates, then the offset of the target form would not alter the response until the end of the analysis window (i.e., 100 ms after the form offset is ∼250 ms). A second possibility is that the neuronal processes that produce eye movements toward the target (“adjustment saccades” in Fig. 11; see METHODS) could cause a change in ongoing AIT neuronal activity (e.g., a “reset” signal or saccadic suppression). The fact that the monkeys' reaction times were nearly identical for central and eccentric stimulus positions argues against this possibility (Fig. 2) but does not exclude it. 
Because the two confounding factors (stimulus exposure duration and time of adjustment saccade) were perfectly correlated in our design, we cannot distinguish their effects, so we considered them to be a single confound and performed analyses to isolate the effect of this confound from that of stimulus position.
One analysis is summarized in Fig. 11 (D and E). Each point in each panel is the response rate of one neuron on one trial relative to the average response rate of this neuron over all trials with the neuron's best form in one position. These normalized trial-by-trial responses are plotted relative to the time that the adjustment saccade (i.e., the confound) occurred for that trial. Thus these plots show the average effect of the confound on response rate (isolated from the effect of stimulus position). If the confound had a consistent effect across the population of AIT neurons (e.g., decrease in ongoing neuronal responses), the running averages in the plots should show a trend. Instead, no trends were apparent and the correlation coefficients were not significantly different from zero (-0.012, -0.030, P > 0.1). The two symbol types in the plots indicate data from the two monkeys, illustrating that monkey 2 tended to make adjustment saccades at shorter latencies than monkey 1. This difference in behavior does not obscure a relationship between the time of the adjustment saccade and response rate because the within-animal correlations are also not significantly different from zero (monkey 1: -0.051, -0.021; monkey 2: 0.013, -0.043; P > 0.1 all cases). In addition, the mean of the normalized responses on trials where no adjustment saccade occurred was not significantly different from that expected based on trials where an adjustment saccade was made (t-test against a value of 1, P > 0.1 for the ipsilateral and contralateral conditions). If the confound causes some neurons to increase their firing rates and others to decrease, the analysis in Fig. 11 might fail to detect these effects. 
However, a neuron-by-neuron analysis revealed that only ∼5% of neurons (8% for ipsilateral stimuli, 3% for contralateral stimuli) showed any significant correlation of response rate with adjustment saccade latency (Spearman ranked correlation, P < 0.05), which is approximately the number expected by chance. Furthermore, a mixture of positive and negative effects should increase the variability of relative response rates (i.e., the SD of the ordinate values in Fig. 11) relative to that which would have been observed without the effects. Instead, the observed SDs (ipsi: 0.50, contra: 0.47) were slightly below those obtained from simulated trial-by-trial responses using the average rates observed in the actual population and Poisson firing statistics (ipsi: 0.53, contra: 0.50) (see Shadlen and Newsome 1994 for Poisson assumption; Softky and Koch 1993). In summary, because these analyses failed to find a significant effect of the time of the adjustment saccade (and stimulus offset) on the response rate, we conclude that these factors did not significantly modify the AIT responses and thus they cannot explain the position sensitivity of those responses.
Behavioral significance of neuronal position sensitivity
Unlike almost all previous studies of AIT RFs or AIT position tolerance, the current data were collected while the subjects performed recognition across changes in object position. Thus we were also able to examine position sensitivity in the context of that behavior. Here we present three such analyses.
In the first analysis, we adopt a standard view of AIT in which the purported role of AIT neurons is to extract object identity and to support the “perceptual equivalence” of the same object over changes in, for example, object position (e.g., Desimone et al. 1984; Gross and Mishkin 1977). This hypothesis predicts that individual AIT neurons should be capable of signaling object identity across changes in object position that are “perceptually equivalent.” Testing this prediction depends on defining both perceptual equivalence and the manner in which AIT neurons signal or code object identity. The spirit of perceptual equivalence is that the subject's interpretation of the identity of the object remains the same over changes in, for example, object position. The animal's accurate identification of each object across changes in position (even for less trained positions, see METHODS) suggests that it treats each object as equivalent across position. Thus we assume that AIT neurons should signal object identity across these same position changes. We defined an AIT neuron's ability to signal object identity as its response to its best target form relative to a distractor response (d′). The distractor response was taken to be the maximal response to the neuron's worst target form over all three positions. We then asked, how well does each neuron continue to signal its preferred object across the tested position changes?
The results from the 54 form-selective neurons are shown in Fig. 12. Almost all of these neurons provided a strong signal of target identity at their preferred position. In particular, 41 of the 54 neurons (76%) had d′ values >1.35 (discrimination performance of 75% correct) at their preferred position. However, only three of the neurons (6%) could continue to provide this target identity signal (d′ > 1.35) at all three of the tested positions. Put another way, the typical form-selective neuron could correctly discriminate its best target from the distractor on 83% of the trials (median d′ = 1.89), but a position change within 1.5° of the fovea caused that same neuron's performance to fall to near chance (median d′ = 0.15; 53% correct discrimination; 50% is chance). In sum, these data show that only a few AIT neurons are individually capable of mediating perceptual (behavioral) equivalence.
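The correspondence between the d′ values and the percent-correct figures quoted above can be checked with a short Python sketch. It assumes the standard equal-variance Gaussian model with an unbiased criterion, under which percent correct is Φ(d′/2); the exact definition used in METHODS is not reproduced here, and the function name is hypothetical.

```python
import math

def percent_correct_from_dprime(d_prime):
    """Percent correct (as a fraction) for an unbiased observer: Phi(d'/2).

    Assumes equal-variance Gaussian response distributions (an assumption
    of this sketch). Phi is the standard normal cumulative distribution.
    """
    return 0.5 * (1.0 + math.erf(d_prime / (2.0 * math.sqrt(2.0))))

criterion = percent_correct_from_dprime(1.35)       # about 0.75
best_position = percent_correct_from_dprime(1.89)   # about 0.83
worst_position = percent_correct_from_dprime(0.15)  # about 0.53
```

Under this assumption the three d′ values map onto the 75%, 83%, and 53% correct figures given in the text.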
In the second analysis, we ask: were the AIT neurons better at signaling object identity or object position? Tovée et al. (1994) asked this question in passively fixating animals and showed that the median AIT neuron carried four times as much information about object identity as it did about object position. However, comparison of position sensitivity and form sensitivity is problematic because it depends on the tested range of objects and positions. The comparison is only meaningful in the context of a behavioral task. In particular, if the putative role of AIT neuronal responses is to inform the animal about object identity regardless of small changes in object position, then AIT responses must be more sensitive to an identity change that is critical to the animal's task than to a position change that is irrelevant in that task. Our behavioral task was specifically designed to test this hypothesis, because it required the animal to signal object identity (stimulus form) regardless of position.
We compared the position and form sensitivity of the population of AIT neurons. The median position sensitivity was 11.5 spikes/s (n = 146; best-worst position; monkeys 1 and 2 = 13.1 and 10.9) and the median form sensitivity was 10.4 spikes/s (best-worst form; monkeys 1 and 2 = 10.5 and 10.2). If we consider only the 94 neurons that showed a statistically significant effect of either identity or position or an interaction (2-way ANOVA, P < 0.05), the median sensitivity differences were 14.3 spikes/s (position) and 13.8 spikes/s (form) and the median sensitivity ratios were 3:1 (position) and 2.4:1 (form). In summary, the AIT neurons were slightly more sensitive to differences in position within the fovea that were irrelevant to the task than they were to differences in target form that were critical to the task. These data cannot rule out the possibility that the object position information conveyed in the AIT responses is completely ignored by downstream brain areas. However, these data suggest that the role of AIT neurons is to provide the animal with a representation of both object identity and object position and that the representation of object position can be of much higher spatial resolution than previously appreciated.
So far we have focused on the idea that to perform position-tolerant recognition, the brain should seek large RFs and thus less neuronal position sensitivity. However, there may be competing behavioral demands for small RFs and thus more position sensitivity (i.e., as seen in this study). In this third analysis, we consider one of those behavioral demands—recognition in visual clutter. Before any recording began, both animals were successfully trained to recognize each target in each position even when the target form was flanked on both sides by a row of distractors (see Fig. 13A and METHODS; mean behavioral accuracy was 87% with clutter vs. 88% without clutter). We considered the hypothesis that small RFs (i.e., high position sensitivity) might have developed to protect each neuron's response, and thus the animal's behavior, from the influence of flanking visual clutter by limiting its intrusion into the RF.
We compared each neuron's responses at its best position with and without the flanking distractors. Consistent with the small RF hypothesis, addition of this flanking visual clutter only slightly decreased each neuron's response to its best target at its best position (median 23% decrease; Fig. 13B). Similarly, clutter had only modest effects on form selectivity. For 52 of the 54 (96%) form selective neurons, the response to the best target form remained above the response to the worst target form when clutter was added to the display, and clutter reduced median form selectivity by 24% (from 20.0 to 15.2 spikes/s; best-worst form). Because these effects of clutter are relatively mild, they are consistent with the small RF hypothesis. A more convincing test would ask if neuronal immunity to clutter is negatively correlated with RF size. However, when we took position sensitivity as an inverse measure of RF size, and form sensitivity in clutter as a measure of clutter immunity, we found no such relationship (Fig. 13C). We also observed little relationship between position sensitivity and responsivity in clutter (data not shown). In summary, this study suggests a relationship between position sensitivity and clutter immunity because it reports a much stronger effect of position than most previous studies and a weaker effect of clutter than most previous studies (Chelazzi et al. 1998; Miller et al. 1993; Missal et al. 1999; Rolls and Tovee 1995; Sato 1988). However, this relationship may not be simply explained by the hypothesis that these neurons have small RFs because the stronger position sensitivity was not associated on a neuron-by-neuron basis with improved immunity to clutter.
It is thought that the position tolerance of object recognition is supported by the large RFs of individual AIT neurons and their ability to maintain target preferences within those large RFs. Here we provide data relevant to that hypothesis by examining the effect of small differences in object position on recognition behavior and AIT neuronal responses. Behavioral accuracy and reaction times were largely unaffected by the differences in position. However, individual AIT responses were remarkably sensitive to position. The median AIT neuron showed ∼60% decrease in response when stimuli were shifted within ±1.5° from the center of gaze, and 52% of neurons were unresponsive to one or two positions within this range. Although we did not systematically characterize the size of the AIT RFs, the position sensitivity would be explained by a median RF diameter of ∼2.5°. For comparison, a recent, systematic study (Op de Beeck and Vogels 2000) of AIT RFs estimated a mean diameter of ∼10°. Most studies have reported even larger RFs (e.g., 30° in diameter or more) (Boussaoud et al. 1991; Desimone et al. 1984; Gross et al. 1969, 1972; Kobatake and Tanaka 1994; Richmond et al. 1983). Although we report much greater position sensitivity than previous studies, we do not refute or discount the results of those studies. Instead we believe that our observations point to several hypotheses whose exploration might unify previous observations and, in the process, provide a much deeper understanding of the tolerance properties of AIT neurons.
Consistent with previous studies, we found that many AIT neurons were highly sensitive to the form of visual stimuli (reviewed by Logothetis and Sheinberg 1996; Miyashita 1993; Tanaka 1996). We considered the possibility that the strong position sensitivity of our recorded neurons was due to changes in form preferences across position. However, our findings were consistent with other reports (Desimone et al. 1984; Leuschow et al. 1994; Logothetis and Sheinberg 1996) in that the rank order of target preferences was largely maintained across responsive locations. In sum, the primary novel finding of this study is that AIT neurons can be highly sensitive to retinal position and thus appear to have much smaller RFs than previously reported. Control experiments and analyses revealed that this observation was largely unaffected by substantial changes in the spatial frequency content of the stimuli, was not an artifact of missing the RF center, and was not due to differences in eye movements or stimulus exposure duration at each position. Earlier studies did not use such small, precisely positioned stimuli and therefore would not have been able to measure position sensitivity at this spatial scale.
Like this study, several other studies have probed IT position tolerance by testing a few positions for changes in responsivity and selectivity. Ito et al. (1995) selected stimuli to optimize IT neuronal responses in anesthetized monkeys and then reported that a position change of 5° produced a ∼30% response decrease. This is about sixfold less position sensitivity than reported here. Studies in awake, passively fixating animals typically show even less position sensitivity (Desimone et al. 1984; Gross et al. 1969, 1972; Kobatake and Tanaka 1994; Richmond et al. 1983; Tovée et al. 1994). For example, Tovée et al. (1994) showed that the response rates of neurons responding best to face stimuli did not fall to <50% of maximal until the center of the face was displaced by >15°. In monkeys performing a delayed match-to-sample task, Leuschow et al. (1994) showed that a position change of 5° produced only a ∼25% decrease in response rate.
Notably, the data that most closely approach those reported here come from monkeys trained to recognize wire-frame objects (Logothetis et al. 1995). In that study, Logothetis and colleagues tested the position sensitivity of nine AIT neurons that were tuned for specific views of the wire-frame objects, while the animal maintained passive fixation. For three neurons, they reported that responses were largely insensitive to position differences of at least ±2° but less than ±7.5° (i.e., a RF size of 4–15° in diameter) (Logothetis et al. 1995). When position tolerance was measured as the size of the region where the response to the best form remained above the responses of a large set of distractors, the typical tolerance region was found to be only ∼4° in diameter (see Riesenhuber and Poggio 1999). This means that the RF size in that study was still larger than that predicted by the current observations (∼4° vs. ∼2.5° in diameter). Nevertheless, the results of Logothetis and colleagues may be the most consistent with our results because they also studied animals highly trained to recognize specific stimuli (see following text).
At least four possibilities could explain the unexpectedly strong position sensitivity of the AIT neurons in this study. First, the animal was actively performing a recognition task, whereas most previous studies of position effects in IT have been carried out in anesthetized or passively fixating animals (Desimone et al. 1984; Gross et al. 1969, 1972; Ito et al. 1995; Kobatake and Tanaka 1994; Logothetis et al. 1995; Op de Beeck and Vogels 2000; Richmond et al. 1983; Tovée et al. 1994). Second, by allowing the animal to respond as rapidly as it liked, we obtained a short (150 ms), physiologically meaningful time window in which to analyze neuronal response rates. Previous studies generally averaged response rates over much longer, arbitrary periods of time. It seems unlikely that either of these possibilities accounts for the strong position effects reported here. The first possibility is unlikely because task effects in IT are typically weak (Vogels et al. 1995) and greater position tolerance has been observed in behaving animals (Leuschow et al. 1994; Logothetis et al. 1995). The second possibility is unlikely because initial response transients can dominate response rates computed over longer time windows in early visual areas (e.g., Muller et al. 2001) and perhaps IT (e.g., Logothetis et al. 1995) (see also Figs. 3 and 4).
A third possibility is stimulus size: we used stimuli that were much smaller (0.6° width) than those used in previous studies. Although the effect of stimulus size on AIT position tolerance has not been thoroughly studied, some data suggest that AIT neurons are somewhat less tolerant to position changes of small stimuli (Op de Beeck and Vogels 2000), and at least one computational model of recognition predicts that the position tolerance of AIT neurons will decrease with smaller stimuli (Gochin 1994). Furthermore, comparison across studies suggests that position tolerance is roughly proportional to stimulus size. For example, Tovée et al. (1994) employed the largest stimuli (8.5-17°) of any study of IT position tolerance and also reported the most position tolerance (only a 50% response decrease over >12°). Most studies used ∼5° wide stimuli and found that IT response rates fell by ∼25% over a 5° position change (Desimone et al. 1984; Ito et al. 1995; Leuschow et al. 1994; Missal et al. 1999). Using these results as a benchmark, a straightforward scaling of position tolerance with stimulus width predicts that our 0.6° wide stimuli should have caused response rates to drop by 62% with our 1.5° position change. Indeed, we found that the median neuron's response rate fell by ∼60%.
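The scaling prediction above can be reproduced with a few lines of Python. This is a sketch of one way to read "straightforward scaling": it assumes the response drop grows linearly with the position shift and that position tolerance is proportional to stimulus width. The function and parameter names are hypothetical.

```python
def predicted_response_drop(ref_drop, ref_shift_deg, ref_width_deg,
                            shift_deg, width_deg):
    """Linearly scale a benchmark response drop to a new shift and stimulus width.

    Assumes (for illustration only) that the fractional drop is proportional
    to the position shift and inversely proportional to stimulus width.
    """
    return ref_drop * (shift_deg / ref_shift_deg) * (ref_width_deg / width_deg)

# Benchmark from the text: ~25% drop over a 5 deg shift with ~5 deg wide stimuli.
# Present study: 1.5 deg shift with 0.6 deg wide stimuli.
predicted = predicted_response_drop(0.25, 5.0, 5.0, 1.5, 0.6)  # about 0.625
```

Under these assumptions the benchmark scales to a drop of about 62%, close to the ∼60% median decrease reported for the recorded neurons.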
In the limit, AIT position sensitivity must depend on stimulus size. An AIT neuron (and the subject) can only be position tolerant within the sampling limits of the retina—it cannot respond selectively to stimuli that are not sampled by the retina with sufficient resolution to distinguish between them. As the retinal images of the stimuli to be recognized are made smaller (e.g., increasing viewing distance), this loss of discriminability should first occur at eccentric positions where retinal sampling density is lowest (Curcio et al. 1987; Perry and Cowey 1985). Thus even if AIT neurons were always maximally position tolerant, the tolerance region must be smaller for smaller stimuli. This could manifest itself in several ways. For instance, AIT RFs might shrink toward the fovea when measured with smaller and smaller stimuli. Alternatively, AIT neurons with large regions of position tolerance measured with large stimuli might simply not respond to small stimuli, as other AIT neurons begin to respond, albeit with less position tolerance (i.e., smaller RFs). However, this alternative may be inconsistent with data showing IT neurons to be largely size-invariant (Desimone et al. 1984; Ito et al. 1995; Logothetis et al. 1995; Sary et al. 1993; Schwartz et al. 1983). One of the goals of future studies is to address these issues by comparing both the size and position tolerance properties of AIT neurons with the tolerance properties of the recognition behavior they are thought to support.
Although retinal sampling sets an upper limit on behavioral and AIT neuronal position tolerance, the actual limit may be imposed by further neuronal processing. If this neuronal processing is modified with experience, AIT position tolerance will not correspond to a single measurement, or to a measurement that scales simply with stimulus size, but instead it will be highly dependent on the experience of the observer with the objects to be discriminated. Thus a fourth possibility that might explain the strong position sensitivity found in this study is visual experience. In particular, our animals had extensive experience with four fixed target objects at the three highly controlled retinal positions. This visual experience might have caused some AIT neurons to become tuned to target forms at those specific positions. Indeed, this hypothesis is suggested by the profile of some of the AIT RFs (Fig. 8). Besides the position-specific experience, the experience with visual clutter may have also enhanced position sensitivity (see Fig. 13). Because the animals had approximately equal visual experience with each position during several months of recording, and we found that most neurons preferred the central position (Fig. 6), visual experience alone may not suffice to explain our results. Nevertheless, if the strong position sensitivity of the AIT neurons reported here depends at all on visual experience, these effects must be understood, because they bear directly on the mechanisms that underlie AIT position tolerance.
To our knowledge, no study has asked if AIT position sensitivity can be modified by visual experience, but some studies have touched on closely related issues. Several previous studies have shown that experience results in IT neurons that are tuned to the form of familiar stimuli (Kobatake et al. 1998; Logothetis et al. 1995; Miyashita 1988). For example, some AIT neurons are tuned to specific, trained views of familiar objects (Booth and Rolls 1998; Logothetis et al. 1995; Logothetis and Pauls 1995). Indeed, some studies suggest that AIT neurons become tuned to discriminate stimulus forms that are often encountered by the animal (Kobatake et al. 1998; Sigala et al. 2002; Young and Yamane 1992). At the level of behavior, a large literature has emerged describing perceptual learning tasks in which performance improvements on various types of visual discriminations (e.g., orientation discrimination) are specific to the trained retinal location but typically show inter-ocular transfer (e.g., Crist et al. 1997; Goldstone 1998; Schoups et al. 1995). These studies illustrate that recognition (i.e., stimulus form discrimination) is not always fully position-tolerant, even at equal eccentricities. Because IT RFs are generally thought to be too large to account for the position specificity of perceptual learning, it has been argued that the locus of plasticity must be in visual areas where RFs are small but typically binocular, such as V1 or V2 (e.g., Crist et al. 2001; Fahle 1994; Schoups et al. 1995). However, several recent monkey studies in which extensive training resulted in perceptual learning found significant but subtle changes in the response properties of V1 and V2 neurons (Crist et al. 2001; Ghose et al. 2002; Schoups et al. 2001).
Our finding that AIT neurons can be sensitive to stimulus features (i.e., form) over small visual field regions raises the possibility that AIT plasticity could contribute to performance improvements that are specific to trained retinal positions. In other words, the position sensitivity reported here cautions that IT cortex should not be ruled out as a locus of plasticity underlying perceptual learning.
In the present study, we found that individual AIT responses were much more sensitive to target position than was the animal's behavior and that neurons preferred different locations within the trained region of the visual field. This suggests that the tiling of visual space by the RFs of form-selective AIT neurons produced the position-tolerant recognition behavior. If each of these tiles resulted from experience with a particular form at a particular retinal position, then position-tolerant recognition might depend on such experience over a range of positions (Dill and Fahle 1997; Hebb 1949; Nazir and O'Regan 1990). Although computational considerations argue that position tolerance could be achieved with built-in, general-purpose mechanisms that require experience with only a single retinal image of an object (e.g., Olshausen et al. 1993; Salinas and Abbott 1997; Ullman 1996; Vetter et al. 1995), this does not rule out the possibility that the brain has adopted an experience-dependent, “brute force” solution. Indeed, the idea of learned tolerance is not new, and it has been proposed to explain several types of tolerance at various cortical levels (e.g., Poggio 1990; Wallis and Rolls 1997), including the limited position tolerance of complex cells in V1 (Foldiak 1991).
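The tiling argument above can be illustrated with a toy calculation. The sketch below is not the analysis used in this study; it is a minimal illustration under stated assumptions. The Gaussian RF profile, its width (`RF_SIGMA`), the form gain values, and the max-pooling readout are all hypothetical choices made only to show how position-sensitive, form-selective units could jointly support position-tolerant identification across the three tested positions.

```python
import math

# Toy illustration only. All parameter values below are assumptions,
# not quantities measured or reported in the study.
POSITIONS = [-1.5, 0.0, 1.5]  # tested retinal positions (deg)
N_FORMS = 4                   # four target forms
RF_SIGMA = 1.0                # assumed Gaussian RF width (deg)
PREFERRED_GAIN = 1.0          # assumed response to the preferred form
OTHER_GAIN = 0.2              # assumed response to non-preferred forms


def unit_response(pref_form, rf_center, form, pos):
    """Response of one hypothetical unit: form gain times a Gaussian RF."""
    gain = PREFERRED_GAIN if form == pref_form else OTHER_GAIN
    spatial = math.exp(-((pos - rf_center) ** 2) / (2.0 * RF_SIGMA ** 2))
    return gain * spatial


def decode_form(form, pos):
    """Max-pool over RF centers for each preferred form, then pick the best."""
    pooled = [
        max(unit_response(f, c, form, pos) for c in POSITIONS)
        for f in range(N_FORMS)
    ]
    return pooled.index(max(pooled))


# A single unit centered at 0 deg loses most of its response at 1.5 deg,
# yet the pooled readout identifies the form at every tested position.
single_center = unit_response(0, 0.0, 0, 0.0)
single_offset = unit_response(0, 0.0, 0, 1.5)
all_correct = all(
    decode_form(f, p) == f for f in range(N_FORMS) for p in POSITIONS
)
```

In this sketch each unit is strongly position sensitive, but because some unit preferring each form is centered at each tested position, the pooled readout is unaffected by position. If the tiles arose from experience, a position without a corresponding tile would be expected to show poorer readout, which is the experience-dependent prediction discussed above.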
The idea that position-tolerant recognition depends on visual experience at those positions seems at odds with the widely held belief that if we learn to recognize an object at one retinal position, recognition automatically transfers to other retinal positions. However, this expectation appears to rest not on a body of psychophysical data but on introspection in everyday situations where recognition is assisted by both eye movements and extensive retinal experience with the objects we recognize (Nazir and O'Regan 1990). In fact, few psychophysical studies have examined the position tolerance of recognition. Although some suggest automatic position tolerance (Biederman and Cooper 1991; Ellis et al. 1989), others, including the perceptual learning studies already mentioned, indicate that recognition is best at positions where the subject has the most experience (Dill and Edelman 2001; Dill and Fahle 1997, 1998; Foster and Kahn 1985; Nazir and O'Regan 1990). The likely resolution of these different results is that the role of experience in position tolerance depends on the stimuli to be discriminated (Dill and Edelman 2001).
Present address of J. J. DiCarlo, McGovern Institute for Brain Research, Dept. of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139
We thank C. Boudreau, E. Cook, G. Ghose, C. Hocker, and T. Yang for discussions on design, analysis, and presentation; D. Murray and T. Williford for technical assistance; and T. Poggio, M. Riesenhuber, and S. Treue for helpful comments.
This work was supported by National Eye Institute Grant EY-05911. J.H.R. Maunsell is an investigator with the Howard Hughes Medical Institute.
Copyright © 2003 by the American Physiological Society