Volen Center for Complex Systems, Brandeis University, Waltham, Massachusetts 02454
Submitted 10 September 2002; accepted in final form 19 March 2003
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
10 ms), suggesting that during the subsequent recognition process there can be interaction of B-U, T-D, and lateral information flow (Hupe et al. 2001).
A second important property of information flow in cortex is that it is controlled by attentional processes. In a simplified view, attention can be considered a window (Broadbent and Broadbent 1990; Campbell 1985; Nakayama 1991; Van Essen et al. 1991) that is moved either by eye movements or by "covert" processes that occur without eye movements (Posner et al. 1980). Recent work using imaging methods in human V1 (Smith et al. 2001) and in monkey lateral geniculate nucleus (LGN) and V1 (Tootell et al. 1998; Vanduffel et al. 2000) and psychophysical methods (Bahcall and Kowler 1999; Caputo and Guerra 1998) indicates that attention imposes a ring of inhibition around an attended item, indicating a limitation to the window analogy. Attention can also be directed toward nonspatial properties, such as color or event, but even in such cases, spatial localization remains important (Bichot et al. 1999; Chun 2001; Mozer and Sitton 1998; Nissen 1985; Snyder 1972; Tsal and Lavie 1988).
The properties of attention have been explored using visual search tasks. In some cases, an object can be so distinct from nearby distractors that the time required to find the object is independent of the number of distractors (e.g., the "pop-out" of color). However, if the target and distractors cannot be simply distinguished (e.g., most letters), the time required to identify the target increases linearly with the number of distractors, consistent with a serial search process (Treisman and Gelade 1980; Treisman and Souther 1985). The slope of this linear relationship suggests that covert attention can shift serially 50 times/s (Horowitz and Wolfe 1998). These internally generated shifts skip over an intervening obstacle without any time cost, suggesting that attention jumps rather than moves along a path (Egeth and Yantis 1997). Recent recordings from cortex have provided direct evidence for rapid, internally generated, covert shifts of attention (Woodman and Luck 1999). Attention shifts can also be generated by the onset of external stimuli, but these take longer to generate (Motter 1994; Ward et al. 1996) than internally generated shifts measured under the same conditions (Wolfe et al. 2000). The attention dependence of neuronal firing is most prominent in higher level areas (V4 and beyond) (Moran and Desimone 1985) but can also be detected in V1 and the lateral geniculate (reviewed in Kastner and Ungerleider 2000). The effects in V1 and geniculate may reflect feedback from attentionally modulated processes in higher areas (Martinez et al. 1999).
In this paper, we have attempted to understand how an attention-based recognition process could be organized by bidirectional processing in the cortical hierarchy. There has been considerable previous theoretical work on the role of bidirectional information flow in cortex (Cave 1999; Dayan et al. 1995; Grossberg 2001; McClelland and Rumelhart 1981; Rao and Ballard 1999; Tononi et al. 1992; Ullman 1995). There has also been considerable interest in how attention is moved (Koch and Ullman 1985; Phaf et al. 1990; Olshausen et al. 1993; Tsotsos et al. 1995) based on the saliency of information in the visual image (reviewed in Itti and Koch 2001) and in some cases the role of T-D information flow in controlling attention has been considered (Cave 1999; Phaf et al. 1990; Tsotsos 1990; Tsotsos et al. 1995). However, this work has dealt with visual search tasks or attentional tasks rather than single stimulus recognition tasks. In search tasks, the identity of the item is already known, and the subject attempts to find it among multiple items in the visual scene. In attentional tasks (Phaf et al. 1990), attentional control is directed to one stimulus attribute on the basis of another attribute, e.g., the subject must "name the form on the left." Here, we consider pure recognition tasks in which a single item is presented and the subject must compare it to the many items stored in memory until a match is found. The problem of moving attention during recognition has recently been addressed (Schill et al. 2001), although not in neural terms.
The involvement of attention in recognition raises two fundamental questions: how can the information obtained in multiple samples be integrated and how is the window of attention moved? A successful model of recognition should provide physiologically plausible answers to these questions and satisfy basic constraints provided by psychophysical measurements. Perhaps the most fundamental of these measurements is that recognition time for targets within contextually constrained sets in long-term memory depends logarithmically on set size (Burrows and Okada 1975; Ross 1970). This rules out the possibility that memory is searched serially and therefore implies a parallel process. The basis of this logarithmic dependence is not known. A second important property of recognition is that it can be speeded by high level contextual information as seen in semantic priming experiments (Neely 1991). A third property of recognition is demonstrated in visual search for target words: when subjects search lists for target words, they search faster through nonword distractors than through word distractors (Graboi 1974). We have attempted to develop a model that accounts for these findings.
| RESULTS |
|---|
|
|
|---|
Description of model
The general idea is as follows. The network has three hierarchical levels corresponding to the feature, letter, and word levels. Nodes at the "word" level are active at the beginning of the recognition process, provided they are consistent with current contextual constraints (an inclusion process). B-U flow of information through a narrow window of attention then leads to the inactivation (exclusion) of nodes that are inconsistent with the sampled information, thereby reducing the number of possible words. Recognition occurs when the serially sampled information leads to the inactivation of all but one word node. We will show that there are algorithms for moving attention that make the exclusion process efficient. These algorithms make use of T-D connections to compute the relative probability of each feature, given the set of still-possible words. Algorithms to move attention using both this T-D information and B-U information about which features are actually present can exclude a large fraction of words on each cycle. A diagram of the information flow is given in Fig. 1. What follows is a more detailed description of these processes.
PROPERTIES OF THE HIERARCHICAL LEVELS. At the feature level, there is a frame for the detection of four-letter words with a subframe for each letter (Fig. 2). Within each subframe there are 14 feature detectors used to distinguish letters in the font we have used (Fig. 3; note simplified font in Figs. 1 and 2). These detectors are sensitive to oriented line segments in a manner similar to the simple cells of V1. For simplicity, we assume that the sensory input drives the feature detectors between two states, "there" and "not there." This binary simplification is warranted, given the high-contrast stimuli used to obtain the experimental results we seek to account for. At the letter level there are four subframes, one for each letter position. Each subframe has 26 nodes representing each of the possible letters. At the word level, each node represents one of the stored common (nonpejorative) English four-letter words (typically 950). In the computer implementation, feature nodes receive input from pixel nodes having differing positions along a line segment, as in the Rumelhart (1971) model. However, because this pixel processing does not affect the function of the model, it will not be discussed further.
SPECIFICATION OF T-D AND B-U CONNECTIONS. Collectively, the highly specific connections in the model represent the long-term memory of the structure of letters and words. These connections obey a simple "compositional rule": word nodes make T-D excitatory connections to all the letter nodes that compose the word; similarly letter nodes are connected to the feature nodes that compose the letter (Fig. 2). B-U connections connect features to all the letter nodes that contain that feature; similarly letter nodes are connected to the word nodes that contain that letter (Fig. 2).
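The compositional rule can be sketched as a pair of lookup tables plus their inverses. This is a minimal illustration, not the authors' implementation: the seven-segment `TOY_FONT`, the two-word memory `WORDS`, and the function name `build_connections` are all hypothetical, and the toy font stands in for the 14-feature font of the model.

```python
from collections import defaultdict

# Hypothetical toy font (seven segments, indexed 0-6); NOT the paper's 14-feature font.
TOY_FONT = {
    "A": {0, 1, 2, 4, 5, 6},
    "D": {1, 2, 3, 4, 6},
    "L": {3, 4, 5},
    "O": {0, 1, 2, 3, 4, 5},
    "Y": {1, 2, 3, 5, 6},
}
WORDS = ["LADY", "DOLL"]  # hypothetical two-word long-term memory


def build_connections(words, font):
    """Return T-D and B-U connection tables derived from the compositional rule."""
    # T-D: word node -> the (position, letter) nodes that compose the word.
    word_to_letters = {w: {(pos, ch) for pos, ch in enumerate(w)} for w in words}
    # T-D: letter node -> the (position, segment) feature nodes that compose the letter.
    letter_to_features = {
        (pos, ch): {(pos, seg) for seg in segs}
        for pos in range(4)
        for ch, segs in font.items()
    }
    # B-U connections are the same links traversed in the opposite direction.
    feature_to_letters = defaultdict(set)
    for letter, feats in letter_to_features.items():
        for feat in feats:
            feature_to_letters[feat].add(letter)
    letter_to_words = defaultdict(set)
    for word, letters in word_to_letters.items():
        for letter in letters:
            letter_to_words[letter].add(word)
    return word_to_letters, letter_to_features, feature_to_letters, letter_to_words


w2l, l2f, f2l, l2w = build_connections(WORDS, TOY_FONT)
```

Storing B-U links as the inverse of the T-D tables keeps the two directions consistent by construction, which mirrors the statement that both sets of connections encode the same long-term memory.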
RECOGNITION BY EXCLUSION OF ALL BUT ONE WORD. We assume that at the start of the recognition process the word nodes for contextually possible words are active. This leads to activity at letter and feature levels as computed by T-D linear summation processes and provides information used by the selective attention process (see following text). The B-U flow from each feature selected by attention will strongly excite all letter nodes that contain the feature, and these will excite the word nodes that contain these letters. Those nodes that do not receive excitation are assumed to be strongly inhibited by those that do and become inactive; this inactivation persists for the rest of the recognition process. We term this the process of "exclusion." The major phase of the recognition process is completed when all but one of the initially possible words has been excluded. This is sufficient for recognition if the subject can be certain that the items being presented are known words. If the task is such that the subject cannot be certain, an additional cycle, termed the "confirmation phase," is required. This will be described later.
We further assume that the activation of word nodes is normalized; as word nodes are excluded, the activity of the remaining active word nodes increases accordingly. As a result, the activity level is inversely proportional to the number of still possible words and represents word probability. Thus for the word node corresponding to the presented word, the probability will increase from a small value at the start of the recognition process to a value of 1 when recognition occurs. An important consequence of normalization is that the compositional rule for T-D processing leads straightforwardly to the computation of feature probabilities, which can then be used to efficiently move attention (see following text).
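Exclusion followed by normalization can be sketched in a few lines. This is an illustrative sketch only: the seven-segment `TOY_FONT`, the four-word starting set, and the helper names `features_of`, `exclude`, and `word_probabilities` are assumptions introduced here, and words are collapsed directly to feature sets rather than passing through explicit letter nodes.

```python
# Hypothetical toy font (seven segments, indexed 0-6); NOT the paper's 14-feature font.
TOY_FONT = {
    "A": {0, 1, 2, 4, 5, 6},
    "D": {1, 2, 3, 4, 6},
    "L": {3, 4, 5},
    "O": {0, 1, 2, 3, 4, 5},
    "R": {4, 6},
    "Y": {1, 2, 3, 5, 6},
}


def features_of(word):
    """All (position, segment) features that compose a four-letter word."""
    return {(pos, seg) for pos, ch in enumerate(word) for seg in TOY_FONT[ch]}


def exclude(active, sampled_feature):
    """Inactivate every word node that does not contain the sampled 'there' feature."""
    return {w for w in active if sampled_feature in features_of(w)}


def word_probabilities(active):
    """Normalized activity: each still-possible word gets 1 / (number still possible)."""
    return {w: 1.0 / len(active) for w in active}


active = {"LADY", "LOAD", "LORD", "YARD"}  # hypothetical contextually possible words
active = exclude(active, (0, 4))           # segment 4 in letter position 0
probs_after_first = word_probabilities(active)
active = exclude(active, (1, 6))           # segment 6 in letter position 1
probs_after_second = word_probabilities(active)
```

In this toy run the activity of the surviving word nodes rises as competitors are excluded, reaching 1 when a single word remains. Note that `word_probabilities` assumes at least one word is still active.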
Selective attention algorithm (SAA) moves the window of attention during each cycle of the iterative recognition process
Although research shows that attention can be more complex than a simple "window," location is nevertheless always important (Bichot et al. 1999; Chun 2001; Mozer and Sitton 1998; Nissen 1985; Snyder 1972; Tsal and Lavie 1988), and it is the movement of attention to different locations that we address in our model. The aperture of the window of attention has not been established with certainty (Chun 2001); we therefore make the worst-case assumption that the window is very small and transmits only a single feature. If recognition under these conditions is feasible, it will only be more so if the window of attention is widened. The window of attention is implemented by "attentional gating nodes" (Fig. 1), a concept that was incorporated into several previous models of attention (e.g., Cave 1999; Tsotsos et al. 1995). These allow further upward signal flow only if attention is moved to that node. In this way, the output from a single-feature node (perhaps in V1) is transmitted B-U to higher-level cortical regions where it leads to the exclusion of the still-possible letters and words that do not contain it. This is followed by T-D computation of a new feature probability landscape, which can then contribute to processes that determine the next location of attention. This model posits continual T-D/B-U processing cycles, each adding the information from a single feature to the accumulating knowledge base associated with the object being recognized. The specific set of computations that determine where attention will next be moved is termed the selective attention algorithm (SAA). Various SAAs for moving the window of attention will be considered later. These make different use of the available T-D and B-U information described in the next two sections.
T-D processing computes feature probabilities from word probabilities
Consider first the case when only one word node is active. It will excite the letter nodes contained in the word; the letter nodes (for each of the 4 positions) will then excite the features contained in those letters. Thus in this case, the feature probability landscape will resemble the word itself. If two words are active, linear summation processes will produce a feature probability landscape that looks like the superposition of two words, with features contained in both words twice as active as features contained in only one. The same logic applies for any number of still-possible words. Thus the feature probability will be directly proportional to the number of still-possible words that contain that feature. Figure 3, 1, shows the a priori feature probabilities for the set of 950 words that are stored in the long-term memory of the system. It is of interest that the probabilities of features are uneven. For instance, the diagonal features are relatively rare. Thus the landscape reflects constraints due to high-level context (which can reduce the number of possible words), the feature composition of letters and the letter composition of words. This probability landscape is a source of information available to the SAA even before a word is displayed. During recognition (Fig. 3, 1-4), the number of still-possible words is gradually reduced, and this, in turn, leads to changes in word probabilities, letter probabilities, and the feature probability landscape.
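The linear summation described above amounts to counting, for each feature, how many still-possible words contain it, and dividing by the number of such words. The sketch below illustrates this under stated assumptions: the seven-segment `TOY_FONT` and the function names are hypothetical and are not the font or code used in the model.

```python
from collections import Counter

# Hypothetical toy font (seven segments, indexed 0-6); NOT the paper's 14-feature font.
TOY_FONT = {
    "A": {0, 1, 2, 4, 5, 6},
    "D": {1, 2, 3, 4, 6},
    "L": {3, 4, 5},
    "O": {0, 1, 2, 3, 4, 5},
    "Y": {1, 2, 3, 5, 6},
}


def features_of(word):
    """All (position, segment) features that compose a four-letter word."""
    return {(pos, seg) for pos, ch in enumerate(word) for seg in TOY_FONT[ch]}


def feature_probabilities(active):
    """T-D landscape: fraction of still-possible words that contain each feature."""
    counts = Counter(feat for word in active for feat in features_of(word))
    return {feat: n / len(active) for feat, n in counts.items()}


one_word = feature_probabilities(["LADY"])           # landscape resembles the word itself
two_words = feature_probabilities(["LADY", "LOAD"])  # superposition of the two words
```

With one active word every feature of that word has probability 1. With two active words, shared features keep probability 1 while features found in only one word drop to 0.5, which is the "twice as active" relation described in the text. Features absent from the returned dictionary have probability 0.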
LOW LEVEL B-U PROCESSING DETERMINES WHICH FEATURES ARE "THERE" AND "NOT THERE." Another source of information available to the SAA is the result of continuous parallel low-level B-U processing of the stimulus from the retina to the primary projection area (V1). This specifies which of the 56 features are "there" (i.e., have contrast) and which are not.
Example of the recognition process
A detailed example of the recognition of a known word, LADY, is shown in Fig. 3. In this example the SAA uses both T-D and B-U information and selects the feature that is "there" that has the lowest probability. In the period before the item is presented, all of the 950 words are active and have equal low probability. From these probabilities, T-D processing computes the a priori feature probabilities shown in Fig. 3, 1. When the word LADY is presented, the recognition process goes through 3 cycles leading to recognition. In the 1st cycle all but 75 words are eliminated; on the 2nd all but 7 are eliminated; on the 3rd cycle, the only still-possible word is the actual word, LADY. This is a sufficient criterion for recognition if the subject knows that only known words are being presented. This example illustrates the ability of the algorithm to eliminate a large percentage (in this case, >90%) of the remaining possible words on each successive cycle. The interested reader can follow each step of this process in Fig. 3. It is noteworthy that although attention acts at a particular place (i.e., gating nodes), the activity of each node at all levels will change as features, letters, and words are excluded. Thus information (a reduction in the number of alternatives) accumulates at all levels during recognition. In the example of Fig. 3, recognition of LADY occurred in a small number of steps. Figure 4A shows the recognition process for four other words, BEAR, CHEW, SURF, and ROSE, and illustrates the variability in the number of cycles required for word recognition. Considering 50 randomly selected cases of word recognition from the set of 950 words, the average was 4.9 cycles.
This form of information processing makes inferences. For example, during recognition of LADY, the system inferred that the first letter was L even though the SAA never moved attention to the first letter position. This inference was based on constraints at the word level: given that the last three letters were ADY, the only known word possible was LADY. The panels in Fig. 3 show (green color) the gradual development of inferred features (features inferred "there," dark green, P = 1; and "not there," light green, P = 0). Note that when there is only one still-possible word, the inferred plus known features exactly resemble the presented word (Fig. 3, 5). In other words, the T-D-computed feature probability map exactly resembles the features of the presented word.
Comparison of different SAAs
As illustrated in the example of Fig. 3, it is possible to determine the number of iterative cycles required for recognition of a given known word. By repeating such measurements for different words, one can determine the average number of cycles required for recognition using different SAAs. This number provides a quantitative measure for determining how the recognition process depends on the number of known words and for comparing the efficiency of different SAAs. Within the context of this model, two sources of information are available for selecting each feature. One source is the feature information provided by parallel low-level B-U processing of the stimulus (which features are "there" and "not there"). As a result of such processing, the visual stimulus activates a subset of the feature nodes in cortex. A second source of information is the feature probability landscape computed T-D. As argued in the preceding text, T-D connections convert word probabilities into feature probabilities. Although the a priori word probabilities are equal, the feature probabilities are not equal (Fig. 3). Furthermore, as word probabilities change during the recognition process, the T-D-computed feature probability landscape changes accordingly.
We have explored several different SAAs, which illustrate different ways of using the available B-U and T-D information. For each SAA, the average number of cycles required for recognition was determined for word sets of varying size ranging from 15 to 950. This number is plotted as a function of log2 of the number of words in long-term memory in Fig. 5. The data were well fit by straight lines (see Fig. 5 caption for details). We first consider an SAA that has predictable properties. This SAA picks a feature that is "there," as determined by low-level B-U processing and that is contained in 50% of the still-possible words (T-D[50%] and B-U[There]). The processing of this feature excludes half the remaining words on each cycle. This implies a slope of 1 when plotted on a log2 axis. The measured slope is 0.98 in good agreement with prediction. In this case, 1 bit of word-level information is acquired per cycle because the number of alternative words is reduced by one-half per feature acquisition.
Several of the SAAs tested were either less effective or only slightly more effective. These included simply picking a feature at random regardless of whether it was "there" or not; picking a feature that was "there" and expected with highest probability (T-D[Highest P] and B-U[There]); sampling the feature location with the lowest probability regardless of whether the feature was "there" or not (T-D[Lowest P]); and picking at random only features that were "there" (B-U [There]).
Two other SAAs we examined were much more efficient than all the others. The simpler of these is the "unidirectional mismatch" computation (B-U [There] and T-D [Lowest P]). This selects a feature that is "there," as determined by B-U computation, and that has the lowest probability, as determined by T-D processing. The other, the "bidirectional mismatch" computation, considers in addition those features that are expected with highest probability but are "not there": whichever form of mismatch is greatest is selected. In the four-letter word-recognition task, this "bidirectional mismatch" algorithm is only slightly more efficient than the "unidirectional mismatch" algorithm. In these two most efficient algorithms, about 2 bits of word-level information are acquired per cycle and the average number of remaining words is cut to about one-fourth by each selection. The observed slopes for these two algorithms are 0.52 and 0.47, respectively.
Three main conclusions can be made on the basis of the data shown in Fig. 5. First, the most efficient SAAs tested use both T-D and B-U information and exclude about twice as many words per cycle as algorithms that use only one source of information. Second, the most important principle that makes for an efficient SAA is to choose a feature with a large mismatch, e.g., a feature that is there, but which is contained in the smallest fraction of the still-possible words. Third, the time required for recognition with the efficient SAAs increases logarithmically with a slope of approximately one half (on log base 2 coordinates) with the number of words in the initial set.
Effects of contextual cueing
We next considered how the recognition process can be affected by contextual information that narrows the range of the initial set of possible words. The hierarchical organization of networks shown in Figs. 1 and 2 could be influenced by a yet higher network whose nodes represent categories of words, such as "animals," "plants," etc. In this case, the activity of particular word nodes would depend on whether the higher level category node to which the word belonged were active. If for example contextual information were present that made only the "animal" category node active, only the subset of word nodes that are in the animal category would be active at the start of the recognition process. The simulation in Fig. 4A shows that the availability of this contextual information reduces the initial set size to the 35 animal words in the list of 950 known words and leads to a dramatic reduction in recognition time.
It is instructive to plot how T-D-computed word probabilities change during the recognition process since neurons might have a firing rate related to item (word) probability. Thus the plots of probability in Fig. 4B may be relatable to electrophysiological data obtained from cortex during the recognition process (see DISCUSSION). It can be seen that when contextual information is introduced (the animal category), the probabilities of word nodes within this context (e.g., BEAR) increase, whereas the probabilities of nodes outside this context (ROSE) drop to zero. These changes reflect the fact that when the probabilities of some words fall, the probabilities of the remaining words necessarily rise. Such reciprocal changes in probability can also be seen during the course of the recognition process. Just after the stimulus BOAR is presented, the node for one word (MULE) stops firing after the first execution of the SAA, but BIRD, BEAR, and BOAR, which resemble each other, rise in probability. When the next feature is sampled, BIRD is eliminated and after one additional sample BEAR is eliminated. BOAR is now the only remaining word node and will fire maximally. This figure illustrates that when high-level (category level) contextual information is supplied, items within the category rise in probability, whereas items outside the category fall in probability. This reciprocal change is indicative of a competitive process. Similarly, this competition is evident throughout the recognition process; whenever the probability of some nodes rises within a given level, the probability of other nodes falls. Nodes representing words similar in shape to the target (e.g., BEAR is similar to BOAR) initially also rise but then fall off relative to the target at a time that increases as the similarity to the target increases. Feature nodes for both geometrically similar and semantically similar words (e.g., words in the same category) are preferentially selected. This may be viewed as a "filter" for feature selection based on both physical shape and semantic constraints.
Recognition when nonwords are possible: properties of the confirmation phase
So far we have considered how recognition can occur when only known words are presented. If both words and nonwords may be presented, then the exclusion of all but one word does not necessarily imply that this word corresponds to the presented word. For instance, if the nonword OADY is presented, the initial steps in this case are identical to those that occur when LADY is presented (Fig. 3, 1-3): after sampling three features, the only remaining known word is LADY. To establish whether all the inferred features correspond to those in the presented item, one additional cycle, which we term the "confirmation phase," is required. Because only one word is active at the word level, the computed feature probabilities will be one for all 19 features that are "there" in LADY and zero for the 37 features that are "not there." If the word presented is in fact LADY, the SAA in the final cycle finds no mismatch, and the word node for LADY will remain active (Fig. 3, 4). The system activity is then stable at all levels, confirming the word LADY. If the word presented is OADY, the feature shown in Fig. 3, 6, will be selected in the final cycle. The processing of this feature will exclude LADY, and the presented word must therefore be classified as an unknown word, i.e., a "nonword." It should be noted that in this example, it takes the same number of cycles to classify OADY as a nonword as it takes to confirm LADY. However, as shown in the next section, on average, nonwords are classified faster than words.
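The confirmation phase can be sketched as a comparison between the T-D prediction from the single remaining word and the features that are actually "there." This is an illustrative sketch only: the seven-segment `TOY_FONT` and the function names are hypothetical, so the feature counts differ from the 19 and 37 quoted above for the model's 14-feature font.

```python
# Hypothetical toy font (seven segments, indexed 0-6); NOT the paper's 14-feature font.
TOY_FONT = {
    "A": {0, 1, 2, 4, 5, 6},
    "D": {1, 2, 3, 4, 6},
    "L": {3, 4, 5},
    "O": {0, 1, 2, 3, 4, 5},
    "Y": {1, 2, 3, 5, 6},
}


def features_of(string):
    """All (position, segment) features that compose a four-letter string."""
    return {(pos, seg) for pos, ch in enumerate(string) for seg in TOY_FONT[ch]}


def confirm(remaining_word, stimulus_features):
    """Return (confirmed, mismatches) for the single still-possible word.

    With one active word, T-D probability is 1 for its features and 0 elsewhere,
    so any feature in the symmetric difference is a mismatch.
    """
    predicted = features_of(remaining_word)
    mismatches = predicted ^ stimulus_features
    return not mismatches, mismatches


word_ok, word_mismatch = confirm("LADY", features_of("LADY"))
nonword_ok, nonword_mismatch = confirm("LADY", features_of("OADY"))
```

When the stimulus is the nonword OADY, the mismatching features all lie in the first letter position; sampling any one of them would exclude LADY and leave no known word, which is how the string is classified as a nonword.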
Processing of words and nonwords
In visual search experiments in which subjects search lists for target words, distractors that are nonwords are classified and rejected more quickly than distractors that are words (Graboi 1974). Moreover, nonwords that are very different from words can be rejected more rapidly than nonwords that are similar to words (Graboi, unpublished). To examine whether these effects are captured by the model, two types of four-letter nonwords were generated: the letters of words in the list of 950 were scrambled to produce nonwords that closely approximate English ("high-bigram" letter strings), and letter strings that are not word-like ("low-bigram" letter strings). For example, the letters in "THAW" can form "WATH" (high-bigram) or "AWHT" (low-bigram). The methods for generating these two types of nonwords are given in the caption of Table 1. The time to classify a letter string as a nonword was taken to be the number of cycles required to eliminate all known words. The criterion for recognition of a word was taken to be the moment when a single word remained and was confirmed. Table 1 shows that it takes the least time on average to classify low-bigram letter strings as nonwords. It takes longer to classify high-bigram letter strings as nonwords and still longer to classify letter strings that are words. This effect occurs because words and nonwords differ statistically in their deviation from the average feature probabilities of words (nonwords will have greater differences); the greater the deviation, the more words can be eliminated on each cycle and the faster the process eliminates all known words.
Studies using rapid serial presentation show that category judgments (e.g., animal/nonanimal) can be made in a very short period of time (Potter 1976; Thorpe et al. 1996). To explore this condition, we extended the simulations shown in Fig. 4 by comparing the processing time required for in-set (animal) and out-of-set (nonanimal) words. The average time to recognize an animal word (including confirmation) was 3.2 cycles. In contrast, a nonanimal word could be rejected as an animal word more quickly (2.3 cycles on average). This effect was significant at the P < .005 level (t = 3.08, df = 18). In 8 of 10 cases when nonanimal words were presented, the number of still-possible words jumped from greater than one to zero in a single step.
| DISCUSSION |
|---|
|
|
|---|
Integration of information
One might postulate that just before the start of the recognition process, brain networks are inactive; feedforward sensory-driven activity might then lead to activation of only those high level nodes that represent the presented item. What we postulate here is a quite different scenario. Even before the item is presented, nodes consistent with the current context are activated, a process that can be considered inclusive. After the item is presented, each serially sampled feature leads to the exclusion of a fraction of the remaining possibilities; recognition occurs when only one possibility remains. Such an exclusion process is only plausible if algorithms exist that enable each sample to exclude a large fraction of the possibilities on each cycle. Thus if 3/4 of the words are excluded on each cycle, only 1/16 will remain after two cycles and 1/64 after three, etc. As summarized in the next section, such efficient exclusion can be achieved.
The biophysical mechanism that underlies integration over multiple samples is the process that sustains the activity of nonexcluded words and the inactivity of excluded words. Such processes can be implemented by known physiological mechanisms, as will be discussed later. Although we have considered only integration for covert movements of attention, it may also occur during overt movements of attention that involve eye movements. In summary, the idea of high-level nodes (word nodes in this case) that are activated by context and deactivated by serially sampled sensory information provides an attractive and plausible scheme for integrating information.
Efficient exclusion
The efficiency of the exclusion process results from the computation of the feature probability landscape by T-D processing and the utilization of these probabilities in the guidance of selective attention. We have compared several different algorithms (Fig. 5) and found that the most efficient selects the feature for which there is the greatest mismatch between expectation and actuality (a feature that is "there" but has low probability or that is "not there" but has high probability, whichever is greatest). In information theory, this highly informative feature is termed the "Shannon Surprise" (Dayan and Abbott 2001). Almost as efficient as this "bidirectional mismatch" is the "unidirectional mismatch" algorithm, which selects the feature that is "there" and has the lowest probability. The most efficient algorithms that we have found are termed "greedy": each successive attentional movement is determined by selecting a feature with the greatest ability to eliminate words in that cycle, without "looking ahead" to consider the process as a whole. It can be shown that although the greedy strategy is not necessarily optimal for the overall process, a true optimal strategy is NP-complete (i.e., computationally intractable) and that the greedy algorithm is as close to optimal as achievable by any tractable strategy (Cohn).
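The connection between mismatch and information gained can be written out with the standard definition of surprisal. The following is a sketch under the model's assumption that all still-possible words are equally probable; the symbols $N$, $n_i$, $p_i$, and $S_i$ are notation introduced here for illustration and do not appear in the original text.

```latex
% Assumed notation: N still-possible words, n_i of which contain feature i.
\[
  p_i = \frac{n_i}{N}, \qquad
  S_i^{\text{there}} = -\log_2 p_i, \qquad
  S_i^{\text{not there}} = -\log_2 (1 - p_i)
\]
% Sampling a feature that is "there" leaves n_i = p_i N words, so the
% word-level information gained in that cycle equals the surprisal:
\[
  \log_2 N - \log_2 (p_i N) = -\log_2 p_i
\]
% Examples: p_i = 1/2 gives 1 bit per cycle; p_i = 1/4 gives 2 bits per cycle.
```

On this reading, choosing the "there" feature with the lowest probability, or the "not there" feature with the highest, is the same as choosing the observation with the largest surprisal in the current cycle.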
An important finding is that T-D connections that obey a simple connectivity rule provide a simple mechanism for computing the feature probability landscape required for efficient exclusion. A "compositional rule" defines the encoding of long-term memory into the hierarchical structure: a word node connects to all the letter nodes contained in that word; a letter node connects to all the features contained in that letter (Fig. 2). This connectivity rule allows linear synaptic summation processes to compute the feature probability landscape. Thus for instance, if only one word is active, the features contained in that word will be active and all others will be inactive. If two words are active, the probability landscape will be the superposition of the two words with features contained in both words being twice as active as features contained in only one. It follows that when many words are active, the probability of any given feature will be proportional to the number of these words that contain that feature. Importantly, the probability landscape can be based on a priori semantic and high-level contextual information (category) that is used to guide attention from the moment the stimulus arrives. As shown in Fig. 3, 1, the structure of English four-letter words is such that some features are less probable than others, for instance diagonals. It follows that if a diagonal is present in the stimulus, attention brought to this feature will be an effective way of excluding words.
The core prediction of the model is that during recognition of complex
visual stimuli exemplified by words, the window of attention undergoes rapid
covert movements to stimulus regions that are rich in information relevant to
recognition. There are no methods yet for tracking rapid shifts in covert
attention. However, these predictions could possibly be tested if the stimuli
were arranged so that recognition required observable shifts in eye position.
Indeed, experiments on eye movements during the viewing of natural scenes
indicate that eye movements are preferentially made to information-rich
regions (Mackworth and Morandi
1967
; Reinagel and Zador
1999
).
Dependence of recognition time on the number of stored words and on context
A fundamental property of the cortical recognition reproduced by our model
(Fig. 5) is that the time to
recognize a target depends logarithmically on the size of the set of possible
target items in long-term memory (Burrows
and Okada 1975
; Ross
1970
). This logarithmic dependence follows straightforwardly from
the idea that the underlying process is one of exclusion and that, on average,
a constant fraction of items is excluded on each cycle.
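The logarithmic dependence can be made explicit with a short derivation. The symbols below ($N$ for the number of stored items, $r$ for the fraction of candidates that survives each cycle, $k$ for the number of cycles, and $\tau$ for the cycle duration) are notation introduced here for illustration, and the assumption that $r$ is exactly constant is an idealization.

```latex
N r^{k} = 1
\quad\Longrightarrow\quad
k = \frac{\log N}{\log (1/r)},
\qquad
T_{\text{recognition}} \approx k\,\tau = \tau\,\frac{\log N}{\log (1/r)} .
```

Under this idealization, doubling the number of stored items adds a fixed increment $\tau \log 2 / \log(1/r)$ to the recognition time rather than doubling it.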
The dependence of recognition time on set size provides a simple
explanation for why contextual information can speed recognition time, as in
"semantic priming," (Lorch et
al. 1986
; Neely et al.
1989
; see Neely
1991
for a review). An example of such priming is shown in
Fig. 4A. When a
contextual cue is given that narrows the range of possible words to animal
words, the recognition of animal words is speeded. The reason is simply that
whenever the initial set of possible words is reduced, there is a consequent
reduction in the number of cycles required to exclude all but the one
presented. Our model is also able to account for why letter strings
statistically less similar to word patterns are classified as nonwords more
quickly than letter strings more similar to word patterns
(Table 1). Taken together,
these results show that several fundamental properties of word recognition are
captured by the model. Because the flow of information is bidirectional, the
architecture straightforwardly integrates high- and low-level information and
uses this information to control the movement of attention in an orderly
way.
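The effect of a contextual cue on the number of exclusion cycles can be sketched numerically. The set sizes and the assumption that exactly one quarter of the candidates survives each cycle are hypothetical values chosen so that the arithmetic is exact; they are not estimates taken from the model.

```python
def cycles_to_single_candidate(n_candidates, keep_fraction):
    """Count exclusion cycles until at most one candidate remains."""
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must lie strictly between 0 and 1")
    remaining = float(n_candidates)
    cycles = 0
    while remaining > 1.0:
        remaining *= keep_fraction  # a constant fraction survives each cycle
        cycles += 1
    return cycles


# Hypothetical set sizes: the full lexicon versus a cue-restricted subset.
full_lexicon = cycles_to_single_candidate(1024, 0.25)
animal_words = cycles_to_single_candidate(64, 0.25)
print(full_lexicon, animal_words)  # prints: 5 3
```

In this toy case, a cue that reduces the candidate set 16-fold saves two cycles, which is the sense in which narrowing the initial set speeds recognition.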
Neural plausibility of the model
One objection to any serially organized process could be that it takes too
long to be a realistic component of a fast recognition process. The available
data indicate that covert attention (without eye movements) can be moved about
once every 20–30 ms (i.e., 33–50 times/s)
(Horowitz and Wolfe 1998
).
Thus if it takes an average of 4.9 cycles for recognition of a four-letter
word to occur, the time required is 100–150 ms. This speed is in
reasonable agreement with what is known about the speed of recognition during
reading: the average reader makes about one saccade per word, bringing it into
the fovea for a fixation that lasts ~250 ms
(Starr and Rayner 2001
). There
is thus ~250 ms of processing time in cortex between the arrival of
information about sequential words. These results are thus compatible with the
idea that word recognition could involve five covert attentional shifts, each
requiring 20–30 ms. It should be noted that the time required for
specific item recognition is somewhat longer than that required for the
simpler task of two-choice categorization
(Gleitman and Jonides 1976
;
Jonides and Gleitman 1976
;
Thorpe et al. 1996
).
A second related objection is that information flow between cortical
regions may be too slow to allow T-D/B-U processing within a 25-ms iterative
cycle. Relevant to this issue is the recent measurement of individual steps in
cortical transmission. It has been shown that the time it takes for
information to travel between cortical areas is <2 ms
(Domenici et al. 1995
;
Movshon and Newsome 1996
;
Girard et al. 2001
). When
feedforward information arrives at a cortical area, it is generally received
by layer 4 cells, which then excite layer 2/3 cells. It is these cells that,
in turn, can transmit information back to lower areas or up to higher areas.
Using paired recording, the time for information transmission from layer 4 to
layer 2/3 cells was measured as <2 ms
(Silver et al. 2001
). The
short range of these times indicates that substantial T-D/B-U cortical
processing can occur in 25 ms. If it is assumed that it takes ~100 ms for
visual signals to reach V1 after a word is displayed, followed by five
cortical cycles at 25 ms/cycle, which would take 125 ms, the total time to
recognition would be 225 ms. (It should also not be excluded that important
bidirectional processing occurs within a given region of cortex; indeed, simple
and complex cells are found in V1.)
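The timing estimates given in response to the first two objections reduce to simple arithmetic, collected here as a sketch. All numbers are taken from the text (a shift every 20 to 30 ms, a mean of 4.9 cycles, an assumed 100 ms latency to V1, and five nominal cycles of 25 ms); the variable names are illustrative only.

```python
CYCLE_MS_RANGE = (20, 30)   # duration of one covert attentional shift, in ms
MEAN_CYCLES = 4.9           # mean number of cycles for a four-letter word
V1_LATENCY_MS = 100         # assumed delay before visual signals reach V1
NOMINAL_CYCLES = 5
NOMINAL_CYCLE_MS = 25
FIXATION_MS = 250           # approximate fixation duration during reading

# Shifts per second implied by the cycle duration (slowest first).
shift_rate_hz = tuple(round(1000 / ms) for ms in reversed(CYCLE_MS_RANGE))

# Time spent on attentional cycles alone for an average word.
recognition_ms = tuple(round(MEAN_CYCLES * ms) for ms in CYCLE_MS_RANGE)

# Latency to V1 plus five cycles at the nominal cycle duration.
total_ms = V1_LATENCY_MS + NOMINAL_CYCLES * NOMINAL_CYCLE_MS

print(shift_rate_hz)            # (33, 50)
print(recognition_ms)           # (98, 147)
print(total_ms)                 # 225
print(total_ms <= FIXATION_MS)  # True
```

The total of 225 ms falls within the fixation time available per word, which is the basis for the compatibility argued above.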
A third objection is that additional time or cycles may be required to
perform other computations not discussed here, for example to transform the
image, e.g., to correct for angle or skew. Although the experimental data that
we have sought to explain do not deal with transformed images, it should be
noted that in experiments dealing with recognition of rotated objects, there
is in fact an increase in the time required for recognition
(Shepard and Metzler
1971
).
A fourth objection is that while T-D may be important in setting context, recognition can then proceed using purely B-U processing, with no updating by T-D information. Several counterarguments can be given: 1) The physiological data point to the fact that during recognition there is activity in the cells that give rise to T-D connections. It thus seems conservative to argue that this T-D flow has functional consequences. 2) Recognition can be shaped by rapid changes in context (e.g., instructions) that occur too fast to shape B-U processing by changing connections. 3) In a purely feedforward system, attention must be held hostage to salient environmental cues and cannot therefore be directed by high-level constraints that could direct attention to important but low-saliency information. Yet tasks of this type can be performed. And 4) an SAA that moves attention without considering T-D information ("random SAA") is inefficient compared with SAAs that do.
The T-D/B-U signal flow in our model is organized into cycles of ~25 ms
that might be detectable as oscillations. Brain oscillations in the gamma
range (30–80 Hz) might reflect the cyclical organization of T-D/B-U
processing. Such oscillations have been observed during the recognition
process (Garrett et al. 2000
;
Lutzenberger et al. 1994
) and
have been linked to attention (Fries et
al. 2001
). It has generally been thought that gamma oscillations
have a different role, specifically as an organizer of a binding process
(Engel and Singer 2001
), but
some evidence against this has emerged
(Lamme and Spekreijse 1998
).
Experiments that distinguish between these functional roles are needed. If
indeed gamma oscillations are indicative of cyclical T-D/B-U processing, their
amplitude in lower level areas should be reduced by inactivation of higher
level areas.
An important question that relates to the current model is whether it could
be implemented by plausible cellular and network mechanisms. Most aspects of
the model utilize standard neural network principles and do not require
extensive comment. For instance, the bottom-up formation of feature cells
could be as in the Hubel and Wiesel model
(Reid 2001
). Similarly, the
B-U process by which feature nodes excite letter nodes and letter nodes excite
word nodes depends on standard linear synaptic interactions, as do the inverse
T-D processes. The types of B-U and T-D parallel processing required are
straightforward to implement in parallel by suitably connected networks and
are not computationally expensive. Finally, mechanisms based on reverberatory
processes can make neurons bistable (Wang
2001
); such processes could keep word nodes active until activity
is quenched by inhibition from word nodes that receive B-U support. The
decreased number of active word nodes would make the remaining ones receive
less lateral inhibition and thus achieve a higher firing rate. This could be
the basis of a normalization process at the word level. The recently
discovered large network of electrically coupled cortical interneurons
(Beierlein et al. 2000) is a potential mechanism for providing the inhibition
required to perform this computation.
The gating cells (Fig. 1)
that select the site to which attention is moved (e.g., a feature that is
there but has low T-D probability) may function according to simple neural
mechanisms. Mismatch can be computed by gating cells that are excited if the
feature is present and inhibited in proportion to the probability of that
feature as determined by T-D processing. The maximum mismatch can be
found by an oscillatory winner-take-all mechanism
(Lisman 1998
) based on
negative feedback inhibition from a global network of interneurons
(Beierlein et al. 2000
).
Specifically, as global inhibition wanes on each cycle, the cell with the
greatest mismatch fires first and then inhibits all the others.
Winner-take-all processes have been previously implicated in visual search, as
first proposed by Koch and Ullman
(1985
).
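The gating-cell computation and the oscillatory winner-take-all selection described above can be sketched as follows. This is one illustrative reading, not the implementation used in the model: the unit gain on the T-D inhibition, the linear decay of global inhibition within a cycle, and the feature names are assumptions.

```python
def gating_mismatch(present_features, td_probability):
    """Excitation of 1.0 for a present feature minus inhibition equal to its T-D probability."""
    return {f: 1.0 - td_probability.get(f, 0.0) for f in present_features}


def oscillatory_winner(mismatch, steps=100):
    """Lower the global inhibition step by step; the first cell to exceed it wins."""
    for step in range(steps + 1):
        inhibition = 1.0 - step / steps
        firing = [f for f, m in mismatch.items() if m > inhibition]
        if firing:
            # Ties within one step are broken in favor of the larger mismatch.
            return max(firing, key=mismatch.get)
    return None  # no cell has a positive mismatch


present = {"diagonal", "vertical", "horizontal"}
td = {"diagonal": 0.1, "vertical": 0.9, "horizontal": 0.5}
winner = oscillatory_winner(gating_mismatch(present, td))
print(winner)  # diagonal
```

The feature that is present but least expected from T-D input (here the diagonal) crosses the falling inhibition first, so attention would be moved to it.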
Relationship to physiological data
The model makes several predictions about neurons in higher cortical
regions that are critical for recognition. As shown in
Fig. 4B, the word node
that represents the displayed item will gradually increase its probability as
the sampling process excludes other words. On the assumption that probability
would be represented by the firing rate of a neuron, the firing rate would be
expected to gradually rise during recognition. Consistent with this, neurons
in the inferior temporal cortex, a high-level region critically involved in
item recognition, gradually increase their firing rate over >150 ms during the
recognition process (Chelazzi et al.
1998
; Desimone
1998
). Furthermore both the model and data show that cells not
well tuned to the stimulus either immediately undergo a decrease in firing
rate or initially increase their firing rate along with the well tuned cells
but then show a delayed drop. According to the model, the delay at which this
divergence occurs should increase, the more similar the cell's tuning to the
stimulus. This has not yet been examined in temporal cortex but has been seen
in frontal cortex neurons (Bichot et al.
1999
). A final prediction is that the low basal firing rates of
neurons (before a stimulus) should not be considered noise but rather a
representation of the low but finite probability of a contextually possible
item. According to this view, if contextual information lowers the probability
of the object represented by the cell, the cell's firing rate should go down
(Fig. 4A). Such
effects of context on baseline firing have been observed in temporal cortex
(reviewed in Desimone
1998
).
Perception as a construction of a model of the stimulus
The "figural synthesis" model of perception
(Neisser 1967
) was inspired by
Hebb's (Hebb 1949
) comparison
of the perceiver with a paleontologist who extracts a few bones from a mass of
irrelevant rubble and "reconstructs" the dinosaur. In the
"figural synthesis model," Neisser proposed that focal attention
involves a similar sparse sampling. The perceptual object is then
reconstructed from this limited sampling and information stored in long-term
memory. Similarly, in our model, recognition occurs gradually as the number of
sampled and inferred features increases until eventually there is a perfect
correspondence between features constructed by T-D computations and the word
itself (Fig. 3, 5). An
extreme case occurs in the situation where contextual information provided by
previous words indicates the single word that is likely to come next. The T-D
excitation will create a "model" of the word and if no mismatch is
detected, the model can be determined to be correct, a process that occurs in
only a single cycle.
The relationship of the T-D-computed model to perception may also be
relevant when considering how nontarget words are perceived during rapid
visual search through lists. Nontargets with low approximation to English are
classified more rapidly than nontargets that are words
(Graboi 1974
), an effect
reproduced in the model (Table
1). In such experiments, subjects report that whereas targets are
clearly perceived, nontargets appear as a blur but are nevertheless correctly
rejected (Briggs and Blaha
1969
; Cavanaugh and Chase
1971
; Gould and Carn
1973
; Graboi 1971
;
Neisser 1963
;
Neisser et al. 1963
). We have
found that during the progressive exclusion of nonwords there is often a
direct transition from a state with several remaining words to a state with no
remaining words. Thus the last feature probability landscape is the
superposition of several still possible words, which may give rise to the
perception of a "blur."
Extending the model
A simplification of the current model is that all words are assumed to have
equal probability. More efficient recognition of real text would be possible
if the model were modified to take word frequency into account. It is
important to emphasize that we have modeled the worst-case assumption in which
the window of attention is wide enough for only a single feature. Recognition
might be speeded significantly if this window is widened. One way to
effectively widen the window would be to pick the most informative feature in
each letter subframe and then process the four selected features in
parallel (cf., Phaf et al.
1990
) ("heterarchical processing"). If our model were
modified in this way, the information acquisition would be increased from 2 to
8 bits/cycle, i.e., processing speed would be increased by a factor of 64.
However, there is clearly a limit to how wide the window can be made. In
general, the small aperture of a single-feature window has the virtue of high
selectivity; this results in improved noise immunity. We emphasize that many
of the difficulties associated with real-world vision are not dealt with in
our model. For instance, if visual noise were present, it would be rapidly
detected by our algorithm since it detects features that are there but not
expected in the context of the current set; this would lead incorrectly to
rejection of all possible words. One modification of the SAA that would
minimize such errors would be to require that the T-D probability exceed a low
threshold value. Thus features that are not contextually relevant and have
zero or very low probability could be present but not attract attention (in
this sense attention to specific features is "filtered" by high-level
contextual constraints). A second difficulty with real-world vision is
missing or occluded features. In this regard, the unidirectional SAA,
T-D[lowest P] and B-U[There], seems preferable because it does not select
missing features and thus could produce recognition even when some features
are occluded. Other models of recognition have successfully shown how
hierarchically organized feed-forward neural networks can deal with some of
the difficult problems of position and scale invariance
(Fukushima 1986
;
Mel et al. 1998
;
Olshausen et al. 1993
;
Riesenhuber and Poggio 1999
).
It would seem useful to seek hybrid models that combine some of the power of
feedforward processing with the T-D control of attention described here.
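The threshold modification suggested above for noisy input can be sketched as a small change to the selection rule. The function name, the threshold of 0.05, and the feature labels are hypothetical; the sketch only illustrates that a present feature with zero T-D probability (e.g., noise) is ignored rather than selected.

```python
def select_feature(present_features, td_probability, threshold=0.05):
    """Pick the present feature with the lowest T-D probability of at least `threshold`."""
    eligible = {
        f: td_probability.get(f, 0.0)
        for f in present_features
        if td_probability.get(f, 0.0) >= threshold
    }
    if not eligible:
        return None  # nothing contextually relevant is present
    return min(eligible, key=eligible.get)


present = ["noise_speck", "diagonal", "vertical"]
td = {"diagonal": 0.2, "vertical": 0.8}  # noise_speck has zero T-D probability

print(select_feature(present, td))                 # diagonal
print(select_feature(present, td, threshold=0.0))  # noise_speck
```

With the threshold in place, attention goes to the informative diagonal. With the threshold set to zero, the noise feature is selected and, as noted above, would lead to the incorrect rejection of all candidate words.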
| FOOTNOTES |
|---|
Address for reprint requests: D. Graboi, 1314 Desert Rose Way, Encinitas, CA 92024 (E-mail: dgraboi{at}cts.com).
| REFERENCES |
|---|
Barone P, Batardiere A, Knoblauch K, and Kennedy H. Laminar distribution of neurons in extrastriate areas projecting to visual areas V1 and V4 correlates with the hierarchical rank and indicates the operation of a distance rule. J Neurosci 20: 3263–3281, 2000.
Beierlein M, Gibson JR, and Connors BW. A network of electrically coupled interneurons drives synchronized inhibition in neocortex. Nat Neurosci 3: 904–910, 2000.
Bichot NP, Cave KR, and Pashler H. Visual selection mediated by location: feature-based selection of noncontiguous locations. Percept Psychophys 61: 403–423, 1999.
Briggs GE and Blaha J. Memory retrieval and central comparison times in information processing. J Exp Psychol 79: 395–402, 1969.
Broadbent D and Broadbent MH. Human attention: the exclusion of distracting information as a function of real and apparent separation of relevant and irrelevant events. Proc R Soc Lond B Biol Sci 242: 11–16, 1990.
Burrows D and Okada R. Memory retrieval from long and short lists. Science 188: 1031–1032, 1975.
Campbell FW. How much of the information falling on the retina reaches the visual cortex and how much is stored in the visual memory? In: Pattern Recognition Mechanisms, edited by Chagas C and Gross C. Berlin, Germany: Springer, 1985, p. 83–95.
Caputo G and Guerra S. Attentional selection by distractor suppression. Vision Res 38: 669–689, 1998.
Cauller LJ and Kulics AT. The neural basis of the behaviorally relevant N1 component of the somatosensory-evoked potential in SI cortex of awake monkeys: evidence that backward cortical projections signal conscious touch sensation. Exp Brain Res 84: 607–619, 1991.
Cavanaugh JP and Chase WG. The equivalence of target and nontarget processing in visual search. Percept Psychophys 9: 493–495, 1971.
Cave KR. The FeatureGate model of visual selection. Psychol Res 62: 182–194, 1999.
Chelazzi L, Duncan J, Miller EK, and Desimone R. Responses of neurons in inferior temporal cortex during memory-guided visual search. J Neurophysiol 80: 2918–2940, 1998.
Chun MM. Visual attention. In: Blackwell's Handbook of Perception, edited by Goldstein EB. Oxford, UK: Blackwell, 2001, p. 272–310.
Cohn M. On the computational complexity of a vision task [Online]. Brandeis Computer Science Technical Report CS-02-229. (http://www.cs.brandeis.edu/~marty) [2002, July].
Dayan P and Abbott LF. Entropy. In: Theoretical Neuroscience - Computational and Mathematical Modeling of Neural Systems, edited by Sejnowski TJ and Poggio T. Cambridge, MA: The MIT Press, 2001, p. 124.
Dayan P, Hinton GE, Neal RM, and Zemel RS. The Helmholtz machine. Neural Comput 7: 889–904, 1995.
Desimone R. Visual attention mediated by biased competition in extrastriate visual cortex. Philos Trans R Soc Lond B Biol Sci 353: 1245–1255, 1998.
Domenici L, Harding GW, and Burkhalter A. Patterns of synaptic activity in forward and feedback pathways within rat visual cortex. J Neurophysiol 74: 2649–2664, 1995.
Egeth HE and Yantis S. Visual attention: control, representation, and time course. Annu Rev Psychol 48: 269–297, 1997.
Engel AK and Singer W. Temporal binding and the neural correlates of sensory awareness. Trends Cognit Sci 5: 16–25, 2001.
Felleman DJ and Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb Cortex 1: 1–47, 1991.
Fries P, Reynolds JH, Rorie AE, and Desimone R. Modulation of oscillatory neuronal synchronization by selective visual attention. Science 291: 1560–1563, 2001.
Fukushima K. A neural network model for selective attention in visual pattern recognition. Biol Cybern 55: 5–15, 1986.
Garrett AS, Flowers DL, Absher JR, Fahey FH, Gage HD, Keyes JW, Porrino LJ, and Wood FB. Cortical activity related to ac