This study investigated how neuronal activity in orbitofrontal cortex related to the expectation of reward changed while monkeys repeatedly learned to associate new instruction pictures with known behavioral reactions and reinforcers. In a delayed go-nogo task with several trial types, an initial picture instructed the animal to execute or withhold a reaching movement and to expect a liquid reward or a conditioned auditory reinforcer. When novel instruction pictures were presented, animals learned according to a trial-and-error strategy. After experience with a large number of novel pictures, learning occurred in a few trials, and correct performance usually exceeded 70% in the first 60–90 trials. About 150 task-related neurons in orbitofrontal cortex were studied in both familiar and learning conditions and showed two major forms of changes during learning. Quantitative changes of responses to the initial instruction were seen as appearance of new responses, increase of existing responses, or decrease or complete disappearance of responses. The changes usually outlasted initial learning trials and persisted during subsequent consolidation. They often modified the trial selectivities of activations. Increases might reflect the increased attention during learning and induce neuronal changes underlying the behavioral adaptations. Decreases might be related to the unreliable reward-predicting value of frequently changing learning instructions. The second form of changes reflected the adaptation of reward expectations during learning. In initial learning trials, animals reacted as if they expected liquid reward in every trial type, although only two of the three trial types were rewarded with liquid. In close correspondence, neuronal activations related to the expectation of reward occurred initially in every trial type. 
The behavioral indices for reward expectation and their neuronal correlates adapted in parallel during the course of learning and became restricted to rewarded trials. In conclusion, these data support the notion that neurons in orbitofrontal cortex code reward information in a flexible and adaptive manner during behavioral changes after novel stimuli.
The adaptation of behavioral reactions to novel environmental events appears to constitute an important process in the control of goal-directed behavior. It is conceivable that brain systems involved in controlling behavior also are engaged in learning and adaptation of this behavior. The frontal lobe controls several aspects of voluntary behavior at a very high level (Fuster 1989; Passingham 1993; Roberts et al. 1998). Previous studies have shown that neurons in several areas of the frontal lobe are activated in close relation to changes in voluntary behavior during learning (Chen and Wise 1995a,b; Mitz et al. 1991; Niki et al. 1990; Rolls et al. 1996; Watanabe 1990).
Rewards play a central role in mediating voluntary behavior (Dickinson and Balleine 1994). The orbital part of prefrontal cortex is involved importantly in the processing of reward information and the motivational control of behavior (Damasio 1994; Rolls 1996). Patients with lesions of orbitofrontal cortex show impairments when making decisions about the expected outcome of actions (Bechara et al. 1998). Monkeys with orbitofrontal lesions respond abnormally to changes in reward contingencies (Dias et al. 1996; Iversen and Mishkin 1970), and neurons in this area show activity related to the delivery and expectation of reward (Thorpe et al. 1983; Tremblay and Schultz 2000).
The present study investigated how reward-related activity in the orbitofrontal cortex changed during the repeated learning of novel, reward-predicting visual stimuli. We employed a learning set situation in which only a single task component changed and animals learned the new reward contingencies within a few trials (Gaffan et al. 1988; Harlow 1949). Similar to a learning study with striatal neurons (Tremblay et al. 1998), this procedure allowed us to investigate single neurons during complete courses of learning in comparison with familiar performance. The learning experiment involved a delayed go-nogo paradigm that typically tests the functions of the frontal cortex in behavioral control, working memory, and movement preparation and inhibition (Iversen and Mishkin 1970; Jacobsen and Niessen 1937). An instruction picture predicted the reinforcer to be expected and the behavioral reaction to be performed. Novel instruction pictures were presented in learning trials, and animals learned their reward prediction and behavioral significance by trial and error. These data were presented previously as an abstract (Tremblay and Schultz 1996).
We used the same two Macaca fascicularis monkeys (A and B) with mostly the same experimental procedures as described in the preceding report (Tremblay and Schultz 2000). Animal A had served before for recordings in the striatum in the same task (animal B of Hollerman et al. 1998).
In the behavioral task, one of three colored instruction pictures was presented on a computer monitor in front of the animal for 1.0 s and specifically indicated one of three trial types (rewarded movement, rewarded nonmovement, and unrewarded movement). A red trigger stimulus, identical in each trial type, was presented at random 1.5–2.5 s after instruction offset and elicited the movement or nonmovement reaction as instructed. In rewarded-movement trials, the animal released a resting key and touched a small lever below the trigger to receive a small quantity of apple juice (0.15–0.20 ml) after a delay of 1.5 s. In rewarded-nonmovement trials, the animal remained motionless on the resting key for 2.0 s after the trigger and received the same liquid reward after a further 1.5 s. In unrewarded-movement trials, the animal reacted as in rewarded-movement trials, but correct performance was followed by a 1-kHz sound. The sound constituted a conditioned auditory reinforcer, as it visibly helped the animal to perform the task, but it was not an explicit reward, hence the simplifying term “unrewarded” movements. Thus each instruction indicated in each trial the behavioral reaction (execution or withholding of movement) and predicted the reinforcer (liquid or sound). Correctly performed unrewarded movements were followed by one of the rewarded trials. Apart from that, trial types alternated semirandomly with consecutive trials of the same type restricted to three rewarded movement trials, two nonmovement trials, and one unrewarded movement trial. Trials lasted 11–13 s, intertrial intervals were 4–7 s.
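The semirandom alternation of trial types described above can be sketched as follows. This is a minimal illustration, not the authors' task-control software: the names `MAX_RUN`, `next_trial_type`, and `make_sequence` are hypothetical, the uniform choice among permitted trial types is an assumption (the text gives no selection probabilities), and every trial is assumed to be performed correctly, so the repetition of erroneous trials is not modeled. Because at most one unrewarded-movement trial may occur in a row, each such trial is necessarily followed by one of the rewarded trial types.

```python
import random

# Maximum number of consecutive trials of each type, as stated in the text.
MAX_RUN = {
    "rewarded_movement": 3,
    "rewarded_nonmovement": 2,
    "unrewarded_movement": 1,
}


def next_trial_type(history, rng):
    """Choose the next trial type given the list of preceding trial types."""
    candidates = []
    for trial_type, limit in MAX_RUN.items():
        # Count how many trials of this type end the history so far.
        run = 0
        for previous in reversed(history):
            if previous != trial_type:
                break
            run += 1
        if run < limit:
            candidates.append(trial_type)
    # Assumption: uniform choice among the permitted types.
    return rng.choice(candidates)


def make_sequence(n_trials, seed=0):
    """Generate a semirandom sequence of trial types respecting MAX_RUN."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        history.append(next_trial_type(history, rng))
    return history
```

At most one trial type can have a run in progress at any moment, so at least two candidate types always remain and the choice never fails.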
During learning, one new instruction picture was presented in each of the three trial types, whereas all other task events and the global task structure remained unchanged. Learning consisted of associating each new instruction picture with the execution or withholding of movement and with liquid reward versus the conditioned auditory reinforcer. Whereas each familiar instruction consisted of a single fractal image (Fig. 1, top), each learning instruction was composed of two partly superimposed, simple geometric shapes that were drawn randomly from 256 images (64 shapes having 1 of the colors red, yellow, green, or blue), resulting in a total of 65,536 possible stimuli (Fig. 1, middle and bottom). A color subtraction mode produced composite pictures with up to five colors on the computer monitor.
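The size of the stimulus space follows from simple counting, which can be checked with a short sketch. The variable names are illustrative only. The stated total of 65,536 equals 256 × 256, which corresponds to counting ordered pairs of images with repetition allowed; that reading is an inference from the arithmetic, not something the text states about how pairs were drawn.

```python
from itertools import product

N_SHAPES = 64
COLORS = ("red", "yellow", "green", "blue")

# Each single image is one (shape, color) combination: 64 x 4 = 256.
images = list(product(range(N_SHAPES), COLORS))

# A learning instruction superimposes two images. Counting ordered pairs
# with repetition allowed (an assumption) reproduces the stated total.
n_stimuli = len(images) ** 2
```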
Animals first learned all three trial types with familiar pictures at >90% correct. Then the instruction stimuli for each of the three trial types were replaced successively by learning stimuli. When >90% correct performance was reached with a set of three learning stimuli, those stimuli were discarded and three new stimuli were introduced successively, until a total of ∼15 new images had been learned. This stage lasted ∼1.5 mo. Subsequently, new learning stimuli for all three trial types were introduced at once as a new problem, and monkeys A and B were trained for a further 1.5 and 0.5 mo, respectively. Previously presented learning stimuli never were reused with the animal. Errors in behavioral performance cancelled all further signals in a trial, including reward. Learning was facilitated by repeating erroneous trials until correct performance was obtained.
Activity of single neurons was recorded with moveable microelectrodes from histologically reconstructed positions in left orbitofrontal cortex during contralateral task performance, together with activity from the right forearm extensor digitorum communis and biceps muscles. Every neuron was tested in 40–80 trials with the three familiar instructions and, in separate blocks, with >60 learning trials using three novel instruction pictures. The sequence of testing varied randomly between familiar and learning blocks. Task-related activations in individual neurons were assessed with the one-tailed Wilcoxon test (P < 0.01) incorporated in the evaluation software. Neurons not activated in familiar or learning trials are not reported. Differences in magnitudes and latencies of task-related changes between familiar and learning trials were assessed in individual neurons with the two-tailed Mann-Whitney U test (P < 0.05). Movement parameters were evaluated in terms of reaction time (from trigger onset to key release), movement time (from key release to lever touch), and return time (from lever touch back to touch of resting key). They were compared on a trial-by-trial basis with the Kolmogorov-Smirnov test (P < 0.001).
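As an illustration of the comparison between familiar and learning trials, a two-tailed Mann-Whitney U test can be sketched in plain Python. The evaluation software used in the study is not described in detail, so this is not a reproduction of it: the function name `mann_whitney_u` is hypothetical, and the P value here relies on the tie-corrected normal approximation without continuity correction, which is an assumption and is only appropriate for reasonably large samples (exact tables are preferable for very small ones).

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-tailed P) using the normal approximation.

    Raises ZeroDivisionError if all pooled values are identical.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign midranks to tied values and accumulate the tie correction term.
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    # U statistic for sample x from its rank sum.
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_tailed
```

In use, `x` and `y` would hold a response measure per trial (for example, spike counts in a fixed window after the instruction) for the familiar and learning blocks of one neuron, and the difference would be accepted at P < 0.05 as in the text.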
The repeated learning of new stimuli within the same task structure resulted in a learning set in which each problem of three new stimuli was learned rapidly. At the onset of neuronal recordings, animals A and B had learned 625 and 77 problems, respectively. The first instruction in a block of new pictures could be of any of the three trial types. Monkey A reacted to this entirely unknown instruction randomly or by withholding the movement, whereas monkey B reacted preferentially with a movement. The trial outcome after the reaction to the first instruction (reward or sound if correct, no reinforcement if erroneous) influenced the animals' reaction to the subsequent instruction, and trial outcomes with each subsequent instruction influenced the following behavioral reactions. This learning behavior resulted, on average, in above-chance performance already with the first new instruction of each trial type (Fig. 2 A), as two of three instructions of a new set already were preceded by at least one instruction and outcome. Thus learning occurred largely within the first trials and approached an asymptote within 5–10 trials of each type (Fig. 2 A). Medians of correct performance in the first 15 trials in the three trial types were, respectively, 87, 80, and 93% (monkey A) and 73, 73, and 87% (monkey B). Performance in familiar trials exceeded 98% in both animals. Learning curves were stable during the period of neuronal recordings in animal A (Fig. 2 B) but became steeper with increasing learning experience in animal B.
Rewarded movements in familiar trials showed consistently shorter reaction times and longer return times (from lever touch back to resting key), as compared with unrewarded movements (Fig. 3, top). This allowed identification of typical rewarded versus unrewarded movements. Thus animals kept their hand on the lever until the reward was delivered with rewarded movements, whereas they left the lever before the sound after unrewarded movements. With new stimuli, median reaction times in the first movement trial were intermediate between the two movement trial types. By contrast, median return times were initially typical for rewarded movements in both types of movement trials. All parameters became distinguishable and typical for the two types of movement trial after the first few correct trials of each type (P < 0.001) (Fig. 3, bottom). Erroneous movements in nonmovement trials usually were performed with return times typical for rewarded movements.
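The three movement parameters compared here are differences between successive event times within a trial. A minimal sketch of that bookkeeping is given below; the class name `MovementTrial`, its field names, and the use of integer millisecond timestamps are assumptions made for illustration and do not describe the actual recording software.

```python
from dataclasses import dataclass


@dataclass
class MovementTrial:
    """Event times of one movement trial, in ms (hypothetical layout)."""
    trigger_onset_ms: int
    key_release_ms: int
    lever_touch_ms: int
    key_return_ms: int

    @property
    def reaction_time(self) -> int:
        # From trigger onset to release of the resting key.
        return self.key_release_ms - self.trigger_onset_ms

    @property
    def movement_time(self) -> int:
        # From key release to lever touch.
        return self.lever_touch_ms - self.key_release_ms

    @property
    def return_time(self) -> int:
        # From lever touch back to touch of the resting key.
        return self.key_return_ms - self.lever_touch_ms
```

With timestamps of this kind, a long `return_time` would mark a movement performed as if liquid reward were expected, and a short one a movement performed as if only the sound were expected.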
In rewarded-movement trials, forearm muscle activity between key release and return to the resting key resembled that seen in familiar trials and remained stable over consecutive learning trials (Fig. 4, left). In unrewarded-movement trials, muscle activity was frequently typical of rewarded movements in initial learning trials and later approached a pattern typical for unrewarded movements with shorter return times (Fig. 4, bottom right). Thus movement parameters and muscle activity in initial learning trials were typical for rewarded movements and subsequently differentiated between the two reinforcers.
Overview of neuronal changes
Performance of the task with familiar instruction pictures resulted in three main types of task relationships in orbitofrontal neurons, namely responses to instruction stimuli, activations preceding the reinforcers, and responses to the reinforcers (Fig. 5) (Tremblay and Schultz 2000). Most activations occurred in both rewarded trial types irrespective of movement but not in unrewarded-movement trials or only in one of the three trial types.
A total of 148 neurons with statistically significant activations of one or two of the three main types were recorded in rostral area 13, entire area 11, and lateral area 14 of orbitofrontal cortex and were tested in both familiar and learning trials. Other task relationships occurred rarely and were not tested during learning. Responses to instructions or reinforcers had median latencies and durations ranging from 90 to 110 ms and from 280 to 700 ms, respectively. Activations preceding reinforcers began 1,188–1,282 ms before these events and subsided <500 ms after them. These values varied insignificantly between familiar performance and learning (P > 0.05; Mann-Whitney U test).
Responses to instructions
A total of 93 neurons responded to instructions in familiar or learning trials. Responses to familiar instructions were stable during repeated testing blocks in all 15 neurons examined.
Response magnitudes in 56 neurons increased significantly during learning as compared with familiar trials (Table 1). In a first form, increases consisted of appearance of activations in 18 neurons that did not respond in any familiar trial type (Fig. 6). In a second form, increases consisted of additional responses in one or two trial types in 24 neurons that responded selectively in one or two familiar trial types. This reduced the trial selectivity in these neurons. The additional activations began with the first learning trial. They were reproducible in 12 of 13 neurons tested in two learning problems, the exceptional neuron showing varying increases and decreases. In the example of Fig. 7, the neuron showed a solid response in familiar rewarded-movement trials, but only mild or no responses in the other familiar trial types (A–C). During learning, response latency shortened, and additional responses occurred in nonmovement trials (E). The neuron also showed strong responses in initial unrewarded-movement trials, which were performed with parameters of rewarded movements (F). Responses decreased progressively during learning, whereas movement parameters became appropriate for unrewarded movements. The increases were reproduced in a second learning problem with more rapid learning (G–I).
In a third form seen in 14 neurons responding unselectively to all three familiar instructions, existing responses increased significantly by a median of 54%, and latencies shortened occasionally (Fig. 8).
Response magnitudes in 26 neurons decreased significantly during learning, as compared with familiar trials (by a median of 55%; Table 1). The decreases were reproducible in three of four neurons tested in two learning problems (Fig. 9), the exceptional neuron showing varying increases and decreases. As in 10 other neurons, responses occurred only in unrewarded movement trials and were lost during learning. Response decreases persisted beyond the adaptation of movement parameters. Decreases also were seen in six other neurons responding in both rewarded trial types during familiar performance.
Average population histograms indicated that existing instruction responses mainly increased in duration after the initial peak during learning (Fig. 10, top), whereas decreases resulted only in a slightly shortened population response (middle). The averaging of both increases and decreases resulted only in a minor learning effect (bottom).
Activations preceding reinforcers
Activations preceding reinforcers were observed in 39 neurons. They began after the trigger stimulus and lasted until liquid reward or sound was delivered. These activations occurred either in one or both rewarded trial types and not in unrewarded movement trials or exclusively in unrewarded movement trials (Table 1).
Most of these activations were maintained during learning (23 neurons). Activations occurred also in initial unrewarded movement trials that were performed with parameters of rewarded movements and subsided once the animals performed the unrewarded movement with appropriate parameters (Fig. 11). Activations occurred also with erroneous movements in nonmovement trials, consistent with the frequent treatment of all initial movements as rewarded. Comparable adaptations in the opposite direction were observed with activations occurring exclusively in unrewarded movement trials (Fig. 12). Activations were absent in initial unrewarded movement trials during learning and appeared only after a few trials. Activations in this neuron were also absent in error trials.
In different forms of changes, nine neurons showed significantly increased activations irrespective of the adaptation of movement parameters (by a median of 47%), including five neurons that were activated exclusively during learning. Seven neurons showed decreased activations (median 67%).
Responses to reinforcers
During familiar performance, 44 neurons responded to liquid reward in either or both rewarded trials and none to the auditory reinforcer in unrewarded movement trials (Table 1). Most reward responses were maintained unaltered during learning and occurred in both rewarded trials but not in unrewarded movement trials (28 neurons) (Fig. 13). Responses in 10 neurons were significantly increased (by a median of 80%) and in 6 decreased (median 67%). Four neurons responded to rewards exclusively during learning.
These data show that task-related neuronal activations in the orbitofrontal cortex undergo changes when animals rapidly learn new stimuli. Responses to the reward-predicting instructions were frequently increased or decreased. Some neurons responded exclusively during learning or, conversely, became entirely unresponsive. The changes began during the initial learning period and extended into the subsequent consolidation phase. Activations preceding the reward, apparently related to reward expectation, adapted together with the animals' reward expectations, as evidenced by their behavioral reactions. Although reinforcers play different roles during learning than during established task performance, responses following the reinforcers were largely unchanged. Thus orbitofrontal neurons show modified activity when expectations of upcoming rewards change.
Changes in reward expectation
In most learning situations, only a part of the environment changes, whereas the other task components remain unmodified and learning advances relatively rapidly. Our learning paradigm was based on a conditional, delayed go-nogo task and consisted of changing a single task component, the reward-predicting and movement preparatory instruction stimulus. Thus animals learned new associations between visual stimuli and behavioral reactions from their general task experience and without explicit sensorimotor guidance. Similar conditional delay tasks were used in learning set situations in prefrontal and premotor cortex (Mitz et al. 1991; Niki et al. 1990; Watanabe 1990) and supplementary and frontal eye fields (Chen and Wise 1995a,b, 1996). Whereas the previous studies concerned mainly changes of behavior-related activity during learning, our present experiments and our previous study on striatum (Tremblay et al. 1998) particularly investigated the changes of reward expectation during learning.
During initial learning trials, unrewarded movements often showed parameters and muscle activity typical for rewarded movements, and these measures distinguished between rewarded and unrewarded movements only after a few trials. This suggests that animals initially expected a reward with all movements. Apparently animals had an existing set of expectations derived from their previous experience in the familiar task. When confronted with a novel instruction, they could draw on these expectations to test the novel stimulus and later adapt their expectations and behavior to the new situation. Inappropriate or “erroneous” behavioral performance thus would reflect inappropriate but otherwise unchanged expectations evoked by the new instructions. The simplicity of evoking existing expectations and matching them to the experienced meaning of new instructions may explain the remarkable speed of learning.
Increased and decreased instruction responses
Most responses to instructions were either increased or decreased during learning. The new appearance or loss of responses during learning resulted in changes in trial selectivity. The present learning increases resembled the sustained increases in the presupplementary motor area during learning of sequential movements (Nakamura et al. 1998) and the “learning-static” increases in frontal and supplementary eye fields during the preparation of eye movements (Chen and Wise 1995b). The exclusive responses during learning resembled “learning-selective” activations in the supplementary eye field (Chen and Wise 1995a) and comparable changes in hippocampus (Cahusac et al. 1993). Reduced trial selectivity by increased responsiveness also was observed in prefrontal cortex (Watanabe 1990) and is compatible with the modification or breakdown of directional selectivity in the supplementary eye field (Chen and Wise 1996). These comparisons indicate a number of common neuronal learning mechanisms in different areas of frontal cortex.
The present decreases resembled the learning-static decreases outlasting the learning phase in frontal eye field and supplementary eye field neurons during the preparation of eye movements (Chen and Wise 1995b). Similar changes were seen in prefrontal (Niki et al. 1990) and hippocampal neurons (Cahusac et al. 1993).
The increased or decreased responses to the instructions during learning might simply reflect differences in visual stimulus features. However, learning changes were similar with different sets of instruction pictures. This suggests that many of the observed changes were not primarily related to differences in visual features. Responses dependent on visual features were found in the tail of caudate (Brown et al. 1995).
In contrast to visual stimulus features, mechanisms related to general arousal and attention may have contributed to increased responses. Most task events probably were processed with heightened general attention during learning. The instruction stimuli were novel and had unknown behavioral significance, and the other task events were processed in a less automatic manner. Environmental stimuli and behavioral reactions during spatial attention were accompanied by enhanced neuronal activations in parietal and temporal cortex (Bushnell et al. 1981; Mountcastle et al. 1981; Sakata et al. 1983; Treue and Maunsell 1996), although their potential contribution to the present changes is unclear.
The increased instruction responses during learning might influence the storage of new stimulus-response associations in memory. Associating new instructions with known behavioral reactions and reinforcers would require modifications of neuronal processing, most likely involving synaptic changes. These should take place during the acquisition of new instructions and subsequent consolidation. Neuronal activations presently were increased during both periods. The increased orbitofrontal responses were most likely due to increased synaptic input that might possibly lead to long-lasting changes at cortical synapses, as indeed found in dorsolateral prefrontal cortex (Otani et al. 1998).
The decreases outlasted the initial learning phase. The familiar instructions consisted of highly structured, fractal images which were used without changes for many months and thus constituted reliable reward predictors. By contrast, the learning pictures were simple geometric forms which were usually presented for <100 trials and then replaced by new pictures. Thus the reward prediction of these “disposable” stimuli was only transient. Some neurons might have been sensitive to the different characters of reward prediction between the two stimulus classes and reacted with decreases to the disposable instructions during learning.
Adaptation of reward expectation-related activity
The changes of reward expectations during learning, as evidenced by the modifications of behavioral reactions, were paralleled by changes in neuronal responses to the reward-predicting instructions and expectation-related activations preceding the reward. However, the reward expectation-related activations were not related to overt behavioral reactions, like arm movement or licking (Tremblay and Schultz 2000), and their changes during learning should not in any simple manner reflect the behavioral changes during learning. Thus it appears that initially inappropriate neuronal activations reflected inappropriate reward expectations. The activations adapted and became appropriate for the trial type during learning when expectations were matched to the new task contingencies. In analogy, the few activations reflecting the expectation of the auditory reinforcer were rare in initial learning trials and reappeared subsequently. Adaptations of reward expectation-related activity also were found in the orbitofrontal cortex and amygdala of rats (Schoenbaum et al. 1998). These data suggest a mechanism for adaptive learning in which existing expectation-related neuronal activity is matched to the new conditions rather than acquiring all relationships to task contingencies from scratch.
Neuronal changes in parallel with behavioral changes were found also in orbitofrontal cortex during reversals of visual stimuli (Rolls et al. 1996; Thorpe et al. 1983). In go-nogo learning set tasks similar to the one employed presently, some dorsolateral prefrontal neurons changed movement preparatory activity in close correspondence with changes in actually executed behavioral responses (Niki et al. 1990; Watanabe 1990). Further experiments may assess how inputs from these frontal areas might mediate the adaptations of reward expectation-related activity in orbitofrontal cortex.
Largely maintained responses to reinforcers
As a negative but puzzling finding, most reinforcer responses were unchanged between familiar and learning trials. Only a few responses to reward or auditory reinforcement occurred exclusively during learning. This contrasts strongly with the differences in function of reinforcers in learning versus established task performance (Schultz 1998). These results might indicate that orbitofrontal neurons use reinforcer information for the consolidation of learned reactions rather than for early learning. However, some orbitofrontal neurons responded particularly well to unpredictable rewards (Tremblay and Schultz 2000), a situation in which the learning of new task contingencies is particularly efficient (Rescorla and Wagner 1972). It would be interesting to resolve this conflicting issue concerning the use of reinforcer information during learning.
Comparison with learning-related changes in striatum
Neurons in the striatum (caudate nucleus, putamen, and ventral striatum) showed a considerable spectrum of activity related to various behavioral events, such as expectation of external events including rewards, preparation of movement, responses to reward-predicting or movement-eliciting stimuli, and responses to rewards (Alexander and Crutcher 1990; Apicella et al. 1992; Hikosaka et al. 1989; Schultz 1995). A large fraction of these activities reflected the expected reward (Hollerman et al. 1998). By contrast, orbitofrontal neurons showed a narrower range of task-related activities that predominantly reflected reinforcement processes (Rolls et al. 1996; Thorpe et al. 1983; Tremblay and Schultz 2000; see also Fig. 5). Nevertheless, there were considerable parallels between learning-related changes in both structures in exactly the same learning situation (Tremblay et al. 1998). Striatal neurons also showed adaptation of reward expectation-related activity. They also showed new responses or increases of existing responses to novel instructions, although increases frequently were shorter lasting during learning in striatum as compared with orbitofrontal cortex. The fractions of decreases or complete abolition of instruction responses were similar in the two structures. By contrast, orbitofrontal neurons showed twice as many increases as decreases. Although orbitofrontal inputs to ventral striatum (Eblen and Graybiel 1995; Haber et al. 1995; Selemon and Goldman-Rakic 1985) may not induce the large variety of task-related activities throughout the striatum, the presently observed orbitofrontal activities could influence some of the reward expectation-related changes in the ventral striatum during learning.
We thank B. Aebischer, J. Corpataux, A. Gaillard, A. Pisani, A. Schwarz, and F. Tinguely for expert technical assistance.
The study was supported by Swiss National Science Foundation Grants 31-28591.90, 31.43331.95, and NFP38.4038-43997. L. Tremblay received a postdoctoral fellowship from the Fondation pour la Recherche Scientifique of Quebec.
Present address of L. Tremblay: INSERM Unit 289, Hôpital de la Salpêtrière, 47 Boulevard de l'Hôpital, F-75651 Paris, France.
Address reprint requests to W. Schultz.
The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Copyright © 2000 The American Physiological Society