J Neurophysiol 98: 2038-2057, 2007. First published July 25, 2007; doi:10.1152/jn.01311.2006
0022-3077/07 $8.00

Model of Birdsong Learning Based on Gradient Estimation by Dynamic Perturbation of Neural Conductances

Ila R. Fiete1,2, Michale S. Fee3,5 and H. Sebastian Seung4,5

1Kavli Institute for Theoretical Physics, University of California, Santa Barbara, Santa Barbara; 2Center for Theoretical Biological Physics, University of California, San Diego, La Jolla, California; and 3McGovern Institute for Brain Research, 4Howard Hughes Medical Institute, and 5Brain and Cognitive Sciences Department, Massachusetts Institute of Technology, Cambridge, Massachusetts

Submitted 14 December 2006; accepted in final form 13 July 2007


    ABSTRACT
 
We propose a model of songbird learning that focuses on avian brain areas HVC and RA, involved in song production, and area LMAN, important for generating song variability. Plasticity at HVC -> RA synapses is driven by hypothetical "rules" depending on three signals: activation of HVC -> RA synapses, activation of LMAN -> RA synapses, and reinforcement from an internal critic that compares the bird's own song with a memorized template of an adult tutor's song. Fluctuating glutamatergic input to RA from LMAN generates behavioral variability for trial-and-error learning. The plasticity rules perform gradient-based reinforcement learning in a spiking neural network model of song production. Although the reinforcement signal is delayed, temporally imprecise, and binarized, the model learns in a reasonable amount of time in numerical simulations. Varying the number of neurons in HVC and RA has little effect on learning time. The model makes specific predictions for the induction of bidirectional long-term plasticity at HVC -> RA synapses.


    INTRODUCTION
 
Songbirds hatch not knowing how to sing. At first, a juvenile male is incapable of complex vocalizations and merely listens to its tutor's song. After the bird begins to vocalize, it gradually learns to produce an accurate copy of the tutor's song (Immelmann 1969; Price 1979), even when isolated from all other birds including its tutor (Brainard and Doupe 2002). In the latter period, auditory feedback of the bird's own song is crucial: if the bird is deafened after exposure to the tutor song but before it begins to sing, it cannot learn (Konishi 1965; Marler and Tamura 1964). Together, these findings suggest that the juvenile songbird memorizes a template of the tutor song and afterward learns by comparing its own vocalizations with the template.

A number of song-related avian brain areas have been discovered (Fig. 1A). Song production areas (Fig. 1A, open blue) include HVC (high vocal center) and RA (robust nucleus of the arcopallium), which generate sequences of neural activity patterns and through motoneurons control the muscles of the vocal apparatus during song (Hahnloser et al. 2002; Suthers and Margoliash 2002; Wild 1993, 2004; Yu and Margoliash 1986). Lesion of HVC or RA causes immediate loss of song (Nottebohm et al. 1976; Simpson and Vicario 1990). Other areas in the anterior forebrain pathway (AFP) appear to be important for song learning but not production (Fig. 1A, filled green), at least in adults. The AFP is regarded as an avian homologue of the mammalian basal ganglia thalamocortical loop (Farries 2004; Perkel 2004; Reiner et al. 2004). In particular, lesion of area LMAN (lateral magnocellular nucleus of the nidopallium) has little immediate effect on song production in adults, but arrests song learning in juveniles (Bottjer et al. 1984; Doupe 1993; Scharff and Nottebohm 1991). These facts suggest that LMAN plays a role in driving song learning, but the locus of plasticity is in brain areas related to song production, such as HVC and RA.


FIG. 1. Avian song pathways and the tripartite hypotheses. A: avian brain areas involved in song production and song learning. Premotor pathway (open) includes areas necessary for song production. Anterior forebrain pathway (filled) is required for song learning but not for song production. B: tripartite reinforcement learning schema: the actor produces behavior; the experimenter sends fluctuating input to the actor, producing variability in behavior that is used for trial-and-error learning; the critic evaluates the behavior of the actor and sends a reinforcement signal to it. For birdsong, the actor includes premotor song production areas HVC (high vocal center) and RA (robust nucleus of the arcopallium). Doya and Sejnowski hypothesized that the experimenter is LMAN (lateral magnocellular nucleus of the nidopallium). Location of the critic is unknown. C: plastic and empiric synapses. RA receives synaptic input from both HVC and LMAN. We will call the HVC synapses "plastic," in keeping with the hypothesis that these synapses are the locus of plasticity for song learning. Doya and Sejnowski conjectured that LMAN produces song variability by driving slow exploration in HVC -> RA synaptic strengths. However, recent data indicate that LMAN produces transient song perturbations (Kao et al. 2005) by driving rapid conductance fluctuations in postsynaptic RA neurons (Olveczky et al. 2005). We will refer to the connections from LMAN to RA as "empiric," in keeping with the hypothesis that they are specialized for experimentation.

 
Actor, critic, and experimenter

Nearly a decade ago, Doya and Sejnowski (1998) attempted to place such observations in a schema borrowed from mathematical theories of reinforcement learning. In this schema, learning is based on interactions between an actor and a critic (Fig. 1B). The critic evaluates the performance of the actor at a desired task. The actor uses this evaluation to change in a way that improves its performance. To learn by trial and error, the actor performs the task differently each time. It generates both good and bad variations, and the critic's evaluation is used to reinforce the good ones.1 Ordinarily it is assumed that the actor generates variations by itself. However, Doya and Sejnowski considered a schema in which the source of variation is external to the actor. We will call this source the experimenter.

Doya and Sejnowski proceeded to identify the three parts of their schema with specific areas of the avian brain. The actor was identified with HVC, RA, and the motor neurons that control vocalization. They hypothesized that the actor learns through plasticity at the synapses from HVC to RA (Fig. 1C). Based on evidence of structural changes like axonal growth and retraction that take place in the HVC to RA projection during song learning (Herrmann and Arnold 1991; Kittelberger and Mooney 1999; Mooney 1992; Sakaguchi and Saito 1996; Stark and Scheich 1997), this view is widely regarded as plausible. Curiously, no reliable protocols for the induction of activity-dependent plasticity at these synapses in vitro have yet been found (R Mooney, private communication), possibly for good reasons, which we consider in the DISCUSSION. For the experimenter and critic, Doya and Sejnowski turned to the anterior forebrain pathway, hypothesizing that the critic is area X and the experimenter is LMAN.

What is the current status of the Doya–Sejnowski tripartite schema? The actor part of their model was on firm ground, but their ideas about the critic and the experimenter were more speculative. Unfortunately, the location of the critic is still unknown, although it is widely believed to exist. Because the critic has not been found, the nature of its feedback is still unknown. One could imagine a powerful critic, which gives the actor specific instructions about how to improve song. This would place more of the computational burden of the learning problem on the critic. Or one could imagine a weak critic, which simply tells the actor whether performance is good or bad. This would place more of the burden of learning on the actor.

On the other hand, there is increasing support for their general idea of LMAN as an experimenter. First, we review evidence in support of LMAN as an experimenter. Then we argue that recent experiments show important departures from the assumptions of Doya and Sejnowski about the structure and dynamics of LMAN's input to RA, which call for a different formulation of learning with LMAN experimentation in the songbird system.

During song, LMAN neural spiking is quite variable from trial to trial and more irregular than activity in RA (Hessler and Doupe 1999b; Leonardo 2004). Moreover, mean activity in LMAN correlates with the overall song variability: In adult birds, LMAN activity is low during song directed at females, which tends to be extremely stable and stereotyped, and much higher during the more variable undirected song (Hessler and Doupe 1999b; Kao et al. 2005). Although, as noted earlier, LMAN lesions have little effect on adult song, especially during directed bouts, closer inspection reveals that LMAN lesions reduce the slight variability present in adult undirected song (Kao et al. 2005). In juveniles, there is much greater trial-to-trial song variability compared with that of adults; this is dramatically reduced after LMAN lesions (Scharff and Nottebohm 1991). Recently it was shown that reversible pharmacological inactivation of juvenile LMAN with tetrodotoxin (TTX) or muscimol leads to immediate reduction in song variability (Kao et al. 2005; Olveczky et al. 2005). All of this evidence suggests that LMAN generates song variability through its projection to RA.2

But how, mechanistically and functionally, does LMAN drive song variability and learning? Doya and Sejnowski proposed that the role of LMAN input to RA is to produce a fluctuation that is static over the duration of a song bout, directly in the synaptic strengths from premotor nucleus HVC to RA. From a functional perspective, the model of Doya and Sejnowski is akin to "weight perturbation" (Dembo and Kailath 1990; Seung 2003; Williams 1992) and relatively easy to implement: a temporary but static HVC–RA weight change that lasts the duration of one song causes some change in song performance. If performance is good, the critic sends a reinforcement signal that makes the temporary static perturbation permanent. From a neurobiological perspective their model requires machinery whereby N-methyl-D-aspartate (NMDA)–mediated synaptic transmission from LMAN to RA can drive synaptic weight changes that remain static over the 1- to 2-s duration of song, in the heterosynaptic HVC–RA connections. However, LMAN activity in the songbird is dynamic and variable throughout song, evolving on a 10- to 100-ms timescale (Hessler and Doupe 1999a,b; Leonardo 2004), at odds with the assumption that at the beginning of song LMAN triggers an instantaneous perturbation in the HVC–RA weights, which is then held constant throughout the song.

Next, in recent experiments, transient stimulation in LMAN leads to transient, subsyllable-long changes in either song pitch or amplitude (Kao et al. 2005). Presumably, local stimulation excites local myotopic ensembles of LMAN neurons; if this LMAN activity led to static perturbations of a set of HVC synapses projecting to a myotopic RA group, it would have produced changes in pitch or amplitude that were not transient, but lasted to produce consistent biases in pitch or amplitude throughout one song iteration. In Olveczky et al. (2005), blocking NMDA receptor currents in RA causes the same reduction in song variability as does LMAN inactivation,3 indicating that the effects of LMAN activity in RA are through ordinary glutamatergic synaptic transmission into RA neurons. In short, LMAN appears to drive fast, transient song fluctuations on a subsyllable level, effected by ordinary excitatory transmission that drives dynamic postsynaptic membrane conductance fluctuations in the postsynaptic RA neurons. This picture of rapidly fluctuating glutamatergic input from LMAN driving fast conductance perturbations in RA is quite different, in its neurobiological mechanism and mathematical implications for reinforcement learning, from the Doya and Sejnowski model based on slow modulatory influences on HVC -> RA weights.

Finally, for song learning, synapses from different HVC neurons to the same postsynaptic RA neuron must have the flexibility to change in opposite directions. Within the weight-perturbation model of Doya and Sejnowski, this requires that each synapse from HVC onto a single RA neuron receive independent perturbations in different directions, relative to other synapses from different HVC neurons onto the same RA neuron. In neurobiological terms, this could be possible if, for each synapse from a distinct HVC neuron onto an RA neuron, there were a separate LMAN input. However, this seems unlikely considering that each RA neuron receives only about 50 synapses from LMAN (Canady et al. 1988; Herrmann and Arnold 1991) compared with about 1,000 synapses from ~200 different HVC neurons (Kittelberger and Mooney 1999).

Next, we describe a learning rule that, like the weight-perturbation scheme used by Doya and Sejnowski, also belongs in the broad category of actor–critic reinforcement learning rules. However, the rule is distinct functionally and in its neurobiological implications from weight-perturbation–like schemes. Applied to the song system, the rule is fully consistent with the physiological and anatomical findings on LMAN input to RA and with the phenomenology of song learning.

Learning with empiric synapses

The goal of this work is to relate the high-level concept of reinforcement learning by the tripartite schema to a biologically realistic lower level of description in terms of microscopic events at synapses and neurons in the birdsong system, to demonstrate song learning in a network of realistic spiking neurons, and to examine the plausibility of reinforcement algorithms in explaining biological fine motor skill learning with respect to learning time in the birdsong network.

The present model is based on many of the same general assumptions that were made by Doya and Sejnowski. We assume a tripartite actor–critic–experimenter schema. The critic is weak, providing only a scalar evaluation signal. The HVC sequence is fixed, and only the map from HVC to the motor neurons is learned, through plasticity at the HVC -> RA synapses.4 LMAN perturbs song through its inputs to the song premotor pathway. However, the structure and dynamics of LMAN inputs, and their influence on learning, are different, with distinct neurobiological implications. In keeping with our hypothesis that the function of LMAN drive to RA is to perform "experiments" for trial-and-error learning, the connections from LMAN to RA will be called "empiric" synapses (Fig. 1C).

We make a specific theoretical proposal for synaptic reinforcement learning in the case of birdsong, illustrated in Fig. 2. Functionally, our scheme is similar to "node perturbation" (Fiete and Seung 2006; Werfel et al. 2005; Xie and Seung 2004) because it relies on independent perturbations delivered to neurons (rather than to individual plastic synapses, as in weight perturbation). From a neurobiological perspective, this scheme is more realistic, for two reasons. First, it is in better agreement with the microanatomy of LMAN–RA synapses because it only requires one independent LMAN input per RA neuron, rather than per HVC–RA synapse. Second, the perturbation to each neuron in our model is temporally varying on a rapid timescale, not static, during song. This is consistent with activity in LMAN during song production and song learning.


FIG. 2. A proposal for plasticity rules at HVC -> RA synapses. Synaptic plasticity rule for gradient estimation by dynamic perturbation of conductances. We use the actor–critic–experimenter schema (Fig. 1B) and distinguish between plastic and empiric synapses (Fig. 1C). A: neurons in the experimenter (LMAN) dynamically perturb the conductances of RA neurons through empiric synapses. Critic signals improvements in performance and globally broadcasts a reinforcement signal to all plastic synapses (HVC -> RA). B: if coincident activation of a plastic synapse and empiric synapse onto the same RA neuron is followed by reinforcement, then the plastic synapse is strengthened. If activation of the plastic synapse without the empiric synapse is followed by reinforcement, the plastic synapse is weakened.

 
We assume that each RA neuron receives many plastic synaptic inputs from HVC, in addition to a single empiric synapse from LMAN that dynamically drives the postsynaptic RA conductance throughout song by ordinary excitatory neurotransmission.5 The dynamic postsynaptic conductance perturbations must somehow be translated into appropriate instructions for plasticity in the incoming plastic synapses. Because each RA neuron continually receives conductance inputs from both HVC and LMAN, and both vary with time during song, the challenge is to understand how an RA neuron might use dynamic LMAN perturbations to extract information about the correct long-term weight changes for its HVC inputs.

In our proposal, the conductance of the plastic synapse from neuron j in HVC to neuron i in RA is given by $W_{ij} s_{ij}^{\mathrm{HVC}}(t)$, where the synaptic activation $s_{ij}^{\mathrm{HVC}}(t)$ determines the time course of conductance changes, and the plastic parameter $W_{ij}$ determines their amplitude. Changes in $W_{ij}$ are governed by the plasticity rule

$$\frac{dW_{ij}}{dt} = \eta\, R(t)\, e_{ij}(t) \tag{1}$$

The positive parameter $\eta$, called the learning rate, controls the overall amplitude of synaptic changes. The eligibility trace $e_{ij}(t)$ is a hypothetical quantity present at every plastic synapse. It signifies whether the synapse is "eligible" for modification by reinforcement and is based on the recent activation of the plastic synapse and the empiric synapse onto the same RA neuron

$$e_{ij}(t) = \int_0^t dt'\, G(t - t') \left[ s_i^{\mathrm{LMAN}}(t') - \langle s_i^{\mathrm{LMAN}} \rangle \right] s_{ij}^{\mathrm{HVC}}(t') \tag{2}$$

Here $s_i^{\mathrm{LMAN}}(t)$ is the conductance of the empiric (LMAN -> RA) synapse onto the i th RA neuron. The temporal filter $G(t)$ is assumed to be nonnegative and its shape determines how far back in time the eligibility trace can "remember" the past.

An important aspect of Eq. 2 is that the instantaneous activation of the empiric synapse is measured relative to its own expected activity $\langle s_i^{\mathrm{LMAN}}(t) \rangle$. This subtraction of average activation in the empiric synapse enables bidirectional synaptic changes, even if the reinforcement signal $R(t)$ is constrained to be nonnegative.6 In our model, each empiric synapse is driven by a Poisson spike train from an LMAN neuron with constant firing rate, so $\langle s_i^{\mathrm{LMAN}} \rangle$ is a fixed constant throughout song and throughout learning (and thus easy to estimate by a simple time average) for every RA neuron.
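As a rough illustration of how Eqs. 1 and 2 might be integrated in discrete time, the sketch below follows a single plastic synapse over one trial. It is not the authors' simulation code: the exponential form of the filter G, the time step, and all names and parameter values (`run_trial`, `tau`, `eta`) are assumptions made for this example.

```python
import math

def run_trial(w, s_hvc, s_lman, s_lman_mean, reinforcement,
              eta=0.01, tau=5.0, dt=1.0):
    """Integrate Eqs. 1 and 2 for one plastic synapse over one trial.

    Assumption: G(t) = exp(-t / tau), so the eligibility trace obeys a
    leaky-integrator update. All arguments after `w` are per-time-step
    sequences except `s_lman_mean`, the mean empiric activation.
    """
    decay = math.exp(-dt / tau)
    e = 0.0  # eligibility trace e_ij
    for s_h, s_l, r in zip(s_hvc, s_lman, reinforcement):
        # Eq. 2: filtered product of plastic activation and the
        # mean-subtracted empiric activation.
        e = decay * e + dt * (s_l - s_lman_mean) * s_h
        # Eq. 1: weight change is reinforcement times eligibility.
        w += dt * eta * r * e
    return w

if __name__ == "__main__":
    hvc = [1, 0, 0, 0, 0]     # plastic synapse active at step 0
    reward = [0, 0, 1, 0, 0]  # delayed positive reinforcement at step 2
    print(run_trial(1.0, hvc, [1, 0, 0, 0, 0], 0.5, reward))  # empiric active
    print(run_trial(1.0, hvc, [0, 0, 0, 0, 0], 0.5, reward))  # empiric silent
```

With the empiric synapse active alongside the plastic one, the weight ends above its starting value; with the empiric synapse silent, it ends below. The delayed reward still has an effect because the trace decays gradually rather than vanishing at once.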

The preceding equations have the advantage of mathematical precision, but it is helpful to have verbal formulations of the conditions for synaptic strengthening and weakening, illustrated in Fig. 2B. Suppose an empiric synapse and a plastic synapse onto the same RA neuron are activated at the same time. By Eq. 2, the eligibility trace tends to be positive for some time after. If positive reinforcement arrives during this time interval, then $W_{ij}$ is increased by Eq. 1. Therefore the condition for synaptic strengthening can be summarized by the following rule.

R1: If coincident activation of a plastic synapse and an empiric synapse onto the same RA neuron is followed by positive reinforcement, then the plastic synapse is strengthened.

Now suppose that the plastic synapse is active without the empiric synapse at time t. Then the eligibility trace tends to be negative for some time after that. If positive reinforcement arrives during this time interval, then $W_{ij}$ is decreased. So the condition for synaptic weakening is summarized by this rule.

R2: If activation of a plastic synapse without activation of the empiric synapse onto the same RA neuron is followed by positive reinforcement, then the plastic synapse is weakened.

For negative reinforcement, the signs of the synaptic changes in R1 and R2 are reversed.7

Having described our model of synaptic plasticity both in equations and words, let us now examine why it is appropriate for improving performance during trial-and-error learning. First, consider the intuitive justification for R1. Activation of an empiric synapse constitutes "extra" input to an RA neuron at a particular time. Subsequent positive reinforcement suggests that this "extra" input is better. However, the activation of the empiric synapse onto that particular neuron at that time was a chance event. To consolidate this chance occurrence, which led to positive reinforcement, and thereby ensure that in future song trials that specific RA neuron fires a little extra at that specific moment in time, the plastic input synapses active at that moment are strengthened. This plasticity rule causes synaptic changes that allow modifications in song to be local both in time (during the song trajectory) and space (at a neuron level).

To understand rule R2, note that each empiric synapse has a nonzero average level of activation, which is determined by the firing rate of the presynaptic LMAN neuron. If the empiric synapse is not active at a particular time, it means that the RA neuron is receiving less input than usual for that moment in time. Subsequent positive reinforcement suggests that this deficit of input is better. This LMAN-driven chance deficit is consolidated for future trials within the HVC–RA pathway by weakening the plastic synapses that were active at that time.

R1 and R2 describe how the presence or absence of chance LMAN input to RA, if followed by positive reinforcement, causes HVC–RA synapses to undergo either long-term potentiation (LTP, R1) or long-term depression (LTD, R2). Because the presence or absence of empiric (LMAN) input determines the sign of synaptic change when reinforcement is present, LMAN's role in the preceding rules might be mistaken as supervisory. We note, however, that in our theoretical formulation and birdsong model, output performance does not affect patterns of activity in the empiric (LMAN) input, which would be a requirement if LMAN were sending supervisory signals to RA based on output performance. Furthermore, if reinforcement is held constant, or if it varies independently of eligibility, then rules R1 and R2 produce no net (average) change in synaptic weights: over many trials, synaptic strengthening and weakening due to R1 and R2 cancels, even when LMAN is active. This can readily be seen from Eq. 2, where the average of synaptic eligibility alone is always zero. It is only when reinforcement actually covaries with fluctuations of the synaptic eligibility that there is a net nonzero change in synaptic weight.
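The cancellation argument can be written out in a few lines. The derivation below is a sketch that assumes the eligibility trace is a filtered product of the plastic activation and the mean-subtracted empiric activation, and that the HVC activation is the same on every trial (the HVC sequence is fixed in the model), so it can be taken outside the trial average.

```latex
\begin{align}
\langle e_{ij}(t) \rangle
  &= \int_0^t dt'\, G(t-t')\,
     \big\langle s_i^{\mathrm{LMAN}}(t') - \langle s_i^{\mathrm{LMAN}} \rangle \big\rangle\,
     s_{ij}^{\mathrm{HVC}}(t') = 0, \\
\Big\langle \frac{dW_{ij}}{dt} \Big\rangle
  &= \eta\, \langle R(t)\, e_{ij}(t) \rangle
   = \eta \big[ \langle R(t) \rangle \langle e_{ij}(t) \rangle
     + \mathrm{Cov}\big(R(t), e_{ij}(t)\big) \big]
   = \eta\, \mathrm{Cov}\big(R(t), e_{ij}(t)\big).
\end{align}
```

Under these assumptions the average weight change is proportional to the covariance between reinforcement and eligibility, so it vanishes whenever reinforcement is constant or independent of the eligibility trace.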

Let us more closely examine how the demands of the desired trajectory—reflected in the reinforcement signal—set the balance between R1 and R2 to determine the actual direction of net synaptic change. Consider a scenario where overall performance would improve with an increase in the activity of RA neuron A at time t in the trajectory, a decrease in its activity at time t' in the trajectory, and be unaffected by changes in its activity at time t''. How do plasticity rules R1 and R2 combine to produce these changes? In this hypothetical scenario the network will tend to receive positive reinforcement in song trials where neuron A happens to be more active at time t than usual for that time, due to chance input from an empiric LMAN synapse. In trials where the empiric synapse to neuron A is quiescent at t, the network will tend to get less or no positive reinforcement because the neuron is less active than usual for that time. In short, for this scenario reinforcement is greater after empiric input to neuron A at time t than without, causing R1 to dominate over R2 and resulting in a net LTP of those regular inputs to neuron A that were active at time t. Conversely, because the trajectory would be better with less-than-usual activity in neuron A at time t', reinforcement will be larger in trials where LMAN inputs to A are quiescent at t', meaning that R2 will dominate and produce a net LTD of HVC synapses to A that were active at t'. Finally, because reinforcement does not depend on the activity of neuron A at t'', then reinforcement will arrive with equal likelihood after quiescence or activity in the LMAN input at t'', and the effects of R1 and R2 will cancel, resulting in zero average synaptic change for inputs to A that were active at t''.

Gradient learning

In the preceding text our synaptic plasticity rules were justified with intuitive arguments. They can also be understood using a formal mathematical theory developed elsewhere (Fiete and Seung 2006). Under reasonable assumptions, the rules, which are based on dynamic conductance perturbations of the actor neurons, perform stochastic gradient ascent on the expected value of the reinforcement signal.8 The antagonism between plasticity rules R1 and R2 ensures that they compute the subtraction that is the essence of the definition of a gradient. This means that song performance as evaluated by the critic is guaranteed to improve on average. The guarantee holds even if the synapses are embedded in a network that is very complex: for example, the network may be recurrent and consist of conductance-based spiking neurons with synapses that display short-term plasticity. The guarantee is also broadly independent of model details or parameter choices.
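To make the gradient-estimation idea concrete, here is a deliberately simplified toy: one linear, non-spiking unit trained by node perturbation. It is not the spiking model of this paper. The quadratic reward, the two-valued perturbation, the baseline subtraction of the unperturbed reward, and every name and parameter value are assumptions chosen for this sketch.

```python
import random

def reward(y, target):
    """Toy analog reinforcement: higher when output is closer to target."""
    return -(y - target) ** 2

def node_perturbation_step(w, x, target, eta=5.0, sigma=0.1, rng=random):
    """One trial: perturb the unit's output, then credit the active inputs."""
    y = sum(wj * xj for wj, xj in zip(w, x))
    xi = rng.choice((-sigma, sigma))  # zero-mean perturbation of the node
    # Change in reward caused by the perturbation (baseline-subtracted).
    delta_r = reward(y + xi, target) - reward(y, target)
    # Update each weight by reward change x perturbation x input activity.
    return [wj + eta * delta_r * xi * xj for wj, xj in zip(w, x)]

def train(steps=200, seed=0):
    """Run repeated trials; return final weights and final squared error."""
    rng = random.Random(seed)
    w, x, target = [0.0, 0.0], [1.0, 0.5], 2.0
    for _ in range(steps):
        w = node_perturbation_step(w, x, target, rng=rng)
    y = sum(wj * xj for wj, xj in zip(w, x))
    return w, (y - target) ** 2

if __name__ == "__main__":
    weights, err = train()
    print(weights, err)
```

Averaged over the two perturbation signs, the update is proportional to the negative gradient of the squared error with respect to each weight, so the error shrinks from its initial value of 4.0 to a small residual set by the perturbation size.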

Gradient learning can be regarded as a method for (approximately) solving a computational problem: finding a configuration of synaptic strengths that optimizes the performance of a network as evaluated by a critic. In general, this optimization problem is nontrivial. The performance of the network is determined by the collective effects of a large number of synapses and neurons. The role of any given synapse in performance may not be obvious, given that its effect may be exerted through multiple polysynaptic or even recurrent pathways involving both excitation and inhibition. Furthermore, this role may shift over time as the network changes during learning.

Is the principle of gradient learning also used by the brain? One might be skeptical that such a formal principle is relevant for neurobiology. However, gradient learning has a property that is important for brains: it is very robust. Even when properties of the actor and critic are varied, the plasticity rules are still guaranteed to improve average network performance.

The role of numerical simulations

This paper contains the results of many numerical simulations, which might seem irrelevant given that the principle of gradient learning guarantees that the plasticity rules will improve performance. Why are the simulations important? Although there are mathematical guarantees that gradient learning will improve performance, there is no assurance about how fast these improvements will be. If learning turns out to take longer than the lifetime of a zebra finch, then our model of learning, based on the general principle of random single-neuron experimentation and global reinforcement, could be rejected. Thus learning speed is the main issue explored in our numerical simulations. We explore how learning time scales with the number of neurons (to obtain an estimate of learning speed in a realistically sized song network) and with the precision and delay of the reinforcement signal.

Reinforcement learning in its essence is a parallel blind local search in the space of plastic parameters to climb a hill (the reinforcement function, which reflects overall performance on the desired task). The number of search dimensions equals the number of independently perturbed parameters. In algorithms based on synaptic weight perturbation (Dembo and Kailath 1990; Seung 2003), the search dimension is the number of weights, whereas in algorithms based on node perturbation (Fiete and Seung 2006; Xie and Seung 2004), like the one proposed here, the search dimension is the number of perturbed neurons multiplied by the number of independent time steps in the trajectory. Because optimization by blind multiparameter local search is slow, reinforcement learning might similarly be too slow. Indeed, previous theoretical work on reinforcement learning algorithms shows that in certain feedforward networks, learning time scales proportionally with the number of plastic parameters or with the dimensionality of the input perturbations (Cauwenberghs 1993; Werfel et al. 2005).

Existing models of song learning are far from biologically realistic in network size, output degrees of freedom, neural dynamics, and characteristics of the reinforcement signal (temporal delay or broadening), and do not explore how convergence speed and final error would be affected if these properties were made to approach those found in the actual songbird. In fact, even in a small, simplified neural network model with small numbers of output degrees of freedom, Doya and Sejnowski (2000) reported that learning with independent random perturbations from LMAN resulted in relatively poor convergence to the tutor song. To remedy this situation, they assumed that LMAN computes and carries an instructive gradient signal for HVC–RA synaptic change, in addition to a random component. In addition, learning with a weight-perturbation scheme can be significantly slower and scale more poorly with network size than node-perturbation–like rules such as ours, as demonstrated in a network similar to the birdsong network (Werfel et al. 2005). Thus existing work provides few results on the possibility or accuracy of song learning based on uncorrelated random perturbations from LMAN in full-scale, realistic network models of birdsong acquisition.

In the bird there are as many as 8,000 RA neurons (and therefore as many potentially independent exploratory perturbations) and 20,000 × 8,000 ~ 10^8 plastic HVC–RA weights. We show that even in such large networks, it is possible at least in principle for independent random neural perturbation to produce biologically realistic learning.
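The two search-dimension counts described above can be compared with simple arithmetic. In the sketch below the neuron counts come from the text, while the number of independent time steps per motif (100, for example a 1-s song with perturbations decorrelating every 10 ms) is a hypothetical value chosen only for illustration.

```python
def weight_perturbation_dim(n_pre, n_post):
    """Search dimension when every plastic weight is perturbed independently."""
    return n_pre * n_post

def node_perturbation_dim(n_post, n_time_steps):
    """Search dimension when each neuron is perturbed at each time step."""
    return n_post * n_time_steps

if __name__ == "__main__":
    n_hvc, n_ra = 20_000, 8_000  # counts quoted in the text
    n_steps = 100                # assumption: independent time steps per motif
    print(weight_perturbation_dim(n_hvc, n_ra))  # 160000000
    print(node_perturbation_dim(n_ra, n_steps))  # 800000
```

Under that assumed number of time steps, the node-perturbation search space is about 200 times smaller than the weight-perturbation one; the ratio depends directly on the assumed time resolution.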

To challenge our plasticity rules, we have made our model of song production quite complex. Unlike any existing models of sensorimotor learning in the song pathway, the model neurons in HVC and RA are biophysically realistic, generating spikes and interacting through synaptic conductances. The spiking activity of the network is converted into an acoustic signal by a simple model of the vocal organ. To further challenge our plasticity rules, we have intentionally "crippled" the critic's reinforcement signal, to make it more difficult to learn from. The critic is modeled as a template matcher that compares the acoustic signal with a template drawn from real zebra finch song. The critic's signal reaches the actor only after a temporal delay, is temporally imprecise, and is binary rather than analog. These features could be realistic if the critic's signal is broadcast by secretion of a neuromodulator. The question is whether the plasticity rules will still be able to learn in a reasonable amount of time.
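One way to picture a "crippled" reinforcement signal is the sketch below, which takes an analog match score per time step and makes it binary, delayed, and temporally smeared. The thresholding rule, the delay-by-shifting, and the smearing window are assumptions of this example, not the critic used in the simulations, and `cripple_reinforcement` is a hypothetical name.

```python
def cripple_reinforcement(score, threshold, delay, width):
    """Turn an analog per-step score into a delayed, smeared, binary signal.

    score:     analog template-match values, one per time step
    threshold: values above this count as positive reinforcement
    delay:     number of time steps by which the signal is shifted
    width:     smearing window in steps (>= 1); a positive event stays
               "on" for this many steps, blurring its timing
    """
    n = len(score)
    binary = [1 if s > threshold else 0 for s in score]
    delayed = ([0] * delay + binary)[:n]
    # A step is reinforced if any of the last `width` delayed values is 1.
    return [max(delayed[max(0, t - width + 1): t + 1]) for t in range(n)]

if __name__ == "__main__":
    score = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1]
    print(cripple_reinforcement(score, threshold=0.5, delay=2, width=2))
    # [0, 0, 0, 1, 1, 0]
```

The good moment at step 1 is reported only at steps 3 and 4, so the actor cannot tell from the signal alone exactly when the improvement occurred. An eligibility trace with a suitable memory is what would bridge that gap.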

Although our models of song production and evaluation are highly complex, one should not forget that the underlying model of synaptic plasticity is extremely simple: it consists of the two equations (Eqs. 1 and 2). It is this simple model that is being tested herein. The complexities are there to make the test challenging.


    METHODS
 
The model

ACTOR. In our model of song production, a model neural network controls a source–filter model of the avian vocal organ. Neurons interact through synaptic conductances and generate spikes, unlike past models based on nonspiking neurons (Doya and Sejnowski 1998; Troyer and Doupe 2000).

The network is composed of layers that represent HVC, RA, and motor neurons (Fig. 1). The connectivity of the network is feedforward, except for weak global inhibition in RA. Two output units represent motor neuron pools. They low-pass filter and sum the synaptic currents from RA, to produce a pair of time-varying control signals for the vocal organ.

In zebra finches, each RA-projecting HVC neuron generates a single burst of spikes at a stereotyped time during a song motif (Hahnloser et al. 2002). The burst onset times of the population of neurons are distributed throughout the song motif. To simulate these short bursts, we stimulate each HVC neuron in our model with a single current pulse during the song. This pattern of activity remains unchanged during learning.

Our source–filter model of the syrinx, the avian vocal organ, is mathematically similar to digital models of speech production (Rabiner and Schafer 1978). Oscillatory motions of the syrinx are driven by air flow, yielding an acoustic output of a set of harmonically related frequencies. The pitch or fundamental frequency of the harmonics is adjusted by muscles that control the tension of the syringeal fold (Goller and Larsen 1997; Suthers et al. 1999; Warner 1972; Wild 1997), whereas amplitude is partially controlled by air flow. The source in our source–filter model is a pulse train, yielding an acoustic output of a set of harmonically related frequencies, with pitch and amplitude controlled by the two time-varying outputs of the motor network. In the bird, the vocal tract and beak filter the broad spectral content of the syringeal output, and may also directly affect the syringeal oscillations (Beckers et al. 2003; Nowicki 1987; Suthers et al. 1999). The filter in our source–filter model is based on ten linear predictive coefficients, which are generated from zebra finch song recordings to produce a broad spectral envelope similar to that of real songs. For simplicity, the filter is static over the duration of the simulated song and does not change with learning.

Our use of the source–filter model is a compromise between simplicity and realism. More realistic models of the syrinx have relied on physics-based simulations (Fletcher 1988; Titze 1988), and display both quasiperiodic and chaotic behaviors. Their quasiperiodic behaviors are similar to those of our source–filter model, but the physics-based models are much more time consuming to simulate.

CRITIC. The critic compares the pitch and amplitude of the generated song against those of the template, which is a recording of real zebra finch song, and sends a delayed comparison of the two back to the song network. At every instant in time, the error of the model song with respect to the template is computed as the sum of the squares of the pitch and amplitude differences. The critic's signal is "crippled" in several ways to make learning more difficult and thus to test the capabilities of our model: First, the critic's signal is binarized, rather than analog. Whenever the error is below a similarity threshold, then the critic provides a reinforcement of strength one; otherwise its signal is zero. Second, the signal is temporally delayed by 50 ms. Third, the signal is temporally broadened in some simulations.

There is a similarity threshold for each moment of song, set by the average performance at that moment in the last few trials. This adaptive threshold ensures that the critic gives positive reinforcement roughly 50% of the time. If the threshold were set improperly, then the critic would be hypercritical (never reinforcing anything) or uncritical (reinforcing everything). Our use of an adaptive threshold is similar to baseline comparison in reinforcement learning, which can result in faster learning and lower final error (Dayan 1990).

In our model, the critic's signal reaches HVC → RA synapses after a delay of $T_{delay}$ = 50 ms relative to the RA neural activities that gave rise to it. This number was inferred as follows. First, the delay from RA activity to acoustic output is estimated to lie in the range from 20 ms (Fee et al. 2004) to 45 ms (Troyer and Doupe 2000). The lower of these two numbers, when added to an estimated auditory processing delay of 30 ms (Troyer and Doupe 2000), yields $T_{delay}$ = 50 ms.

In some simulations, the critic's signal is temporally broadened in addition to being delayed. This is done by low-pass filtering with a 50 ms time constant (see Numerical details).

EXPERIMENTER. In each time interval [t, t + dt] during song, LMAN neurons fire a spike with probability $p = \lambda\,dt$, with firing rate $\lambda$ = 80 Hz chosen to be consistent with the averaged spiking rate of putative RA-projecting single LMAN units recorded in the singing bird (Leonardo 2004). This underlying firing rate is taken to be constant throughout song and over learning. LMAN spike trains are regenerated, and thus vary, from iteration to iteration.
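
The spike-generation step described above amounts to one Bernoulli draw per time bin. The following is a minimal sketch, assuming the time grain dt = 0.2 ms given under Numerical details; the function name `lman_spike_train` and the seed values are illustrative and are not taken from the original simulation code.

```python
import random

def lman_spike_train(duration_ms, rate_hz=80.0, dt_ms=0.2, seed=None):
    """Return a list of 0/1 values, one per time bin (Bernoulli
    approximation to a constant-rate Poisson process)."""
    rng = random.Random(seed)
    p_spike = rate_hz * dt_ms / 1000.0      # p = lambda * dt, dt in seconds
    n_bins = int(round(duration_ms / dt_ms))
    return [1 if rng.random() < p_spike else 0 for _ in range(n_bins)]

# One 300-ms motif; a new seed per iteration gives a fresh spike train.
train = lman_spike_train(duration_ms=300.0, seed=0)

# A long train is used only to check that the empirical rate is near 80 Hz.
long_train = lman_spike_train(duration_ms=100000.0, seed=1)
empirical_rate_hz = sum(long_train) / 100.0   # spikes per second over 100 s
```

Calling the function with a different seed on every song iteration reproduces the property that LMAN spike trains vary from iteration to iteration while the underlying rate stays fixed.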

Synaptic plasticity

As described earlier, the reinforcement signal R(t) is delayed by 50 ms after the neural events that gave rise to the song that it evaluates. Therefore reinforcement starts 50 ms after the song has begun and ends 50 ms after the song has ended. Equation 1 is applied during this period. The learning rate is $\eta$ = 0.0002.

The temporal filter $G(t) = t^n e^{-t/\tau_e}$ was used in Eq. 2, with $\tau_e$ = 10 ms and n = 5. The peak of this filter is at $T_{delay} = n\tau_e$, so the eligibility trace can be regarded as a version of the instantaneous eligibility that is delayed by $T_{delay}$ = 50 ms to match the time delay in the reinforcement signal. However, to be realistic we assume that delaying the eligibility trace comes at the cost of introducing temporal imprecision. The width of the filter, defined as the time between the two inflection points flanking the delta-function response peak, is $2\sqrt{n}\,\tau_e$, so a temporal imprecision of 45 ms is introduced by filtering to produce a 50 ms delay. In the simulations, the time-averaged LMAN activity that appears in Eq. 2 is computed by averaging the LMAN spike train of the current trial. It could be implemented instead by a low-pass filter at every LMAN → RA synapse.
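
The peak location and width quoted above can be checked numerically. This is a minimal sketch, assuming the filter has the unnormalized form G(t) = t^n exp(-t/τe) for t ≥ 0, which is the form consistent with a peak at nτe; the function name `eligibility_filter` and the grid resolution are illustrative choices.

```python
import math

def eligibility_filter(t_ms, n=5, tau_e_ms=10.0):
    """G(t) = t^n * exp(-t / tau_e) for t >= 0, else 0 (unnormalized)."""
    if t_ms < 0:
        return 0.0
    return (t_ms ** n) * math.exp(-t_ms / tau_e_ms)

n, tau_e = 5, 10.0

# Locate the peak numerically on a 0.01-ms grid spanning 0 to 200 ms.
grid = [k * 0.01 for k in range(20001)]
peak_time = max(grid, key=eligibility_filter)

# For this functional form, setting G''(t) = 0 gives inflection points at
# tau_e * (n - sqrt(n)) and tau_e * (n + sqrt(n)), so the width is:
width = 2.0 * math.sqrt(n) * tau_e
```

With n = 5 and τe = 10 ms the numerical peak falls at 50 ms and the width evaluates to about 44.7 ms, matching the rounded 45 ms in the text.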

There is no clear experimental evidence for plasticity in the RA → motor output connections, although it is possible these weights are also learned. In addition, the rules described in R1 and R2 could be used in the recurrent RA synapses at the same time as in the HVC → RA synapses, and would drive gradient learning on the whole network. We have focused our attention on the HVC → RA synapses because they are widely expected to be involved in song learning (Herrmann and Arnold 1991; Kittelberger and Mooney 1999; Mooney 1992; Sakaguchi and Saito 1996; Stark and Scheich 1997).

Numerical details

VOLTAGE AND CONDUCTANCE DYNAMICS. The membrane potentials V of all neurons in HVC and RA are governed by

$$C_m \frac{dV_i}{dt} = -g_L (V_i - V_L) - g_{E,i}(t)\,(V_i - V_E) - g_{I,i}(t)\,(V_i - V_I) \qquad (3)$$
with intrinsic leak conductance $g_L$ so that $C_m/g_L$ defines the membrane time constant, and with excitatory and inhibitory synaptic conductances $g_E$ and $g_I$, respectively. The reset condition is $V_i \to V_{reset}$ when $V_i$ crosses the threshold voltage $V_\theta$; this threshold-reset event represents a voltage spike followed by repolarization. Following a spike in the ith neuron in HVC or RA, the synaptic activation $s_{ki}(t)$ in the synapse from neuron i to neuron k is incremented by one. Between spikes it decays with time constant $\tau_s$

$$\frac{ds_{ki}}{dt} = -\frac{s_{ki}}{\tau_s} + \sum_a \delta(t - t_i^a) \qquad (4)$$
where $t_i^a$ is the time of the ath spike of neuron i. In our simulations, $s_{ki}(t) = s_i(t)$. For notational clarity, we denote synaptic activations in HVC, RA, and LMAN by $s_i^{HVC}(t)$, $s_i^{RA}(t)$, and $s_i^{LMAN}(t)$, respectively. Note that although we have used integrate-and-fire neurons and relatively simple time courses for synaptic dynamics, the learning rule is guaranteed to perform stochastic gradient ascent on the reinforcement R even for more complicated neuron models (e.g., Hodgkin–Huxley) and synaptic time courses (Fiete and Seung 2006).

RA neurons receive excitatory synaptic inputs from HVC and LMAN, and global (recurrent) inhibitory inputs due to activity in RA.

Two nonspiking motor output units with time constant $\tau_m$ and tonic activations $b_i$ sum the synaptic activations from RA, through a fixed set of RA–output weights A

$$\tau_m \frac{dm_i}{dt} = -m_i + b_i + \sum_j A_{ij}\, s_j^{RA}(t)$$
The weights A are chosen so that the RA neurons have myotopic connections to the outputs and have push–pull control over each output.

PREMOTOR NETWORK PARAMETERS. For all HVC and RA neurons, $C_m$ = 1 µF/cm², $V_L$ = –60 mV, $V_E$ = 0 mV, and $V_I$ = –70 mV. The leak conductance is $g_L$ = 0.3 mS/cm² for HVC neurons and $g_L$ = 0.44 mS/cm² for RA neurons. The threshold membrane potential is $V_\theta$ = –50 mV, and $V_{reset}$ = –55 mV. The synaptic time constant is $\tau_s$ = 5 ms for HVC → RA, LMAN → RA, and RA → motor output connections. We also assume $\tau_m$ = 5 ms. In all simulations, the time grain is dt = 0.2 ms, so Eqs. 3 and 4 are discretized, and the delta function in Eq. 4 is replaced by its discrete-time counterpart.

There are $N_{HVC}$ HVC neurons, $N_{RA}$ RA neurons, and $N_{LMAN}$ LMAN neurons in our simulations. In all cases, $N_{LMAN} = N_{RA}$. The synaptic conductances in HVC are $g_{I,i}(t) = 0$ for all neurons at all times; $g_{E,i}(t) = 0$ for all neurons at most times in the motif, except for one brief excitatory pulse of duration 6 ms and magnitude 0.13 mS/cm² per neuron per motif. The onset times for the pulses for different HVC neurons are distributed evenly across the simulated motif, and this pattern of HVC inputs stays fixed throughout learning. In RA, the synaptic conductances are $g_{E,i}(t) = 0.0024\,[\sum_j W_{ij}\, s_j^{HVC}(t) + s_i^{LMAN}(t)]$ and $g_{I,i}(t) = (0.2/N_{RA}) \sum_j s_j^{RA}(t)$ for all i. With these numerical values, the average excitatory drive to each RA neuron is approximately eightfold stronger than the average inhibitory drive from global inhibition. However, results reported here do not depend on the existence of global inhibition in RA; we have performed simulations with no inhibition in RA, and the results remain qualitatively unchanged. The HVC → RA synaptic weights W are initialized randomly with uniform probability on the interval [0, 1.5] in all the simulations shown herein.

RA–output weight matrix A: half of all RA neurons, randomly chosen, project to $m_1$; the other half project to $m_2$. Of the set projecting to $m_1$, half the weights are of uniform strength $440/N_{RA}$ and half are $-440/N_{RA}$. Similarly, of the set projecting to $m_2$, half the weights are uniformly $640/N_{RA}$ and the other half are uniformly $-640/N_{RA}$. These values were chosen to be large enough so that the maximum range of the network outputs could span the amplitudes and pitches present in the recorded tutor song. The opposing signs of the weights A to the output pools are meant to represent bidirectional muscle control from some resting position (Suthers et al. 1999), rather than literal excitatory or inhibitory synapses. The strengths scale inversely with $N_{RA}$ to keep the mean output drive the same when $N_{RA}$ is varied. The baseline or "resting" values of the outputs in the absence of any drive from RA are $b_1$ = 60 and $b_2$ = 40.
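
To make the HVC parameters concrete, the following minimal sketch integrates a single model HVC neuron through its one excitatory pulse. It assumes the standard conductance-based leaky integrate-and-fire form for Eq. 3 and a forward-Euler step, neither of which is confirmed by the text beyond the parameter values; the function name `simulate_hvc_neuron` and the pulse onset time are hypothetical.

```python
def simulate_hvc_neuron(n_steps=150, dt=0.2, pulse_start=50, pulse_steps=30):
    """Forward-Euler integration of a conductance-based leaky
    integrate-and-fire neuron. Units: mV, ms, mS/cm^2, uF/cm^2."""
    c_m, g_leak = 1.0, 0.3              # HVC values quoted in the text
    v_leak, v_exc = -60.0, 0.0
    v_thresh, v_reset = -50.0, -55.0
    g_pulse = 0.13                      # excitatory pulse magnitude
    v = v_leak
    spike_times = []
    for step in range(n_steps):
        in_pulse = pulse_start <= step < pulse_start + pulse_steps
        g_exc = g_pulse if in_pulse else 0.0
        # Inhibitory conductance is zero for HVC neurons in the model.
        dv = (-g_leak * (v - v_leak) - g_exc * (v - v_exc)) / c_m
        v += dt * dv
        if v >= v_thresh:               # threshold crossing: spike, then reset
            spike_times.append((step + 1) * dt)
            v = v_reset
    return spike_times

# 30 steps of 0.2 ms give the 6-ms pulse; onset at 10 ms is arbitrary.
spikes = simulate_hvc_neuron()
```

Under these assumptions the 6-ms pulse drives the neuron above threshold a few times in quick succession, that is, a short burst, after which the voltage relaxes back toward the leak potential.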

All microscopic parameters such as individual neural leak conductances, time constants, and so forth are kept fixed, while scaling the size of the network and generating learning curves for the scaled network. To do this correctly, we have to scale some other macroscopic parameters together with network size. For example, if the RA layer is scaled up by a factor of 4 in size, then all weights from RA to the motor outputs are globally scaled downward by the same factor of 4 to keep the maximum summed drive to the output units, and thus the range of allowed vocal pitch and amplitude, fixed. Such scaling is described in both the preceding and subsequent text.

The total length of the simulated song motif is T = 300 ms in Fig. 4. In Fig. 5A, we study the effects of song length and HVC size on learning time. To make the comparison reasonable, we change song length and HVC size while keeping total HVC drive per song-moment constant, so we scale $N_{HVC}$ with song length. Both are reduced fourfold, so T = 75 ms and $N_{HVC}$ = 180; all other parameters are kept unchanged. In Fig. 5B, we study the effects of scaling RA size on learning time. Because of the result that song learning does not depend on song length and HVC size, and because it is currently infeasible to run simulations with larger networks, both curves are trained with the short-duration song (T = 75 ms) with small HVC ($N_{HVC}$ = 180). In one curve, $N_{RA}$ = 200; in the other, RA size is increased fourfold, to $N_{RA}$ = 800. $N_{LMAN}$ and the weights A rescale automatically as described earlier. To keep the total variance of the output motor pools fixed as $N_{RA}$ is scaled, we rescale the size of the experimental pulses from LMAN to be larger by a factor of $\sqrt{4} = 2$. The learning rate $\eta$ is empirically adjusted in both cases to give the fastest possible stable (monotonically nonincreasing on a coarse scale) learning curves for each case.
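
The scaling argument can be illustrated with a small calculation. This is a sketch under stated assumptions: the weight magnitudes 440/N_RA and 640/N_RA are taken from the parameter description, each RA neuron is assumed to contribute an independent perturbation of equal standard deviation, and the helper names are hypothetical. Under these assumptions, keeping the output variance fixed requires the perturbation amplitude to grow as the square root of the RA scaling factor.

```python
import math
import random

def build_output_weights(n_ra, seed=0):
    """Return a 2 x n_ra weight matrix A as two rows. Half of the RA
    neurons project to m1 and half to m2; within each half the weights
    are +/- 440/n_ra (m1) or +/- 640/n_ra (m2)."""
    rng = random.Random(seed)
    order = list(range(n_ra))
    rng.shuffle(order)                       # random assignment to pools
    weights = [[0.0] * n_ra, [0.0] * n_ra]
    half, quarter = n_ra // 2, n_ra // 4
    for rank, neuron in enumerate(order):
        pool = 0 if rank < half else 1
        magnitude = (440.0 if pool == 0 else 640.0) / n_ra
        within = rank if pool == 0 else rank - half
        sign = 1.0 if within < quarter else -1.0
        weights[pool][neuron] = sign * magnitude
    return weights

def max_positive_drive(row):
    """Summed weight of all neurons pushing the output upward."""
    return sum(w for w in row if w > 0)

def output_variance(row, sigma):
    """Variance of the summed output for independent unit perturbations."""
    return sum((w * sigma) ** 2 for w in row)

small, large = build_output_weights(200), build_output_weights(800)
scale = math.sqrt(800 / 200)   # perturbation rescaling for a fourfold larger RA
```

The maximum summed drive to each output is the same for both network sizes, and the output variance matches once the perturbation amplitude in the larger network is multiplied by `scale`.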


FIG. 4. Song spectrograms, song pressure waves, and the learning curve. A: song spectrograms and sound pressure waves. Left: template is a 300-ms recording of an actual zebra finch song. Middle: before learning, the song is a harmonic stack with randomly varying pitch and amplitude. Right: at 1,200 iterations, the model produces a reasonable copy of the template song. B: time course of song learning in the model song network. Song learning has neared its asymptotic value at around 1,000 iterations.

 

FIG. 6. Learning time and the reinforcement signal. Learning time and baseline error increase with temporally imprecise reinforcement and decrease when reinforcement has a small mean. In all preceding simulations, the reinforcement signal of 0's and 1's is delayed but temporally precise. If it is temporally broadened by 50 ms to mimic the effects of a temporally imprecise neuromodulatory signal, then learning suffers a significant slowdown even after the learning rate parameter is adjusted to find the fastest stable learning curve (top learning curve, black). Baseline error also increases. Temporally broadened reinforcement delivers less-specific information about song performance, further exacerbating, by a roughly equal amount, the temporal credit assignment problem already incurred in all previous simulations, from broadening of the delayed synaptic eligibilities. If the song evaluation of 0's and 1's (top box) is translated to –1's and 1's (bottom box) before being temporally broadened, the resulting reinforcement signal has a far smaller mean value. Bottom learning curve (blue) shows the dramatic effects of this simple mean-subtraction operation on the reinforcement signal: learning time even with temporally broadened reinforcement is fast, converging in about 1,000 iterations and the baseline error is in fact lower than past simulations where the reinforcement signal was temporally precise but consisted of 0's and 1's—compare with learning curves from Fig. 5, A and B.

 

FIG. A1. Sample neural activities in the model network. A: spectrogram of the model network's output song after 1,200 iterations of learning, as in Fig. 4. B: voltage traces of HVC neurons. C–F: voltage traces of RA neurons. Spikes are omitted for better resolution of the subthreshold voltages. Sharp downward resets in the voltages mark the spike times. B: voltage traces of 10 randomly selected HVC neurons; these HVC neural activities are enforced inputs to the song-learning network. C–F: voltage traces of neurons from each of the 4 RA neuron pools that project to the 2 output motor pools, after learning is complete. C: voltage traces of 6 different neurons in RA that project to the half of motor pool m1 responsible for driving song pitch in one direction. D: traces from RA neurons projecting to the other half of motor pool m1, which drives song pitch in the opposite direction. E: traces from RA neurons that drive motor pool m2, responsible for song amplitude, in one direction. F: traces from neurons driving motor pool m2, and song amplitude, in the opposite direction. Simulations shown here include recurrent inhibition in RA to mimic the functional connectivity of the bird song pathway (see METHODS), but similar results are obtained if such inhibition is removed entirely (not shown).

 
SOUND GENERATOR. Due to the 0.2-ms time discretization used to integrate the preceding network dynamics, the outputs $m_1(t)$ and $m_2(t)$ are generated at a resolution of 5 kHz only. We linearly interpolate these output trains to generate a pair of output command signals, $\tilde{m}_1(t)$ and $\tilde{m}_2(t)$, sampled at 44 kHz. $\tilde{m}_1(t)$ specifies the delta-pulse spacing (pitch period); for period to pulse conversion, a counter sums $1/\tilde{m}_1(t)$ until it crosses 1, which triggers a pulse of duration (1/44,000) s, and the counter is reset to 0. The height of each pulse is specified by the value of $\tilde{m}_2(t) \times 10^{-3}$ at the time of the pulse. We use a fixed 10-parameter linear predictive coding (lpc) filter derived from a concatenated sample of three arbitrarily selected zebra finch song recordings. The filter parameters are static and do not change over the course of the song or over the course of song learning. The real part of the filtered pulse train is the student song.
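
The period-to-pulse conversion can be sketched as follows. This is a minimal illustration, assuming the pitch-period signal is expressed in samples at the 44-kHz rate and that "crosses 1" means reaching or exceeding 1; the function name `pulse_train` is hypothetical, and interpolation and LPC filtering are not shown.

```python
def pulse_train(periods, heights):
    """Convert a per-sample pitch period (in samples) into a pulse train.
    A counter accumulates 1/period each sample; when it reaches 1, a
    one-sample pulse of the given height is emitted and the counter is
    reset to 0."""
    output = []
    counter = 0.0
    for period, height in zip(periods, heights):
        counter += 1.0 / period
        if counter >= 1.0:
            output.append(height)
            counter = 0.0
        else:
            output.append(0.0)
    return output

# Constant pitch period of 32 samples (about 1.4 kHz at 44 kHz sampling).
n_samples = 320
train = pulse_train([32.0] * n_samples, [1.0] * n_samples)
pulse_positions = [i for i, x in enumerate(train) if x != 0.0]
```

With a constant period the pulses are evenly spaced by that period; with a time-varying period the same counter produces a pulse train whose spacing tracks the commanded pitch.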

CRITIC. Pitch extraction: The songs are windowed into overlapping segments by multiplication with a 300-sample (6.8-ms) Hanning window that shifts by 10 samples (0.23 ms) at a time until the entire length of simulated song is covered. To obtain a value for the pitch from each windowed segment, we compute the autocorrelation of that segment; the pitch period is assigned to be the number of samples between the highest peak (at zero time lag) and the second-highest peak, so long as this value is between 12 and 80; if outside this range, the distance to the next-highest peak is computed, until a value is found that falls in the allowed range. The middle 10 samples of the current windowed segment are assigned this value of estimated pitch. This procedure is repeated for each segment. The beginning of the first windowed segment and the end of the last windowed segment of the song are assigned the same pitch values as their closest assigned neighbors. Amplitude extraction: The songs are windowed into 100-sample (2.3-ms) disjoint segments. All 100 samples of each disjoint segment are assigned an amplitude of 0.3 × max |song segment|. Let p(t), a(t) represent the student song pitch and amplitude, and let $\bar{p}(t)$, $\bar{a}(t)$ represent the tutor song pitch and amplitude.
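
The pitch-extraction step for a single windowed segment can be sketched as below. This is an illustrative pure-Python version, not the original implementation: the peak-picking rule (a strict rise followed by a non-rise) and the function names `hann_window` and `pitch_period` are assumptions, and the test signal is synthetic.

```python
import math

def hann_window(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def pitch_period(segment, min_lag=12, max_lag=80):
    """Estimate the pitch period (in samples) of one windowed segment from
    the peaks of its autocorrelation; returns None if no peak qualifies."""
    n = len(segment)
    acorr = [sum(segment[i] * segment[i + lag] for i in range(n - lag))
             for lag in range(n)]
    # Local maxima at nonzero lag, strongest first.
    peaks = [lag for lag in range(1, n - 1)
             if acorr[lag] > acorr[lag - 1] and acorr[lag] >= acorr[lag + 1]]
    peaks.sort(key=lambda lag: acorr[lag], reverse=True)
    for lag in peaks:
        if min_lag <= lag <= max_lag:   # accept the first peak in range
            return lag
    return None

# Synthetic test signal: unit pulses every 40 samples, 300-sample window.
window = hann_window(300)
signal = [1.0 if i % 40 == 0 else 0.0 for i in range(300)]
segment = [s * w for s, w in zip(signal, window)]
estimated = pitch_period(segment)
```

In a full analysis this function would be applied to each 300-sample segment as the window advances by 10 samples, with the result assigned to the middle 10 samples of the segment.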

The reinforcement signal R is computed by thresholding the delayed estimate of performance

$$R(t) = \theta[D(t) - \bar{D}(t)] \qquad (5)$$
In simulations where the evaluation signal is temporally broadened, it is low-pass filtered according to

$$\tau_R \frac{dR}{dt} = -R + \theta[D(t) - \bar{D}(t)] \qquad (6)$$
with $\tau_R$ = 50 ms.
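
The temporal broadening can be sketched as a first-order low-pass filter applied to the binary evaluation. This is a minimal illustration, assuming a forward-Euler update at the 0.2-ms time grain; the function name `broaden` and the step-input example are illustrative.

```python
def broaden(signal, tau_ms=50.0, dt_ms=0.2):
    """First-order low-pass filter, forward Euler:
    R[k+1] = R[k] + (dt / tau) * (x[k] - R[k])."""
    filtered = []
    r = 0.0
    for x in signal:
        r += (dt_ms / tau_ms) * (x - r)
        filtered.append(r)
    return filtered

# Response to a binary evaluation that switches on and stays on for 200 ms.
step_response = broaden([1.0] * 1000)
value_after_one_tau = step_response[249]   # 250 steps * 0.2 ms = 50 ms
```

After one time constant the filtered signal has risen to roughly 63% of its final value, which shows how a brief evaluation is smeared over tens of milliseconds.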

In the preceding expressions, $\theta[D(t) - \bar{D}(t)]$ is 0 when the performance D(t) is worse than a threshold $\bar{D}(t)$, and is 1 when it is better. To mimic delays inherent in the transformation of network activity into vocal output and auditory processing, we assume that D(t) is itself a delayed measure of network performance: at time t, it reflects the performance of the network outputs at $t - T_{delay}$. It is given by $D(t + T_{delay}) = -\{[\bar{p}(t) - p(t)]^2/c_p^2 + [\bar{a}(t) - a(t)]^2/c_a^2\}$ when the tutor song is nonsilent, and by $D(t + T_{delay}) = -2[\bar{a}(t) - a(t)]^2/c_a^2$ during silent intervals in the tutor song. The parameters $c_p$ and $c_a$ equalize the importance given by the critic to pitch and amplitude; $c_p$ = 60, $c_a = 80 \times 10^{-3}$, and $T_{delay}$ = 50 ms. The critic threshold $\bar{D}(t)$ adapts as the model birdsong network learns song, and is time-varying within the song. For each time $t_0$ in the motif, $\bar{D}(t_0)$ is obtained by linearly low-pass filtering $D(t_0)$ over the past five motif iterations. In all the simulations except Fig. 6, $\tau_R$ = 0 ms: in other words, the reinforcement is delayed while eligibility is correspondingly delayed and broadened (temporally imprecise), although the reinforcement signal is not itself broadened.
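
The critic's evaluation at one moment of song can be sketched as follows. This is a hedged illustration: the adaptive threshold is approximated here by a plain mean over the last five trials, which is one simple reading of "linearly low-pass filtering"; the function names and all numerical inputs are hypothetical, and the 50-ms delay is omitted.

```python
def performance(pitch, amp, tutor_pitch, tutor_amp, tutor_silent,
                c_p=60.0, c_a=80e-3):
    """Instantaneous performance D (higher is better, maximum 0)."""
    amp_term = (tutor_amp - amp) ** 2 / c_a ** 2
    if tutor_silent:
        return -2.0 * amp_term          # only amplitude matters in silence
    pitch_term = (tutor_pitch - pitch) ** 2 / c_p ** 2
    return -(pitch_term + amp_term)

def reinforcement(current, recent):
    """Binary reinforcement: 1 if the current performance beats the mean of
    the most recent (up to five) trials at the same moment of song."""
    if not recent:
        return 0
    last = recent[-5:]
    threshold = sum(last) / len(last)
    return 1 if current > threshold else 0

# Arbitrary illustrative values: tutor pitch 40, amplitudes matched at 0.5.
history = [performance(p, 0.5, 40.0, 0.5, False) for p in (55, 52, 49, 46, 43)]
better = performance(41.0, 0.5, 40.0, 0.5, False)
worse = performance(60.0, 0.5, 40.0, 0.5, False)
```

Because the threshold follows recent performance, a trial is reinforced only when it improves on the network's own recent average, so the criterion tightens as learning proceeds.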


    RESULTS
 
Our simulations used model networks of spiking neurons (Fig. 3). A layer of HVC neurons drives a layer of RA neurons. The RA neurons drive two output units, which represent the population activity of two motor neuron pools. The network controls a source–filter model of the avian vocal tract, which consists of a pulse train exciting a linear filter to yield simulated birdsong. The frequency and amplitude of the pulse train are controlled by the two output units of the network. The model network together with the model vocal tract constitute the "actor" of the actor–critic–experimenter schema.


FIG. 3. A spiking network model of birdsong learning. HVC and RA are layers of integrate-and-fire neurons (A), controlling the pitch and amplitude of a simple model of avian vocalization (B). LMAN sends empiric synapses to RA neurons. Acoustic output signal (C) is compared by a critic (D) to a template (E) that is a recording of an actual zebra finch "tutor" song. When the match is good, the critic signals the plastic synapses in the RA layer, which change their strengths according to Eq. 1 given in the text. F: activity of 2 typical model HVC neurons, driven by brief current pulses, shown for a segment of the simulated song motif. G: activity of 2 typical model RA neurons, receiving HVC and LMAN inputs, after 1,000 iterations of learning (more traces are available in APPENDIX A). H: amplitude of the tutor song (black) and activity in the motor output controlling model song amplitude (blue).

 
Figure 3 (right) depicts the activity of the network during a simulation. Dynamical variables include the membrane voltages of HVC and RA neurons (shown), as well as synaptic conductances (not shown). Spikes in these neurons are modeled using a leaky integrate-and-fire mechanism. Because the output units represent the population activity of motor neuron pools, rather than single neurons, they are nonspiking, carrying signals that vary smoothly in time.

The songs of the model network before and after learning are compared in Fig. 4A. The network learned to approximate the song template shown in Fig. 4A (left), which was a 300-ms segment of song recorded from a real zebra finch. Before learning, the simulated song looks nothing like the template. After learning, the simulated song is a good approximation to the template (sound files included in Supplemental Materials).9

Before learning, the strengths of the synapses from HVC to RA were initialized randomly. During the learning process, the strengths of these synapses were changed according to Eqs. 1 and 2. The spatiotemporal pattern of HVC neural activity was assumed to remain constant. Changes in the synapses from HVC to RA caused the formation of a "premotor map" that translates HVC spiking into a sequence of vocal commands appropriate for generating song.

Dynamics of learning

The start and end of the learning process are depicted in Fig. 4A. The process did not occur suddenly, but rather happened incrementally. The network generated simulated songs for thousands of trials. During each trial, it received reinforcement signals from the critic, which compared the simulated song with the song template. Whenever the match between the two was good, the critic sent a positive reinforcement signal. This happened many times per song because the critic continuously evaluated the song throughout each trial. Because the threshold for good performance was set by the average over recent trials, the threshold became higher as performance improved.

The "learning curve" of Fig. 4B is a graph of song error versus the number of trials. This error is the mismatch in pitch and amplitude between the simulated song and the real song. It starts high and then converges to a low value within about 2,000 iterations. Is this convergence time fast or slow? It has been estimated that a juvenile zebra finch may practice its song up to 100,000 times over the course of learning (Johnson et al. 2002). Therefore the model learns relatively quickly, compared with a real zebra finch. As will be seen later, the learning time of the model may change if the properties of the reinforcement signal are changed.

After convergence there is a residual error that does not vanish. The residual could arise from several sources. First, the network may have converged to the vicinity of a local minimum of the error, rather than a global minimum. Second, even a global minimum might have nonzero error. Third, even if the network converged to a global minimum, such convergence would be probabilistic. As long as the synaptic strengths are governed by the learning rules, they would continue to fluctuate around their optimal values. Fourth, even if the synaptic strengths were frozen at their optimal values, the simulated song would fluctuate randomly because the network continues to be perturbed by random synaptic input from LMAN from trial to trial.

RA size

If many (N) neurons collectively drive the output of a network, the share of any one neuron's activity in the total output and reinforcement is small (~1/N). If all neurons fluctuate independently and simultaneously, any one neuron's contribution to the overall output fluctuations is swamped by all other neural contributions. A neuron would have to correlate its own activity with the output for many trials to determine the sign of its effect on the output. Therefore when learning is based on the correlation of individual neural fluctuations with a global reinforcement signal in large networks, learning may be expected to be quite slow.

In the simulations of Fig. 4, our model learned song substantially faster than a real zebra finch. However, the model network was composed of just 720 HVC neurons and 200 RA neurons. The HVC and RA of a real zebra finch are estimated to contain about 20,000 HVC neurons and 8,000 RA neurons, or 10–100 times more neurons and 500–5,000 times more synapses than in the model. Each RA neuron receives parallel, independent, time-varying perturbations from LMAN. RA neural activities sum to drive the motor pools; thus correlations between conductance fluctuations in a single RA neuron with the reinforcement signal diminish with increasing RA size. What is the learning time in a realistically large birdsong network? Unfortunately, numerical simulations of a model network of this size are currently impractical. Instead, we have taken the approach of varying the size of HVC and RA in our model to empirically determine how learning time scales with network size. This allows us to extrapolate learning time for network sizes larger than we can simulate.
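
The size comparison between the model and the bird can be checked with a short calculation. This sketch assumes all-to-all HVC to RA connectivity when counting synapses; the dictionary layout and the function name `n_synapses` are illustrative.

```python
model = {"n_hvc": 720, "n_ra": 200}
bird = {"n_hvc": 20_000, "n_ra": 8_000}

def n_synapses(net):
    """Number of HVC -> RA weights, assuming all-to-all connectivity."""
    return net["n_hvc"] * net["n_ra"]

hvc_ratio = bird["n_hvc"] / model["n_hvc"]              # roughly 28
ra_ratio = bird["n_ra"] / model["n_ra"]                 # 40
synapse_ratio = n_synapses(bird) / n_synapses(model)    # roughly 1,100
```

These ratios fall inside the quoted ranges of 10–100 times more neurons and 500–5,000 times more synapses.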

We performed numerical simulations to investigate the dependence of learning time on RA size. Figure 5B shows that the learning curve changes little even if RA size is increased by a factor of 4.


FIG. 5. Learning time does not scale with network size. A: learning time is independent of HVC size and the length of trained song. Learning curves for the model with a long song and large HVC (black), and 4-fold shorter song and smaller HVC (blue). Inset 1: tutor song spectrogram. First, the model song network is trained on 300 ms of tutor song, with 720 RA-projecting neurons in HVC. Data are the same as in Fig. 4. Next, the HVC size of the model network and the length of the template song are reduced 4-fold, to 180 neurons and 75 ms, whereas all other parameters of the network are unchanged. HVC size and song length are scaled together to keep the summed HVC drive at each moment of song fixed. Inset 2: to better compare the time course of learning, both curves are shifted to zero baseline error, then scaled so their initial errors match. Time course of learning is the same for these 2 cases, despite the 4-fold change in trained song length and HVC size. This happens because moments of song separated by ≳50 ms are produced and learned in parallel by independent sets of HVC → RA synapses. B: learning time does not scale as a function of RA size. Learning curves for 200 (blue) and 800 (red) RA neurons. Larger network learns at roughly the same rate as the smaller because its extra neural degrees of freedom are redundant for the fixed task, and therefore do not slow down learning. Shorter 75-ms song template fragment from A was used for both curves.

 
This result in the full spiking network is consistent with analytical and numerical results in a reduced model of the birdsong network (APPENDIX B). We find that in a network of linear neurons, if reinforcement is computed relative to a baseline and if RA–output connections are myotopic (each RA neuron projects to just one motor pool), then learning time is independent of the size of the RA layer and thus also does not increase with the number of independent perturbations injected into the system.

These results may be surprising, when compared with theoretical studies indicating that the learning time for a feedforward network can scale linearly with its size, if trained by a reinforcement learning algorithm (Cauwenberghs 1993; Werfel et al. 2003).

Why is it that learning does not slow down with increasing RA size? In the birdsong network, individual RA–output (and thus RA–reinforcement) correlations do diminish with RA size. If the learning problem depended on each HVC–RA synapse attaining a specific desired value, learning would indeed have slowed down considerably. However, what matters for song production is the summed output from several RA neurons to each motor pool, not the individual contribution of each RA neuron. Consequently there are many configurations of synaptic strengths that will lead to good performance. In other words, the model network is a degenerate or redundant representation. Because it is so large, it has more neurons than necessary to perform the task. Thus although there are more synaptic strengths to learn in a large network, each can be learned more sloppily. These two effects compensate for each other, so that learning time is unchanged.

HVC size and song duration

In Fig. 5A, the dependence of learning time on HVC size is addressed. In our model, HVC size is equivalent to song duration. This is because each HVC neuron bursts only once during song [in accord with experimental findings (Hahnloser et al. 2002)], and a fixed number of HVC neurons is assumed to be active at any given moment. Therefore we have scaled song duration in tandem with HVC size.

Learning curves for two model networks are shown. The first network has 720 HVC neurons and is trained on 300 ms of song. The second network has 180 HVC neurons and is trained on 75 ms of song. The learning curves look about the same. This suggests that learning time is independent of HVC size/song duration.

What is the reason for this independence? Because each HVC neuron bursts only once during song, moments of song separated by ≥10 ms are driven by completely separate sets of HVC neurons. Further, the critic evaluates each moment of song, delivering its evaluation continuously in time. This means that the learning of each moment of song occurs independently and in parallel. As a result, when measured in number of trials, learning time has no dependence on HVC size/song duration (footnote 10). If the critic delivered a single evaluation for the whole song rather than separate evaluations for each moment (footnote 11), then we expect that learning time would become dependent on song length. However, we find it plausible that the critic compares song output with the template continuously throughout time.
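The contrast between a continuous critic and a single whole-song evaluation can be sketched in a toy setting. This is not the spiking network or the APPENDIX B model. It assumes a unary code collapsed to one weight per time step, an exact baseline, and a whole-song evaluation taken as the mean of the per-step rewards, which amounts to rescaling the learning rate by the inverse of the song length. All names and parameter values are hypothetical.

```python
import random


def trials_to_learn(n_steps, online, seed, eta=5.0, sigma=0.1,
                    tol=0.01, max_trials=20_000):
    """Trials until the mean squared error over the song drops below tol."""
    rng = random.Random(seed)
    target = [1.0] * n_steps
    w = [0.0] * n_steps            # one weight per time step (unary HVC code)
    for trial in range(max_trials):
        err = [wi - d for wi, d in zip(w, target)]
        if sum(e * e for e in err) / n_steps < tol:
            return trial
        xi = [rng.gauss(0.0, sigma) for _ in range(n_steps)]
        # Baseline-subtracted reward at each moment of song.
        dr = [-(e + x) ** 2 + e ** 2 for e, x in zip(err, xi)]
        if online:
            signal = dr            # critic evaluates every moment separately
        else:
            mean_dr = sum(dr) / n_steps
            signal = [mean_dr] * n_steps   # one evaluation for the whole song
        w = [wi + eta * s * x for wi, s, x in zip(w, signal, xi)]
    return max_trials


def mean_trials(n_steps, online, runs=10):
    """Average learning time over several random seeds."""
    return sum(trials_to_learn(n_steps, online, seed)
               for seed in range(runs)) / runs


if __name__ == "__main__":
    for n_steps in (5, 20):
        print(n_steps, mean_trials(n_steps, True), mean_trials(n_steps, False))
```

In this sketch the on-line critic gives each time step its own independent learning problem, so the trial count should barely change with song length. The whole-song critic mixes the perturbations of all time steps into one scalar, so the trial count should grow with song length.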

Analytical and numerical results in a reduced model of the birdsong network (APPENDIX B) are consistent with the full spiking model network results. In the reduced model as in the spiking network, increasing song length/HVC size has no effect on learning time. This is true only if reinforcement is delivered on-line, and if HVC activity is unary, with each neuron firing exactly once per motif. We find that if the encoding of different time steps in HVC is statistically orthogonal but not unary, learning time will grow linearly with the number of HVC neurons.

Number of muscle groups or output degrees of freedom

It is difficult to systematically vary the complexity or dimensionality of the model sound generator, which uses two network-driven control variables (pitch and amplitude) to produce output sounds that can resemble a recorded finch song. We would encounter the same difficulty if the sound generator were constructed from physics-based parameterized models of the songbird syrinx (Elemans et al. 2004; Fletcher 1988; Titze 1988). Instead, in a reduced model of the song network (APPENDIX B), we can systematically vary the number of output units that independently contribute to performance and thus to the reinforcement, and analytically compute the dependence of learning time on the number of output degrees of freedom.
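As a rough illustration of how such a dependence can be computed, the following is a simplified sketch and not the derivation of APPENDIX B. It assumes M outputs with errors e_k, each receiving its own independent Gaussian perturbation \xi_k of variance \sigma^2, a single scalar reward equal to minus the summed squared error, an exact baseline R_0, and perturbations small compared with the errors. The symbols \eta (learning rate) and E (total squared error) are introduced here for illustration only.

```latex
\begin{align}
R &= -\sum_{k=1}^{M} (e_k + \xi_k)^2, \qquad R_0 = -\sum_{k=1}^{M} e_k^2, \\
\Delta e_j &= \eta\,(R - R_0)\,\xi_j \;\approx\; -2\eta\,\xi_j \sum_{k=1}^{M} e_k \xi_k
  \qquad (\sigma \ll |e_k|), \\
\langle \Delta e_j \rangle &= -2\eta\sigma^2 e_j, \\
\langle (\Delta e_j)^2 \rangle &\approx 4\eta^2\sigma^4 \bigl( E + 2 e_j^2 \bigr),
  \qquad E \equiv \sum_{k=1}^{M} e_k^2, \\
\langle \Delta E \rangle &= \sum_{j=1}^{M} \Bigl( 2 e_j \langle \Delta e_j \rangle
  + \langle (\Delta e_j)^2 \rangle \Bigr)
  \;\approx\; -4\eta\sigma^2 E \,\bigl[ 1 - \eta\sigma^2 (M + 2) \bigr].
\end{align}
```

Under these assumptions the error shrinks on average only if \eta < 1/[\sigma^2 (M+2)]. The fastest decay, at \eta = 1/[2\sigma^2 (M+2)], removes a fraction 1/(M+2) of E per trial. The number of trials needed to reduce the error by a fixed factor therefore grows roughly in proportion to M.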

In this complementary approach (APPENDIX B), we find that learning time grows linearly with the number of outputs that must be independently controlled and that independently contribute to the reinforcement signal. In contrast with scaling of the RA layer, doubling or quadrupling the number of independent outputs affects network size only slightly because the total number of outputs is small compared with the total number of neurons in the song network. However, scaling the number of outputs