1Department of Biology, University of Texas at San Antonio, San Antonio, Texas; and 2Department of Physiology and Biophysics, University of Washington, Seattle, Washington
Submitted 2 April 2007; accepted in final form 9 October 2007
ABSTRACT |
|---|
INTRODUCTION |
|---|
The general problem outlined above is not addressed by most models of STDP-based learning, and it is not obvious how STDP as it is currently understood could be directly responsible for more general forms of learning. One flexible approach to solving this problem is reinforcement learning, where the solution space is often explored stochastically and learning is driven by a simple scalar evaluation of performance. Models of reinforcement learning typically are abstract algorithms not based on explicit neural modeling (Sutton and Barto 1998), although that is beginning to change (Izhikevich 2007; Pfister et al. 2006; Seung 2003; Xie and Seung 2004). Here we present an implementation of reinforcement learning by a biologically plausible neural network, using a simple and novel modification of the STDP rule. In the most basic version of the approach pursued here, spiking patterns that are relatively similar to some "target pattern" of postsynaptic spikes are accompanied by the normal operation of the STDP rule, strengthening the synapses that contributed to the generation of that pattern, while STDP-driven synaptic changes are suppressed after spike trains that are dissimilar to the target pattern. The stochastic exploration of solution space is driven by variations in presynaptic activity. We evaluate this basic idea in a simple yet reasonably biologically plausible feedforward network and identify the factors that are needed to make it work.
METHODS |
|---|
We consider a two-layer feedforward network (Fig. 1B), where the activity in the input layer is simply modeled as independent inhomogeneous Poisson processes. This input layer projects in an all-to-all pattern to an output layer of explicitly modeled neurons. We sought a model for these output neurons that is reasonably realistic yet "generic," lacking characteristic physiological properties that vary dramatically depending on cell type. We chose a single compartment conductance-based model whose membrane voltage is governed by

$$C\frac{dV}{dt} = \frac{E_{rest} - V}{R} + g_e\,(E_{syn} - V) + g_{AHP}\,(E_{AHP} - V)$$

where ge is the excitatory synaptic conductance and gAHP is an afterhyperpolarization (AHP) conductance. A spike was fired when V crossed the threshold T, and each spike incremented gAHP by ḡAHP. A spike fired by presynaptic unit i triggered an increment gij in the ge of each postsynaptic neuron j, where gij is the strength of the synapse from input unit i to output neuron j. ge and gAHP decayed exponentially with time constants τsyn and τAHP, respectively. Spike refractoriness was ensured by the AHP conductance, which kept interspike intervals above 5 ms over the entire range of synaptic strengths we considered. The cellular parameters used in our simulations were R = 100 MΩ, Erest = –70 mV, Esyn = 0 mV, EAHP = –90 mV, ḡAHP = 10 nS, τsyn = 3 ms, τAHP = 10 ms, and T = –45 mV, with the capacitance adjusted to yield a membrane time constant of 50 ms. This produces a model neuron with nearly linear subthreshold responses and a roughly linear spiking response to injected current and active synaptic conductance with no spike rate accommodation (Fig. 1, C and D).
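The output-neuron model can be illustrated with a minimal forward-Euler sketch. It uses the cellular parameters quoted above, but several details are assumptions made only for illustration: the membrane equation is taken to be a standard leak-plus-conductances form, a spike is registered on an upward threshold crossing with no voltage reset, the input is a constant excitatory conductance rather than Poisson-driven synapses, and the function name `simulate` is hypothetical.

```python
# Units: mV, ms, nS, pF (nS * mV = pA, and pA / pF = mV/ms).
R_MOHM = 100.0
G_LEAK = 1000.0 / R_MOHM            # 1 / (100 MOhm) = 10 nS
C_PF = 50.0 * G_LEAK                # membrane time constant 50 ms -> 500 pF
E_REST, E_SYN, E_AHP = -70.0, 0.0, -90.0
G_AHP_STEP, TAU_AHP = 10.0, 10.0    # AHP increment (nS) and decay (ms)
THRESHOLD = -45.0


def simulate(g_e_ns, duration_ms=1000.0, dt=0.1):
    """Return spike times (ms) for a constant excitatory conductance g_e_ns."""
    v, g_ahp, spikes = E_REST, 0.0, []
    for step in range(int(round(duration_ms / dt))):
        current = (G_LEAK * (E_REST - v)
                   + g_e_ns * (E_SYN - v)
                   + g_ahp * (E_AHP - v))
        v_new = v + dt * current / C_PF
        g_ahp -= dt * g_ahp / TAU_AHP   # exponential decay of the AHP
        # Assumed spike rule: upward threshold crossing, no voltage reset.
        if v < THRESHOLD <= v_new:
            spikes.append((step + 1) * dt)
            g_ahp += G_AHP_STEP
        v = v_new
    return spikes


if __name__ == "__main__":
    spikes = simulate(10.0)
    intervals = [b - a for a, b in zip(spikes, spikes[1:])]
    print(len(spikes), "spikes; shortest interval (ms):", min(intervals))
```

In this sketch refractoriness comes only from the AHP conductance, which is the mechanism the text describes; whether the original simulations also reset the voltage is not stated here.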
For the first part of this study, networks consisted of 1,000 input units and a single output neuron. Initial synaptic strengths were chosen from a gaussian distribution with a mean of 0.32 nS and an SD of 0.05 nS. Simulations were divided into epochs ("trials") lasting 1 s. Throughout this paper, time t always denotes the time relative to the onset of the current trial, and thus 0 ≤ t ≤ 1,000 ms. One half of the input units were governed by homogeneous Poisson processes at a rate of 5 Hz, supplying random "background" synaptic input. The remaining input units followed Poisson processes whose rate parameter varied over the course of a trial. For the first 100 ms and final 100 ms of a trial, these units remained largely silent, but for the remainder of the trial, their time-dependent rate parameter consisted of gaussian peaks that placed their spikes at particular times within a trial. The temporal precision of the spiking was controlled by the width of the peaks, set to an SD of 10 ms. Within a given simulation, these rate functions did not change across trials. Thus these units were effectively following the same "script" on each trial, with a different script for each unit. We used two methods for creating rate functions in our simulations. The first method, yielding what we call a "regular" script, generated a random 800-ms spike train (homogeneous Poisson process at 5 Hz) and placed a gaussian centered on each spike in the train. The height of each gaussian was adjusted to give one spike per peak on average, although the actual number of spikes did of course vary from trial to trial. Note that gaussians with peaks centered near the 100- or 900-ms boundaries would give these units a small chance to fire outside of those bounds. An example of a regular script for an input unit is shown in Fig. 2A. In the second method, each input unit was randomly assigned a single "burst time" somewhere between 100 and 900 ms into the trial (a different time for each unit), and a 10-ms-wide gaussian was placed at that time. The height of this gaussian was adjusted to yield five spikes on average. Thus the second method yields scripts (called "1-burst" scripts) that cause each input unit to fire a single high-frequency burst of spikes during each trial.
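A "regular" script of the kind described above might be generated as in the following sketch. The function names (`regular_script`, `rate`, `sample_spikes`) are hypothetical, and the 1-ms Bernoulli sampling is an assumed approximation of an inhomogeneous Poisson process, not necessarily the method used in the simulations.

```python
import math
import random


def regular_script(rng, rate_hz=5.0):
    """Peak centres (ms): a homogeneous Poisson train laid over 100-900 ms."""
    centres, t = [], 100.0
    while True:
        t += rng.expovariate(rate_hz / 1000.0)  # exponential intervals in ms
        if t > 900.0:
            return centres
        centres.append(t)


def rate(t, centres, sigma=10.0):
    """Rate in spikes/ms; each gaussian peak integrates to one spike."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((t - c) / sigma) ** 2) for c in centres)


def sample_spikes(rng, centres, sigma=10.0):
    """One trial of spikes on a 1-ms grid (Bernoulli approximation)."""
    return [t for t in range(1000) if rng.random() < rate(t, centres, sigma)]


if __name__ == "__main__":
    rng = random.Random(0)
    script = regular_script(rng)          # fixed across trials
    for trial in range(3):                # spikes vary from trial to trial
        print(trial, sample_spikes(rng, script))
```

The script (the list of peak centres) is drawn once and reused, while the spikes are redrawn on every trial, which mirrors the distinction between the fixed rate function and the trial-to-trial variability it produces.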
For modeling synaptic plasticity, we used the STDP rule described by Froemke and Dan (2002), based on recordings of layer 2/3 pyramidal cells in rat visual cortex. We chose this particular implementation of STDP because of its simplicity, its examination of the effects of whole spike trains rather than just isolated spike pairs, and because it is appropriate for plasticity at corticocortical synapses that could plausibly be involved in the kind of learning that we are interested in. However, studies of STDP at other synapses in the isocortex and hippocampus have revealed substantial differences in the factors controlling the induction of synaptic plasticity. For example, induction of LTP at synapses connecting pairs of layer 5 pyramidal neurons requires higher frequency pairing (>10 Hz) in both visual (Sjöström et al. 2001) and somatosensory (Markram et al. 1997) cortices of rats, whereas Froemke and Dan (2002) could induce LTP with pairing at 0.2 Hz. In the hippocampus, induction of LTP requires both higher frequency pairing (≥5 Hz) and burst firing in the postsynaptic cell (Magee and Johnston 1997; Pike et al. 1999). Because we could not choose one STDP model that incorporates and consolidates these disparate findings, we simply selected one of them, that of Froemke and Dan (2002), with the understanding that our results may not apply to synapses where this specific formulation is not accurate. On the other hand, our model does require that STDP be modulated or gated, conditions that of course are not a part of the Froemke and Dan (2002) formulation. As we argue in the DISCUSSION, additional induction requirements, such as the need for postsynaptic burst firing, may supply the mechanism for this modulation.
The basic rule for STDP-based changes (Fig. 1A) is given by

$$F(\Delta t) = \begin{cases} A_+\, e^{-\Delta t/\tau_+} & \text{if } \Delta t > 0 \\ -A_-\, e^{\Delta t/\tau_-} & \text{if } \Delta t < 0 \end{cases} \tag{1}$$

where Δt is the timing of the postsynaptic spike relative to the presynaptic spike. If the pre- and postsynaptic spikes are perfectly synchronous (Δt = 0), we assume that the synaptic strength does not change. To this basic formulation, Froemke and Dan (2002) added a spike efficacy $\varepsilon = 1 - e^{-t_{spike}/\tau^{s}}$, where tspike is the time to the preceding spike. The final change in synaptic strength between units i and j induced by a single spike pair is given by $\Delta g_{ij} = \varepsilon_i^{pre}\,\varepsilon_j^{post}\,F(\Delta t_{ij})$, where $\varepsilon^{pre}$ and $\varepsilon^{post}$ are the separate pre- and postsynaptic efficacies, governed by distinct time constants $\tau^{s}_{pre}$ and $\tau^{s}_{post}$. Following Froemke and Dan (2002), we used τ+ = 13 ms, τ– = 35 ms, $\tau^{s}_{pre}$ = 28 ms, and $\tau^{s}_{post}$ = 88 ms. We did not directly adopt their values for A+ and A–, because they are expressed as the percentage change in synaptic strength after 60–80 pairings. It is not clear if synaptic changes generally scale in that way (where stronger synapses experience larger absolute changes in conductance), and because most models of STDP to date have expressed the changes in absolute terms, we continue that practice. Thus our A+ and A– have units of conductance, rather than dimensionless fractional changes as in Froemke and Dan (2002).
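The pair-based rule with spike efficacies can be written compactly. The sketch below assumes the usual exponential STDP window and the efficacy form 1 - exp(-t/tau_s) of Froemke and Dan (2002). The time constants are those quoted in the text; the amplitudes `A_PLUS` and `A_MINUS` are hypothetical placeholders, because no numerical values are given here, and all function names are illustrative.

```python
import math

TAU_PLUS, TAU_MINUS = 13.0, 35.0      # ms, LTP and LTD windows
TAU_S_PRE, TAU_S_POST = 28.0, 88.0    # ms, efficacy recovery
A_PLUS, A_MINUS = 0.01, 0.005         # nS, hypothetical amplitudes


def stdp_window(dt):
    """Unmodulated change F(dt), with dt = t_post - t_pre in ms."""
    if dt > 0:
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    if dt < 0:
        return -A_MINUS * math.exp(dt / TAU_MINUS)
    return 0.0  # perfectly synchronous spikes produce no change


def efficacy(t_since_prev, tau_s):
    """Efficacy of a spike; None means there was no preceding spike."""
    if t_since_prev is None:
        return 1.0
    return 1.0 - math.exp(-t_since_prev / tau_s)


def pair_change(dt, pre_isi=None, post_isi=None):
    """Change in g_ij (nS) induced by one pre/post spike pair."""
    return (efficacy(pre_isi, TAU_S_PRE)
            * efficacy(post_isi, TAU_S_POST)
            * stdp_window(dt))


if __name__ == "__main__":
    print(pair_change(10.0))               # isolated pre-before-post pair
    print(pair_change(10.0, pre_isi=5.0))  # pre spike suppressed by a recent spike
```

A spike that closely follows another spike in the same neuron has low efficacy, so it contributes less to plasticity than an isolated spike would.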
Although we do not implement synaptic changes as a percentage of current strength, we do consider the possibility that the size of changes scales in some way with the strength of the synapse. We impose maximum and minimum strengths on the synapses, gmax and gmin, equal to 10 times (3.2 nS) and 1/10th (0.032 nS) of the average initial synaptic strength, respectively. Under our "additive" model, the gij are simply clipped to their maximum/minimum value if the application of STDP would push them outside of that range. However, some STDP modeling studies use a rescaling in which the size of Δgij is reduced as gij approaches its limits (Gütig et al. 2003; Rubin et al. 2001; van Rossum et al. 2000). This kind of rescaling is sometimes called a "multiplicative rule" (Gütig et al. 2003; Rubin et al. 2001), although this does not correspond to the additive/multiplicative terminology of Froemke and Dan (2002), and the multiplicative rule given here differs from the one described in Kepecs et al. (2002). We study the behavior of our model under both the additive rule described above and a multiplicative rule that rescales potentiating changes by the factor (gmax – gij)/(gmax – gmin) and depressing changes by the factor (gij – gmin)/(gmax – gmin), which is simply a generalization of the method of Rubin et al. (2001) to cases in which gmax ≠ 1 and gmin ≠ 0.
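The two ways of keeping synaptic strengths within bounds can be contrasted in a few lines. This is a sketch with hypothetical function names; the final clip in the multiplicative version is an added safeguard for very large changes and is not described in the text.

```python
G_MIN, G_MAX = 0.032, 3.2  # nS, bounds quoted in the text


def clip(g):
    return min(G_MAX, max(G_MIN, g))


def apply_additive(g, dg):
    """Additive rule: apply the full change, then clip to the bounds."""
    return clip(g + dg)


def apply_multiplicative(g, dg):
    """Multiplicative rule: shrink the change as g nears the relevant bound."""
    if dg > 0:
        scale = (G_MAX - g) / (G_MAX - G_MIN)   # potentiation
    else:
        scale = (g - G_MIN) / (G_MAX - G_MIN)   # depression
    return clip(g + dg * scale)  # clip is a safeguard (assumption)


if __name__ == "__main__":
    for g in (0.032, 0.32, 3.0):
        print(g, apply_additive(g, 0.1), apply_multiplicative(g, 0.1))
```

Under the multiplicative rule a synapse sitting at gmax cannot be potentiated further and one at gmin cannot be depressed further, whereas the additive rule applies full-sized changes right up to the bounds.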
In most of our simulations, we incorporated activity-dependent scaling of synaptic strength, modeling the phenomenon reported in pyramidal neurons cultured from rat visual cortex (Turrigiano et al. 1998). Our model of activity-dependent scaling was based on the approach of van Rossum et al. (2000), with some modifications. Postsynaptic activity in each output neuron j was tracked using a variable aj(t) obeying the equation
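
$$\frac{da_j}{dt} = -\frac{a_j}{\tau_a} + \sum_k \delta\!\left(t - t_j^{k}\right)$$

where $t_j^{k}$ is the time of the kth spike fired by output neuron j.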
![]() |
The Δgij were calculated at the end of each trial. Because we were not trying to maintain activity at one specific level agoal, we did not need to use the "integral controller" correction used by van Rossum et al. (2000).
![]() |
We used τa = 10 s, amax = 100 (because ⟨a⟩ = τa × rate, this sets the maximum firing rate to roughly 10 Hz), and β = 10⁻³ (synaptic scaling) or β = 10⁻² mV (excitability changes). In most simulations, amin = 9.5 (a minimum firing rate of just under 1 Hz). However, for simulations in which output neurons were trained to produce specific spike trains containing more than one spike (i.e., the results shown in Fig. 5), amin = 9.5 × Nspikes, where Nspikes is the number of spikes in the target spike train.
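The relation between the activity variable and the firing rate can be checked numerically. The sketch below assumes that a jumps by 1 at each spike and decays exponentially with time constant τa, a form chosen because it reproduces the stated relation that the mean of a equals τa times the rate; the function name and the use of perfectly regular spiking are illustrative simplifications.

```python
def simulate_activity(rate_hz, duration_s, tau_a=10.0, dt=0.001):
    """Mean of a(t) over the second half of a regular spike train."""
    steps = int(round(duration_s / dt))
    period = int(round(1.0 / (rate_hz * dt)))  # steps between spikes
    a, total, count = 0.0, 0.0, 0
    for k in range(steps):
        a -= dt * a / tau_a        # exponential decay (forward Euler)
        if k % period == 0:
            a += 1.0               # unit increment per spike
        if k >= steps // 2:        # skip the initial transient
            total += a
            count += 1
    return total / count


if __name__ == "__main__":
    for rate in (1.0, 5.0, 10.0):
        print(rate, "Hz ->", round(simulate_activity(rate, 200.0), 2))
```

With τa = 10 s, a bound of amax = 100 therefore corresponds to about 10 Hz and amin = 9.5 to about 0.95 Hz, matching the figures quoted in the text.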
Initially, reinforcement learning was implemented by choosing "target" spike trains for each output unit (representing the goal of the training), calculating the difference between those target spike trains and the network's actual output, and transforming that difference into a reward signal that modulated synaptic plasticity. The difference Δj(t) between the actual and target spike trains for neuron j as a function of time t in the current trial was determined by convolving the spike trains (represented by a temporal series of 1s where spikes occur and 0s otherwise) with a gaussian of unit height and SD σ (typically 10 ms) and subtracting one of the smoothed spike trains from the other. The reward signal Rwd(Δj) is

$$\mathrm{Rwd}(t) = \left\langle e^{-\alpha\,|\Delta_j(t)|} \right\rangle \tag{2}$$

which maps differences |Δj| in [0, ∞) into the interval (0, 1], with Rwd(0) = 1. The angled brackets denote an average over all output neurons j. We used α = 3 for all simulations. One should note that if the interspike intervals in the actual output and target output are substantially greater than the smoothing parameter σ, as they were for most of our simulations, the maximum value |Δj| can take is ≈1, and thus the minimum possible reward is $e^{-\alpha}$. If one wanted the minimum reward to be 0, one could redefine the reward as $\mathrm{Rwd} = \langle e^{-\alpha|\Delta_j(t)|} \rangle - e^{-\alpha}$, but because learning performance is not qualitatively improved by this definition, we did not adopt it for the simulations presented in this paper. Initially, reward-dependent modulation of STDP was implemented by setting the change in synaptic strength to the product of the reward signal and the change that would be produced by unmodulated STDP. Hence, for synaptic changes triggered by a postsynaptic spike in output unit j occurring at time t, $\Delta g_{ij} = \mathrm{Rwd}(t)\,\varepsilon_i^{pre}\,\varepsilon_j^{post}\,F(\Delta t_{ij})$.
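The reward computation can be sketched as follows, assuming the exponential form of the reward with a steepness parameter of 3 (called `alpha` here) and unit-height gaussian smoothing with an SD of 10 ms. The function names and the 1-ms time grid are choices made for this illustration.

```python
import math


def smooth(spike_times, t_grid, sigma=10.0):
    """Spike train convolved with a unit-height gaussian of SD sigma (ms)."""
    return [sum(math.exp(-0.5 * ((t - s) / sigma) ** 2) for s in spike_times)
            for t in t_grid]


def reward(actual, target, t_grid, alpha=3.0, sigma=10.0):
    """Reward as a function of time, averaged over output neurons.

    actual, target: one list of spike times (ms) per output neuron.
    """
    n = len(actual)
    rwd = [0.0] * len(t_grid)
    for spikes, goal in zip(actual, target):
        s_act = smooth(spikes, t_grid, sigma)
        s_goal = smooth(goal, t_grid, sigma)
        for k in range(len(t_grid)):
            rwd[k] += math.exp(-alpha * abs(s_act[k] - s_goal[k])) / n
    return rwd


if __name__ == "__main__":
    grid = list(range(1001))
    r = reward([[480.0]], [[500.0]], grid)
    print("reward at 400 ms:", r[400], " at 500 ms:", round(r[500], 3))
```

Because only the absolute difference enters, the reward says how far the output is from the target at each moment but not whether firing should be raised or lowered.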
However, in most simulations we implemented an adaptation of the temporal difference algorithm for reinforcement learning where adaptive changes are driven by the difference δR between the reward received and the reward expected (Sutton and Barto 1998). In the most general implementation of this algorithm, the system's task is to adopt a policy that leads it to choose actions that will maximize its total future reward given the current state of its environment, s(n), where n is the trial number. It uses a "value function" V[s(n)] to estimate the future reward given the current environment s(n): $V[s(n)] = \sum_{k=0}^{\infty} \gamma^{k}\, E[\mathrm{Rwd}(n+k)]$, where E[Rwd(n)] is the expected reward resulting from the action triggered by the current state s(n) under the system's current policy, and γ is a "discount factor" (0 ≤ γ ≤ 1) that assigns smaller weights to expected rewards further in the future. The "temporal difference error" used to improve the current policy is the sum of the actual reward and the updated expected future reward resulting from the chosen action minus the total future reward expected before that action was taken: $\delta R(n) = \mathrm{Rwd}(n) + \gamma V[s(n+1)] - V[s(n)]$ (Sutton and Barto 1998).
Translating our model into the language of the temporal difference algorithm, the environmental state s(n) is the input pattern presented on trial n, the "policy" is determined by the synaptic strengths, and the action chosen is the set of output spike trains. Our model constitutes a special case in which future states s are independent of the action chosen, so the only reward prediction possible is the average reward given the network's current "policy." Thus E[Rwd(n)] = ⟨Rwd⟩ and δR = Rwd – ⟨Rwd⟩.
In our model, both the "environmental state" s (spike trains provided by the input units) and the "action chosen" (spike trains generated by the output units) are functions of time t in the trial. Thus the reward, average reward, and temporal difference error are all functions of time in the trial: δR(t) = Rwd(t) – ⟨Rwd(t)⟩. Ideally, ⟨Rwd(t)⟩ would be the reward received under a fixed "policy" (fixed synaptic strengths), averaged over many trials. Because the synaptic strengths change on every trial, this ideal is unobtainable and ⟨Rwd(t)⟩ is instead a running average of the reward recently received. At the end of each trial, after δR(t) has been calculated, ⟨Rwd(t)⟩ is updated as follows
![]() |
To use the temporal difference error to drive learning in our model, we simply multiply the synaptic changes of the unmodulated STDP rule by δR(t) instead of Rwd(t)

$$\Delta g_{ij} = \delta R(t)\,\varepsilon_i^{pre}\,\varepsilon_j^{post}\,F(\Delta t_{ij}) \tag{3}$$

Because δR(t) can be negative, this learning rule permits anti-Hebbian synaptic plasticity, where pre–post pairings induce LTD and post–pre pairings yield LTP. It is not difficult to envision circumstances under which formerly LTP-triggering patterns of activity are made to induce LTD instead (see RESULTS), but we feel that the conversion of LTD into LTP is less plausible. For this reason, Eq. 3 is applied with the following exception: if δR < 0 and F(Δtij) < 0, Δgij = 0.
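The modulation step and its exception can be sketched as below. The gating logic follows the rule stated above; the running-average update is only an assumed exponential average with a hypothetical `rate` parameter, since the exact update equation is not reproduced here, and both function names are illustrative.

```python
def modulated_change(delta_r, stdp_change):
    """Apply the reinforcement signal to an unmodulated STDP change.

    stdp_change is the change the plain STDP rule would produce for one
    spike pair (positive for pre-before-post, negative for post-before-pre).
    """
    if delta_r < 0 and stdp_change < 0:
        return 0.0  # exception: LTD is never converted into LTP
    return delta_r * stdp_change


def update_expected_reward(expected, rwd, rate=0.1):
    """Running average of reward per time bin (assumed exponential form)."""
    return [(1.0 - rate) * e + rate * r for e, r in zip(expected, rwd)]


if __name__ == "__main__":
    for delta_r in (0.5, -0.5):
        for change in (0.2, -0.2):
            print(delta_r, change, "->", modulated_change(delta_r, change))
```

The four printed cases cover the full sign table: better-than-expected trials leave the sign of STDP intact, while worse-than-expected trials turn potentiation into depression and suppress depression entirely.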
We quantified model performance using a modified version of the reward signal. To obtain a performance metric that did not depend on the number of spikes in the target spike train, we normalized the difference Δj(t) between the target spike train and the actual spike train by the number of spikes in the target train, $N_j^{spikes}$, thus replacing Δj(t) in Eq. 2 with $\Delta_j(t)/N_j^{spikes}$. To obtain a single number characterizing the performance of the network over a trial, we averaged this modified reward measure (denoted Rwd*) over the time in trial. Unfortunately, this results in a performance measure that is restricted to a relatively narrow range of values: performance in random networks is already ≈0.65, and networks that do not fire at all get an average modified reward of 0.88–0.92, depending on the target pattern. We therefore rescaled the performance measure to spread it over a wider range: Performance = (⟨Rwd*⟩ – 0.6) × 2.5, which reaches 1 when ⟨Rwd*⟩ = 1. This performance measure was used only to quantify the success at learning target patterns; it was never used to modulate plasticity or drive the learning process.
After exploring the capabilities of networks containing a single output neuron, we considered networks with multiple output neurons. At first, multineuron reinforcement learning was implemented as described above: each output neuron was assigned a distinct target spike train to reproduce, but all output units received the same reinforcement signal, which was simply derived from an average of the individual neuron rewards. As shown in RESULTS, this was not particularly successful for networks containing more than three or four output neurons. In subsequent multineuron training, the target output activity no longer took the form of distinct spike trains assigned to specific output neurons; instead, the target activity was expressed as the fraction of output neurons that were to fire at different times in the trial. For example, if the target pattern specifies that 25% of the output neurons be active at a particular time, the network's performance is evaluated without regard to which output neurons are firing; only the number active is relevant. To implement this idea, we define Oj(t) as the spike train generated by output neuron j convolved with a gaussian waveform of σ = 10 ms, and let G(t) denote the goal of learning, the "target pattern" that specifies what fraction of the output neural population should be active as a function of time in trial (naturally, 0 ≤ G(t) ≤ 1).
The difference between the actual output and the target output is given by Δ(t) = ⟨Oj(t)⟩ – G(t), where the angled brackets denote an average over the output neurons j. Then the reward is again $\mathrm{Rwd}(t) = e^{-\alpha|\Delta(t)|}$, and the reinforcement signal used to modulate synaptic plasticity is once again δR(t) = Rwd(t) – ⟨Rwd(t)⟩.
This procedure for comparing the output of a neural population to a desired population response G(t) and computing the reinforcement signal δR(t) is fairly straightforward, but it introduces a new complication. The magnitude of δR(t) depends on the magnitude of fluctuations in Rwd(t) across trials, and that in turn depends on the magnitude of fluctuations in ⟨Oj(t)⟩. As the number of output neurons increases, the variability in individual output neurons remains the same, and hence the variability of ⟨Oj(t)⟩ across trials should decrease as more neurons are included in the average. That will cause δR(t) to grow smaller in networks with more output neurons, and because the amplitude of Δgij is directly proportional to δR(t), synaptic plasticity will be suppressed in larger networks. The obvious solution is to add a factor to Eq. 3 that compensates for the shrinkage in δR(t) caused by increasing the number of output neurons N. If the Oj(t) varied independently from trial to trial, an approximate solution to this problem would be obtained by multiplying the Δgij calculated from Eq. 3 by the factor $\sqrt{N}$. However, the Oj(t) generally do not vary independently, because variation in Oj(t) is driven by variation in input activity, and all output neurons are driven by the same input units. The precise degree to which trial-to-trial fluctuations in Oj(t) are correlated depends on the synaptic matrix gij, which makes the appropriate choice of "correction factor" rather complicated. Preliminary simulations indicated that the $\sqrt{N}$ factor overcompensates for diminished Δgij in networks containing >25 output neurons for most gij attained over the course of training, and that, on average, |Δgij| was roughly one fifth of the mean magnitude occurring in one-neuron networks for any N ≥ 25. In view of these results, we adopted the effective, if inelegant, solution of multiplying the result of Eq. 3 by five in these simulations

$$\Delta g_{ij} = 5\,\delta R(t)\,\varepsilon_i^{pre}\,\varepsilon_j^{post}\,F(\Delta t_{ij}) \tag{4}$$
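The scaling argument can be made explicit with a short calculation. As a simplifying assumption for illustration only, suppose every output neuron has the same trial-to-trial variance s² at a given time and every pair of neurons shares the same correlation coefficient ρ (neither symbol appears in the original text).

```latex
\operatorname{Var}\bigl[\langle O_j \rangle\bigr]
  = \frac{1}{N^{2}} \sum_{j,k} \operatorname{Cov}(O_j, O_k)
  = \frac{s^{2}}{N}\,\bigl[\,1 + (N-1)\,\rho\,\bigr]
\qquad\Longrightarrow\qquad
\operatorname{SD}\bigl[\langle O_j \rangle\bigr] =
\begin{cases}
  s/\sqrt{N} & \text{if } \rho = 0,\\[4pt]
  \to\; s\sqrt{\rho} & \text{as } N \to \infty \text{ with } \rho > 0.
\end{cases}
```

With independent neurons the fluctuations shrink as 1/√N, so a factor of √N restores them. With shared input (ρ > 0) the fluctuations level off at a floor that no longer depends on N, so a factor that keeps growing as √N overcompensates in large networks, which is consistent with a fixed correction factor working better there.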
σ = 100 ms. This population response yields an average firing rate among output neurons of 1 Hz. The second type (bursty) consisted of four brief population bursts (each described by a gaussian of σ = 10 ms) placed randomly in the central 800 ms of the trial, but with a minimum interval of 50 ms between bursts to ensure that the bursts remained distinct. The burst heights were drawn from a normal distribution of mean 5 and variance 1, and the resulting G(t) was normalized to yield an average firing rate of 1 Hz across output neurons. The third type of target pattern was designed to assess our model's ability to learn an "arbitrary" waveform G(t). These "random" target patterns were produced in three stages (Fig. 7B, left). First, we generated 1,000 ms of zero-mean noise with a gaussian amplitude distribution of unit variance and a correlation time of 100 ms. Second, this noisy waveform was converted into a "probability of spiking." All negative portions of the waveform were set to zero, and regions approaching the bounds of temporally patterned synaptic input (at 100 and 900 ms) were rapidly, but not instantly, forced to zero by multiplication with a sigmoidal envelope E(t): $E(t) = (1 + e^{125\,\mathrm{ms} - t})^{-1}\,(1 + e^{t - 875\,\mathrm{ms}})^{-1}$. The result was normalized to give a probability distribution for "potential output spike times." In the third and final stage, N spike times (where N is the number of output neurons) were drawn from this probability distribution, and G(t) became the sum of N gaussians, each of height 1/N, σ = 10 ms, and centered on the randomly selected spike times. Like the other two types of target pattern, these G(t) correspond to an average firing rate of 1 Hz among the output neurons. Because all G(t) considered in this study specified an average rate of 1 Hz, we could directly adopt as our performance measure in these simulations the value of Rwd(t) averaged over time in trial, without correcting for the number of spikes expected in the output.
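The sigmoidal envelope used to confine the "random" target patterns can be evaluated directly. The sketch assumes the envelope is the product of two logistic functions with midpoints at 125 and 875 ms and a slope scale of 1 ms, with t in milliseconds; the function name is hypothetical.

```python
import math


def envelope(t_ms):
    """Sigmoidal envelope: near 0 outside ~125-875 ms, near 1 inside."""
    rise = 1.0 / (1.0 + math.exp(125.0 - t_ms))
    fall = 1.0 / (1.0 + math.exp(t_ms - 875.0))
    return rise * fall


if __name__ == "__main__":
    for t in (100, 120, 125, 130, 500, 870, 875, 900):
        print(t, envelope(t))
```

The envelope passes through one half at the two midpoints and changes from nearly 0 to nearly 1 within about 10 ms on either side, which is what "rapidly, but not instantly" suggests.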
RESULTS |
|---|
Unmodulated STDP destabilizes established mappings between spatiotemporal patterns of input and output activity
Our goal is to explore the possibility of using a modulated version of STDP to train a postsynaptic neuron to produce a desired spike train in response to a specific spatiotemporal pattern of input activity. To help motivate this, we first show the effect of the continuous, unmodulated application of STDP on a neuron that already generates a specific response to its patterned input. Figure 2 shows an example of a neuron that fires a single spike ≈600 ms into each trial, receiving input from 500 units that fire in a stereotypical pattern (Fig. 2A shows the probability of spiking for 1 such unit over the course of a trial) and 500 background units that fire randomly at an average rate of 5 Hz (uniform spike probability). The response pattern exhibited by the postsynaptic neuron (Fig. 2B) was created by making the synapses of presynaptic units active around 600 ms much stronger than all other synapses.
The operation of the normal STDP rule causes synapses active before 600 ms to grow stronger, so that the postsynaptic neuron eventually begins to fire shortly before the 600-ms mark. That, in turn, causes the depression of the synapses originally responsible for making the neuron fire at 600 ms and the potentiation of other synapses that were active earlier in the trial. In this way, the postsynaptic response occurs earlier and earlier, until it approaches the onset of the temporally patterned activity, 100 ms after the start of the trial (Fig. 2C). This phenomenon is well known in the STDP modeling literature and is typically presented as a boon: it is "predictive learning," whereby a neuron learns to respond to synaptic inputs that provide the earliest reliable prediction of its original response (Abbott and Blum 1996; Blum and Abbott 1996; Rao and Sejnowski 2001; Roberts 1999). However, there will inevitably be cases in which such "predictive learning" is not appropriate, and it seems likely that such cases could occur in cortical areas where STDP operates. It seems that some modulation of STDP is necessary simply to maintain stable mappings from presynaptic activity to postsynaptic response.
The simulation described above and shown in Fig. 2C assumes that changes in synaptic strength are made "additively," i.e., the magnitude of the change is independent of synaptic strength. This results in a strongly bimodal distribution of synaptic strengths (data not shown) that is characteristic of the additive implementation of STDP (Gütig et al. 2003; Kepecs et al. 2002; Rubin et al. 2001; Song et al. 2000; van Rossum et al. 2000). Some modeling studies of STDP assume that synaptic changes depend on current synaptic strength, with the magnitude of the changes biased toward potentiation or depression as synaptic strength approaches its lower or upper bounds, respectively (Gütig et al. 2003; Rubin et al. 2001). This is sometimes called "multiplicative" STDP (Gütig et al. 2003; Rubin et al. 2001), although that term is also applied to cases in which the magnitudes of both LTP and LTD increase with synaptic strength (Kepecs et al. 2002). Here, we adopt the terminology of Rubin et al. (2001) and Gütig et al. (2003). There is some experimental evidence for this phenomenon, at least for depressing changes in cultured hippocampal neurons (Bi and Poo 1998), and unlike additive STDP, it yields a unimodal distribution of synaptic strengths that can resemble the distribution of quantal amplitudes measured experimentally (Gütig et al. 2003; Rubin et al. 2001; van Rossum et al. 2000). Testing our model with multiplicative STDP, we found that there was still bias toward firing earlier as the simulation proceeded, but response changes were dominated by a large increase in firing rate (Fig. 2D). Multiplicative STDP causes an overall increase in synaptic strength, because the initial strengths of most synapses were relatively close to the lower bound. Although the details differ, the continuous application of either additive or multiplicative STDP inevitably destroys any specific patterned response to temporally patterned presynaptic activity.
Simplest implementation of STDP-driven reinforcement learning is only partially successful
We begin with an extremely simple implementation of STDP-driven reinforcement learning. The spike trains generated by the output neurons are compared with some desired "target" output, and from the difference, a reward signal is computed. We calculated the difference Δ(t) between the target output and the actual output by subtracting smoothed versions of their respective spike trains, generated by convolving the spike trains with a gaussian of an SD of σ (10 ms, in most cases). In choosing a specific form for the reward signal, we required that it depend only on the absolute difference between the target output and the actual output, i.e., it could not convey any "instructive" information about the kinds of changes needed such as whether the probability of firing at a particular time should be raised or lowered. We also wanted the reward function to map differences Δ(t) onto the interval (0, 1], with Δ = 0 generating a reward signal Rwd = 1 and Rwd → 0 as |Δ| increases. For networks containing a single output neuron, we chose to define the reward signal as $\mathrm{Rwd}(t) = e^{-\alpha|\Delta(t)|}$.
The reward signal was used to modulate synaptic plasticity simply by multiplying the synaptic changes triggered by a postsynaptic spike at time t according to the standard STDP rule by the value of the reward signal at time t. Thus STDP-driven changes are largest during times when the actual output matches the target output, and grow smaller as the difference between them increases. This modulation of STDP could be implemented biologically, for example, by modulation of N-methyl-D-aspartate (NMDA)-type glutamate receptors (NMDARs); many neuromodulators are known to affect NMDARs (Köles et al. 2001; MacDonald et al. 1998). One should note that this implementation of STDP-driven reinforcement learning requires that the appropriate modulatory signal be present at the same time the output spike train is being generated. That in turn implies that the system providing the modulatory signal must somehow predict how closely the spike train will match the target output before they can be compared directly. This is an onerous task, but it is not impossible. Because variations in the output are driven by variations in the input activity, a modulatory system that monitored activity in the input layer could in principle use that information to predict how well the resulting output will match the target, although such a system would have to constantly adapt as synaptic plasticity changes the mapping between input activity and output activity. Arguments about the plausibility of such a system are reserved for the DISCUSSION.
An example of the performance of this kind of model is shown in Fig. 3A, using the "additive" form of STDP. With the network in its initial state, the output neuron fired at a mean rate of 5.13 Hz (averaged over the entire 1-s trial), firing at fairly regular intervals starting shortly after the onset of temporally patterned input. The target spike train was a single spike fired 500 ms into the trial (Fig. 3A, bottom). Under training, the output neuron came to reliably fire an action potential (AP) shortly before the 500-ms mark and stopped firing during most other times (Fig. 3A, top). However, it also fired consistently ≈800 ms after trial onset.
This example highlights a fundamental problem with the model in its current form: it has no mechanism to remove "unwanted" spikes, e.g., the second spike fired in many trials shown in Fig. 3A. This spike arose because the network in its initial state had a high probability of firing at that time (800 ms), enough to potentiate synapses active just before that time even with minimal reward (see METHODS). If the network already reliably fires a spike at a certain time, there is no guarantee that it will cease to do so under training, even if the reward signal at that time is strictly zero. There is another problem that is not shown in Fig. 3A, but which is readily apparent. If the network begins in a state in which it never fires a spike at a particular time, it can never learn to fire at that time no matter how large the "reward"—in STDP, no plastic changes occur in the absence of postsynaptic spikes.
Inclusion of activity-dependent synaptic scaling and anti-Hebbian STDP enables accurate reinforcement learning
There are forms of synaptic plasticity that do not depend on correlations between pre- and postsynaptic activity, such as the activity-dependent scaling of synaptic strength reported in isocortical neurons (Turrigiano et al. 1998). This form of plasticity causes synaptic strength, as measured by the amplitude distribution of miniature excitatory postsynaptic currents, to increase if postsynaptic activity is suppressed by application of tetrodotoxin. Activity-dependent synaptic scaling or other homeostatic mechanisms for maintaining postsynaptic activity could solve one of the problems our current model faces: the inability to learn to fire at times when the starting network never fires. We incorporated activity-dependent scaling of synaptic strength into our model to test this hypothesis. The other major problem with our current model, difficulty in removing unwanted spikes in the output train, might be solved by allowing some form of "anti-Hebbian" STDP to occur under certain conditions. Examples of anti-Hebbian STDP, in which postsynaptic APs following EPSPs induce LTD rather than LTP, have been reported in a cerebellum-like structure in the electric fish (Bell et al. 1997) and at some synapses in the mouse dorsal cochlear nucleus (Tzounopoulos et al. 2004).
If anti-Hebbian STDP is to be used in our model, we must carefully consider how to apply it in a way that supports reinforcement learning. Guidance on this question can be found in the literature on an important algorithm for reinforcement learning known as temporal difference learning (Sutton and Barto 1998). In temporal difference learning, adaptive changes are not directly driven by the reward; rather, they are driven by the difference between the future expected reward at one trial and the actual reward (plus an updated future expected reward) received on the next trial. In our model, this difference, denoted ΔR(t), is the difference between reward as a function of time in trial (the actual reward) and the average reward received over the last few trials (the expected reward): ΔR(t) = Rwd(t) − ⟨Rwd(t)⟩. This temporal difference signal is used to modulate synaptic plasticity by multiplying the synaptic changes triggered by a postsynaptic spike at time t according to the standard STDP rule by ΔR(t). Whenever ΔR(t) < 0, i.e., whenever network performance is worse than its average performance over the last few trials, synaptic plasticity is anti-Hebbian, with one wrinkle noted below.
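The temporal difference signal described above can be sketched in a few lines. This is an illustration only: the function name, the array layout, and the use of a plain mean over a small window of recent trials are assumptions, since the text says only that the expected reward is averaged over "the last few trials."

```python
import numpy as np

def temporal_difference_signal(rwd_now, rwd_recent):
    """Return dR(t) = Rwd(t) - <Rwd(t)> for one trial.

    rwd_now    : 1-D array, reward as a function of time in the current trial
    rwd_recent : 2-D array (trials x time), rewards from the last few trials
                 (window length and plain averaging are assumptions)
    """
    expected = np.mean(rwd_recent, axis=0)  # <Rwd(t)>, the expected reward
    return np.asarray(rwd_now, dtype=float) - expected

# Toy example with three time points and a two-trial history.
recent = np.array([[0.2, 0.4, 0.6],
                   [0.4, 0.4, 0.2]])
now = np.array([0.5, 0.4, 0.1])
delta_r = temporal_difference_signal(now, recent)
```

In this toy case the signal is positive where the current trial beats the recent average, zero where it matches, and negative where it falls short.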
Although one of the first experimental studies of STDP observed anti-Hebbian STDP in the brain stem (Bell et al. 1997), until recently, cortical STDP studies uniformly reported Hebbian plasticity. This raises the question of whether it is plausible enough to be included in a model aiming at a moderate degree of biological realism: could anti-Hebbian STDP be implemented by forms of neuromodulation that have already been observed in the isocortex? The fact that both LTP and LTD are triggered by increases in postsynaptic [Ca2+] suggests that it could. Because LTD is triggered by small increases in [Ca2+], whereas LTP appears with larger Ca2+ transients (Cho et al. 2001; Cormier et al. 2001; Ismailov et al. 2004; Yang et al. 1999), reducing the amount of Ca2+ influx resulting from pairing EPSPs with postsynaptic APs could make normally potentiating patterns of activity induce LTD instead. Indeed, partial blockade of NMDARs does exactly that (Cummings et al. 1996; Froemke et al. 2005; Nishiyama et al. 2000), indicating that simple modulation of NMDARs, already proposed above, could suffice to implement our STDP-based version of temporal difference learning. However, such a simple mechanism could not support fully anti-Hebbian plasticity: although formerly LTP-inducing pairings would yield LTD, it is hard to see how formerly LTD-inducing pairings could cause potentiation. Furthermore, most recent studies that have found anti-Hebbian STDP in the isocortex report mainly pre–post LTD (Sjöström and Häusser 2006), although some post–pre LTP has been reported at distal synapses, attributed to the delay between the first postsynaptic spike and the time of maximal dendritic depolarization (Letzkus et al. 2006). Consequently, we do not incorporate fully anti-Hebbian plasticity into our model; if ΔR(t) < 0, pairings that would normally trigger synaptic depression do not change synaptic strength at all.
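The modulation rule, including the restriction on the anti-Hebbian branch, can be summarized as a small function. This is a minimal sketch rather than the model's actual implementation: `dw_stdp` stands for whatever change the standard STDP rule would prescribe for one pairing (positive for LTP, negative for LTD), and both names are hypothetical.

```python
def modulated_weight_change(dw_stdp, delta_r):
    """Scale a standard STDP weight change by the reinforcement signal.

    dw_stdp : change prescribed by the unmodulated STDP rule
              (> 0 for pre-before-post LTP, < 0 for post-before-pre LTD)
    delta_r : temporal difference signal at the time of the postsynaptic spike
    """
    if delta_r >= 0:
        # Performance at or above the recent average: ordinary (Hebbian) STDP,
        # scaled by the size of the reinforcement signal.
        return delta_r * dw_stdp
    if dw_stdp > 0:
        # Performance below average: normally potentiating pairings depress.
        return delta_r * dw_stdp
    # Performance below average: normally depressing pairings do nothing,
    # so the rule is not fully anti-Hebbian.
    return 0.0
```

For example, `modulated_weight_change(0.5, -0.2)` yields a depression, whereas `modulated_weight_change(-0.5, -0.2)` leaves the synapse unchanged.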
Figure 3B shows the performance of our model with activity-dependent synaptic scaling and anti-Hebbian STDP included, where the initial synaptic weights, patterns of input activity, and target output are the same as in Fig. 3A. As training progresses, the network now learns to produce the target output, using either additive (Fig. 3B) or multiplicative (data not shown) STDP, with quite accurate performance after fewer than 1,000 trials of training. Figure 3C shows how activity-dependent synaptic scaling permits the output neuron to learn to generate spikes at times when it originally never fired. The simulation starts after 5,000 trials of training to fire at 500 ms (the point reached at the end of the raster in Fig. 3B), but now the target output is switched to a single spike at 700 ms (Fig. 3C, bottom). Anti-Hebbian STDP causes the neuron to stop firing at 500 ms, at which point synaptic scaling increases overall synaptic strength until new spikes appear, including spikes near 700 ms (Fig. 3C, left). In addition to synaptic scaling, cortical neurons can also adjust their intrinsic excitability in response to lasting changes in activity level (Desai et al. 1999). We wanted to know if alternative forms of activity homeostasis like this could substitute for synaptic scaling in our model. We modeled excitability homeostasis by having the AP threshold adapt if postsynaptic activity levels remained too low or too high. We found that excitability homeostasis could substitute for synaptic scaling (Fig. 3C, right); our model requires some form of activity homeostasis, but is not especially sensitive to the specific form it takes. However, excitability homeostasis would ultimately have to be supplemented by some other process, because the AP threshold drops every time the target output is changed and no circumstances normally arise to bring it back up. For this reason, we used synaptic scaling for all subsequent simulations.
Model performance is sensitive to the width of gaussians used to smooth spike trains
As explained above, the reward signal that drives learning is calculated from the difference between a target spike train and the actual spike train generated by the output neuron. To compute that difference, the spike trains are convolved with a gaussian whose width specifies the temporal precision demanded of the model. Thus far, we have used a gaussian with a SD (σ) of 10 ms. This choice is arbitrary; therefore we examined model performance under different values of σ. One might expect that if σ is too small (if the level of temporal precision demanded is too high), the model will be unable to learn to produce the target spike train. That is indeed the case: performance is substantially degraded if σ is just 5 ms, as shown in Fig. 4A. On the other hand, one might expect that model performance would improve, or at least remain unchanged, with larger σ. That is not the case. The "predictive" aspect of the STDP rule shown in Fig. 2 manifests itself as σ increases: rather than firing spikes near the peak of the gaussian, at 500 ms, output neurons learn to fire earlier in the trial with larger σ. If σ is large enough, the network will sometimes come to fire two spikes, neither of which occurs at the target time of 500 ms (Fig. 4B). The final average reward achieved after training, our chosen measure of overall performance, is plotted in Fig. 4C for different values of σ (see Supplemental Fig. 1 to see how this performance measure is related to the activity generated by the network). To maintain a consistent measure of model performance, the final rewards plotted in Fig. 4C were computed using a 10-ms gaussian, although the reward signal used to drive learning was computed using the specified σ (5–100 ms). Figure 4D shows PSTHs of the output activity of the model after training with different values of σ.
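To make the role of the smoothing width concrete, the sketch below smooths a target spike and a slightly mistimed output spike with gaussians of different SD and compares them. The unit-peak kernel, the 1-ms time grid, and the area-normalized mismatch measure are all assumptions chosen for illustration; they are not the reward computation specified in METHODS.

```python
import numpy as np

def smooth_spike_train(spike_times_ms, sigma_ms, trial_ms=1000):
    """Sum of unit-peak gaussians (SD = sigma_ms) centered on each spike."""
    t = np.arange(trial_ms, dtype=float)
    spikes = np.asarray(spike_times_ms, dtype=float)
    return np.exp(-0.5 * ((t[:, None] - spikes[None, :]) / sigma_ms) ** 2).sum(axis=1)

def relative_mismatch(target_ms, output_ms, sigma_ms):
    """Area between the two smoothed trains, relative to the target's area."""
    target = smooth_spike_train(target_ms, sigma_ms)
    output = smooth_spike_train(output_ms, sigma_ms)
    return np.abs(target - output).sum() / target.sum()

# An output spike that is 5 ms late, judged at three levels of precision.
mismatch_by_sigma = {s: relative_mismatch([500.0], [505.0], s) for s in (5.0, 10.0, 50.0)}
```

The same 5-ms timing error is penalized most heavily at the smallest width and is nearly invisible at the largest, which is the sense in which the width sets the temporal precision demanded of the model.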
Thus far, we only examined the ability of the model to learn to generate a "spike train" consisting of a single spike. The model can learn to produce more arbitrary spike trains, but not as consistently. Figure 5 shows two examples of the model trained on target trains containing three spikes: one successfully (Fig. 5A) and one not (Fig. 5B). Figure 5C shows the final performance of networks taught to produce randomly generated spike trains containing no more than five APs; each bar represents the average final performance of 10 networks (each with a different target spike train). Average performance declines as the number of spikes in the target train increases—the means are significantly different (ANOVA, P = 0.004) and there is a trend toward lower performance with more spikes (P = 0.002, slope = –0.0035, r2 = 0.17). We suggest two possible explanations for this trend. First, spike trains with more APs are more likely to contain short interspike intervals (ISIs), and these could pose a problem because 1) the AHP makes it more difficult to reach spike threshold again shortly after spiking and 2) synapses active shortly after the first AP that could help trigger a second AP will be subjected to LTD caused by the depressing portion of the STDP rule. Furthermore, target trains with more spikes may be more difficult to learn as synaptic adjustments that drive spiking at one time may interfere with the model's ability to remain silent at other times. This is because presynaptic units fire at several distinct times throughout the trial, and thus strengthening the synapse of a presynaptic unit because it is active at one time may also increase synaptic drive during times when the output neuron should not fire. As the number of spikes in the target train increases, one might expect this problem to grow less manageable. This problem would be circumvented if each presynaptic unit fired only a single burst during a trial. 
Indeed, if we use networks receiving this kind of input, performance is improved for all target spike trains (P < 0.0001 for all 5 groups, Mann-Whitney test), with spike trains containing more spikes showing the largest improvement (Fig. 5D). With this pattern of presynaptic input, there are no longer any significant differences in average performance among target trains with different numbers of spikes (ANOVA, P = 0.78). This suggests that short ISIs may not pose any serious difficulty for this model, but when the data in Fig. 5D are plotted as a function of the minimum ISI occurring in the target spike train (Fig. 5E), one sees that the worst performance occurs with target trains containing shorter ISIs. We systematically explored the effect of ISI on performance using two-spike target trains (Fig. 5F) and found that average performance on shorter ISIs (<60 ms) was significantly worse than on longer ISIs (80–100 ms; ANOVA followed by Tukey's multiple comparison test, using P ≤ 0.05 as the criterion for significance). Results using multiplicative STDP were similar, except that networks trained under multiplicative STDP systematically fired output spikes a few milliseconds earlier in the trial than those trained with additive STDP (Supplemental Fig. 2).
Learning in networks with multiple output neurons
We showed that our reward-modulated version of STDP is capable of training a single output neuron to produce an arbitrary spike train in response to temporally patterned synaptic input and that performance is best when 1) the input units fire only one burst per trial, 2) the target spike train does not contain ISIs shorter than 80 ms, and 3) the additive implementation of STDP is used. However, realistic learning tasks will entail training a population of output neurons to produce some target pattern of activity. This target pattern might specify distinct target spike trains for each output neuron, in a direct extension of our single-neuron model. This task would be trivial if each output neuron received its own individually tailored reinforcement signal, but it proves to be quite difficult if one global reinforcement signal is broadcast to all output neurons, calculated from the average of the rewards that each output neuron would have been assigned had they been trained individually (Supplemental Fig. 3).
Although the model fails to accurately learn target spike trains with as few as five neurons in the output layer, demanding that every output neuron in the network learn to produce a specific spike train is probably unreasonable and unrealistic. For most realistic tasks, the necessary pattern of output activity can probably be realized by many different sets of specific spike trains generated in the output population. To model this situation, we defined the target output as a time-varying function specifying the fraction of output neurons that should be active over the course of a trial, regardless of which specific output neurons are active at any time. For example, the task being learned could require that output neurons gradually become more active over a trial, peaking in the middle of the trial and declining as the trial concludes. Such a situation is shown in Fig. 6, where the target pattern of activity is a 100-ms-wide gaussian centered at 500 ms into the trial, with a peak value of 0.1. Network output is represented by smoothing the individual spike trains with a 10-ms-wide gaussian and averaging over all output neurons, yielding a single waveform that gives the fraction of cells active over time in trial. The reinforcement signal is then calculated in a manner directly analogous to the single neuron case: we calculate Rwd(t) by subtracting the "fraction active" waveform from the target pattern, taking the absolute value, and exponentiating the result, which is used to calculate the temporal difference reinforcement signal, ΔR(t).
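The population reward calculation can be sketched as follows. Several details here are assumptions made for illustration: "N-ms-wide" is read as a gaussian SD of N ms, the smoothing kernel has unit peak so that the averaged waveform reads as a fraction of active cells, and the exponentiation is taken as exp(−|difference|/scale) with a hypothetical `scale` parameter so that reward is largest when output matches target. The exact form is specified in METHODS.

```python
import numpy as np

def fraction_active(spike_trains_ms, sigma_ms=10.0, trial_ms=1000):
    """Smooth each neuron's spikes with a unit-peak gaussian, then average."""
    t = np.arange(trial_ms, dtype=float)
    total = np.zeros(trial_ms)
    for spikes in spike_trains_ms:
        s = np.asarray(spikes, dtype=float)
        total += np.exp(-0.5 * ((t[:, None] - s[None, :]) / sigma_ms) ** 2).sum(axis=1)
    return total / len(spike_trains_ms)

def population_reward(target, output, scale=1.0):
    """Rwd(t): decreasing function of |target - output| (sign and scale assumed)."""
    return np.exp(-np.abs(target - output) / scale)

t = np.arange(1000, dtype=float)
# Broad gaussian target centered at 500 ms with peak value 0.1.
target = 0.1 * np.exp(-0.5 * ((t - 500.0) / 100.0) ** 2)

# Ten output neurons, only one of which fires (at 500 ms).
trains = [[500.0]] + [[] for _ in range(9)]
output = fraction_active(trains)
rwd = population_reward(target, output)
```

With one of ten neurons firing at 500 ms, the output waveform reaches 0.1 at that instant and the reward there equals its maximum of 1, while the reward is lower wherever the narrow output bump falls short of the broad target.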
In the example shown in Fig. 6, the output layer contains 100 neurons, input units fire only one burst per trial, and additive STDP is used. After 5,000 trials of training, an aggregate PSTH generated by adding together the PSTHs of all output neurons (Fig. 6A, top, collected over 500 trials) reveals that the output neurons collectively generate a reasonable copy of the target pattern (Fig. 6A, middle). The ability to learn target patterns like this requires that output neurons be trained together as a population. This is shown by examining the aggregate behavior of an ensemble of 100 networks, each containing one output neuron (Fig. 6A, bottom). Each of these networks was separately trained on the broad gaussian target pattern, but they could not reproduce this pattern individually (also shown in Fig. 4) or collectively (Fig. 6A, bottom).
Although the activity of output neurons in this example collectively approximates the target pattern, the individual neurons within the population do not. The majority of output neurons consistently fire at or near a particular, neuron-specific time on every trial after training ("temporally specific" neurons; example shown in Fig. 6B, top). Other output neurons can fire at almost any time in a trial (Fig. 6B, middle), whereas a third group combines these two response patterns (Fig. 6B, bottom) or tends to fire at two or more discrete times during a trial. If we define the "temporal specificity" of a neuron's response pattern as the percentage of spikes it fires within 15 ms of its most probable firing time, we find that there is a bimodal distribution of temporal specificity among output neurons trained on this target pattern (Supplemental Fig. 4B, top). If we classify those neurons firing >60% of their spikes within 15 ms of their most probable firing time as "temporally specific," we find that 64% of output neurons can be so designated. The distribution of times at which these temporally specific neurons are most likely to fire reproduces the central part of the gaussian pattern that the network as a whole generates (Supplemental Fig. 4B, bottom).
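The temporal specificity measure defined above is straightforward to compute from pooled spike times. In this sketch, the most probable firing time is estimated as the center of the fullest 1-ms PSTH bin; that estimator, the bin width, and the function name are assumptions, as the text does not say how the most probable time was determined.

```python
import numpy as np

def temporal_specificity(spike_times_ms, window_ms=15.0, bin_ms=1.0, trial_ms=1000.0):
    """Percentage of spikes within window_ms of the most probable firing time."""
    spikes = np.asarray(spike_times_ms, dtype=float)
    if spikes.size == 0:
        return float("nan")
    edges = np.arange(0.0, trial_ms + bin_ms, bin_ms)
    counts, _ = np.histogram(spikes, bins=edges)
    peak = int(np.argmax(counts))
    t_peak = 0.5 * (edges[peak] + edges[peak + 1])  # center of the fullest bin
    return 100.0 * np.mean(np.abs(spikes - t_peak) <= window_ms)

# Spike times pooled over trials for one hypothetical neuron.
pooled = [500.2, 500.4, 500.6, 505.0, 490.0, 700.0, 100.0, 300.0]
specificity = temporal_specificity(pooled)
is_temporally_specific = specificity > 60.0
```

Here five of the eight pooled spikes fall within 15 ms of the peak at 500.5 ms, so the neuron would just clear the 60% criterion.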
The top PSTH in Fig. 6A shows how well the output of this network averaged over 500 trials reproduces the target pattern. However, it does not tell us whether the network reproduces this pattern on individual trials; it is possible that the output is highly variable and that the gaussian shape of Fig. 6A emerges only after summing the results of many trials, which would not constitute a very successful example of learning. To show the activity on individual trials, we instead plot the smoothed spike trains averaged over all output neurons, i.e., the "fraction active" waveform that is used to compare output activity to target activity. The top graph of Fig. 6C plots three examples of output activity on individual trials (thin black lines) along with the mean "fraction active" waveform (thick black line) and the target pattern (thick gray line). Although the average activity pattern does closely resemble the target pattern, the activity on individual trials varies considerably from trial to trial. The bottom graph of Fig. 6C shows the range of activity exhibited on individual trials by plotting the 95% CI bounds (thin black lines) for these waveforms; the thick black line is the median activity, which differs somewhat from the mean activity.
Some trial-to-trial variability in output activity is driven by variations in input activity, and indeed the model's ability to learn depends on this variability. We might also expect this variability to decrease as the number of output neurons increases, because variations in the activity of individual output neurons would make smaller fractional contributions to the overall activity pattern and would tend to average out. However, this is not true under the conditions pertaining to this model (see METHODS). This is because the activity in the output neurons does not vary independently; the all-to-all connectivity pattern between the input and output layers causes correlations in these variations in output neuron activity (to a degree that depends on the synaptic matrix gij). Although the all-to-all connectivity pattern was a practical choice for modeling purposes, it probably does not reflect a connection pattern common in the vertebrate CNS. Even if, for example, one cortical area projects strongly to another, it is unlikely that all individual neurons in the recipient region receive input from exactly the same set of presynaptic neurons. We tested whether the high trial-to-trial variability of Fig. 6C is caused by correlated fluctuations in synaptic drive received by output neurons by "decorrelating" the synaptic input. We generated a separate set of presynaptic spikes for each output unit, but where each set is drawn from the same probability distribution (script). Thus the output neurons receive the same average synaptic input as before, but the trial-to-trial fluctuations in presynaptic activity are now independent across output neurons. When the network shown in Fig. 6C is driven by such "uncorrelated" input, the trial-to-trial variability in output activity is greatly reduced and the output more closely matches the target activity on individual trials (Fig. 6D).
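The difference between the shared and "decorrelated" input conditions can be illustrated with a short sketch. It assumes that each input unit's script is available as a firing rate per 1-ms bin and approximates the inhomogeneous Poisson process with one Bernoulli draw per bin, which is reasonable only when the rate times the bin width is small; the function names and array layout are hypothetical.

```python
import numpy as np

def poisson_spikes(rate_hz, dt_ms, rng):
    """Bernoulli-per-bin approximation to an inhomogeneous Poisson process."""
    return rng.random(rate_hz.shape) < rate_hz * dt_ms / 1000.0

def correlated_input(rates, n_out, dt_ms, rng):
    """One set of presynaptic spikes shared by every output neuron."""
    spikes = poisson_spikes(rates, dt_ms, rng)           # (n_in, n_bins)
    return np.broadcast_to(spikes, (n_out,) + spikes.shape)

def decorrelated_input(rates, n_out, dt_ms, rng):
    """An independent draw from the same rate script for each output neuron."""
    tiled = np.broadcast_to(rates, (n_out,) + rates.shape)
    return poisson_spikes(tiled, dt_ms, rng)              # (n_out, n_in, n_bins)

rng = np.random.default_rng(0)
rates = np.full((20, 1000), 50.0)   # 20 input units, constant 50 Hz for simplicity
shared = correlated_input(rates, 5, 1.0, rng)
independent = decorrelated_input(rates, 5, 1.0, rng)
```

Both conditions have the same expected input per output neuron; only the trial-to-trial fluctuations differ, being identical across output neurons in the first case and independent in the second.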
We now consider the model's capabilities for learning a wider range of target patterns. We begin with a class of target patterns that is quite distinct from the single broad gaussian used thus far: a series of large but brief population bursts. These "bursty" patterns consist of four randomly placed 10-ms-wide gaussians whose height specifies the fraction of neurons that should participate in that burst. Figure 7A shows an example of a network trained on such a target pattern, with the two leftmost graphs directly paralleling Fig. 6C—the top left graph plots activity from individual trials (thin black lines) and the mean activity (thick black line), whereas the bottom left graph shows the 95% CI for output activity (thin black lines) and the median activity (thick black line); the target pattern is shown on all graphs as a thick gray line. This bursty target pattern is reproduced with considerably greater fidelity than the broad hump of Fig. 6, so much so that it is difficult to distinguish the individual lines on the leftmost graphs of Fig. 7A. The middle graphs of Fig. 7A zoom in on the two central bursts, showing how both the height and timing of the bursts are fairly well matched to the target pattern, with comparatively little trial-to-trial variation even with normal "correlated" inputs. If the output neurons are driven by "uncorrelated" inputs as in Fig. 6D, the output variability drops further (Fig. 7A, right).
Having considered target patterns with both widely and narrowly temporally distributed patterns, we now examine our model's ability to reproduce "random" target patterns, shown in the leftmost column of Fig. 7B. From a 1,000-ms waveform drawn from a gaussian noise distribution with a correlation time of 100 ms (Fig. 7B, top left), we generate a "probability of spiking" by clipping the portions of the waveform that are negative or that approach the limits of temporally patterned input, 100 and 900 ms into the trial, and normalize the result (Fig. 7B, middle left). We use this probability waveform to randomly select locations for N 10-ms-wide gaussians, each of height
, where N is the number of neurons in the output layer of the network; the sum of these N gaussians gives us the target pattern of activity (Fig. 7B, bottom left). This last part of the procedure guarantees that the target pattern can actually be generated by N output neurons whose spike trains are smoothed with 10-ms-wide gaussians. Our model can learn to reproduce such patterns on average, but with a substantial amount of trial-to-trial variability (Fig. 7B, middle). With "uncorrelated" inputs, variability is decreased and network activity on individual trials now more closely resembles the target pattern (Fig. 7B, right).
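The procedure for building a "random" target pattern can be sketched as below. Several choices are assumptions made only for illustration: the correlated noise is produced by smoothing white noise with a gaussian kernel whose SD equals the correlation time, "10-ms-wide" is read as an SD of 10 ms, and each gaussian is given height 1/N so that N neurons firing one spike each could reproduce the pattern. The function and parameter names are hypothetical.

```python
import numpy as np

def random_target(n_neurons, rng, trial_ms=1000, corr_ms=100.0, sigma_ms=10.0):
    """Return (spike probability, gaussian centers, target pattern)."""
    t = np.arange(trial_ms, dtype=float)
    k_t = np.arange(-3 * corr_ms, 3 * corr_ms + 1)
    kernel = np.exp(-0.5 * (k_t / corr_ms) ** 2)
    while True:
        # Temporally correlated gaussian noise (smoothing method is an assumption).
        noise = np.convolve(rng.standard_normal(trial_ms), kernel, mode="same")
        prob = np.clip(noise, 0.0, None)        # discard negative portions
        prob[(t < 100) | (t > 900)] = 0.0       # stay within the patterned input
        if prob.sum() > 0:                      # redraw if nothing survives clipping
            break
    prob /= prob.sum()                          # normalize to a probability
    centers = rng.choice(t, size=n_neurons, p=prob)
    height = 1.0 / n_neurons                    # assumed height of each gaussian
    bumps = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / sigma_ms) ** 2)
    return prob, centers, height * bumps.sum(axis=1)

rng = np.random.default_rng(1)
prob, centers, target = random_target(100, rng)
```

Because the target is a sum of N gaussians of the same shape as a smoothed spike, it is by construction a pattern that N output neurons could generate exactly.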
We assessed our model's performance on these three types of target pattern (100-ms-wide gaussian, bursty, random) as the number of neurons in the output layer was varied (Fig. 7C). A two-way ANOVA showed that both neuron number and pattern type contribute to the variation in final performance and that these two factors interact (P < 0.0001 in all cases). The results of Fig. 7C suggest that the dependence of performance on neuron number, and its interaction with pattern type, is caused entirely by the fact that networks with fewer output neurons do a relatively poor job of reproducing the broad gaussian target pattern. This is confirmed by rerunning a two-way ANOVA with the 100-ms-wide gaussian data omitted; now only pattern type (P < 0.0001) and not neuron number (P = 0.86) contributes to performance differences, with no interaction between the two factors (P = 0.96). Unlike the bursty and random pattern types, the broad gaussian cannot be precisely mimicked by any network; it can only be approximated, and the best achievable approximation improves as the number of output neurons increases.
Thus far, networks trained to reproduce a "population response" have used only additive STDP, with input units governed only by "one-burst" scripts, i.e., input units that each fire just one short burst of spikes on each trial. When networks using multiplicative STDP were tested on this task, we found that they consistently failed to reproduce any of the target pattern types we considered (Fig. 7D). The multiplicative rule's bias toward LTP was probably a factor, because these networks always overshot the target activity pattern (examples shown in Fig. 7D, right). Networks using "regular scripts," where a given input unit could fire at multiple distinct times within a trial, were also unable to learn most types of population responses (Supplemental Fig. 5).
In the simulations described above, we trained each network on just one target pattern, as might be the case in, for example, song learning in birds whose repertoire is limited to one song. However, more generally a neural population may learn to produce different responses to different patterns of synaptic input or to generate a continuous mapping between input and output. Although we will not attempt a full exploration of our model's capacity for learning multiple input–output pairings, we did establish that this model can learn to produce at least eight distinct output patterns in response to distinct input patterns (Supplemental Fig. 6).
Model performance with simplified versions of the STDP rule
The implementation of the STDP rule used in our model is fairly complicated, incorporating not only the relative timing of pre- and postsynaptic spikes, but also a dependence on the firing history of the presynaptic and postsynaptic neurons. We chose this implementation not because of its value for reinforcement learning but because experimental studies suggest that these additional factors influence the synaptic changes induced by STDP protocols (Froemke and Dan 2002; Froemke et al. 2006; Wang et al. 2005; Wittenberg and Wang 2006). Our results showed that this particular form of spike history dependence is not fatal to our model. However, the history dependence used, taken from Froemke and Dan (2002), is not unique; Froemke and Dan themselves published a modified version of this rule (Froemke et al. 2006) for the same kind of synapses. Furthermore, different kinds of synapses could show distinct forms of history dependence. We did not attempt to investigate our model's performance under all reasonable forms of history dependence. Rather, we simply sought to determine whether our model's success depends on the specific form used here. Thus we examined the performance of our model in the absence of any history dependence.
The spike history dependence of the STDP rule of Froemke and Dan (2002) is derived from their "spike suppression model," where the efficacy of a spike at inducing synaptic changes is suppressed by the occurrence of preceding spikes in the same neuron. To remove spike history dependence from the STDP rule, we omit the presynaptic and postsynaptic "spike efficacy" factors from the rule (Eq. 4). We tested this simplified STDP rule in five networks containing 100 output neurons and trained on "random" target patterns. As shown in Fig. 8A, these networks (filled circles) initially learn faster than control networks that include spike suppression (open circles), but this performance peaks after about 500 trials and begins a slow decline, ending after 5,000 trials at a performance level that is lower on average than the control networks. Although this difference is not quite significant (P = 0.056, Mann-Whitney test), it is an alarming trend for our model's success. The example shown in Fig. 8B illustrates the proximate cause of this steady decline in performance. After 500 trials, this network, lacking the spike suppression mechanism, generates an output pattern that is as good a copy of the target pattern (Fig. 8B, top) as the full model could generate after 10 times as much training. As training proceeds, however, the network fails to maintain the sustained elevation of activity appearing in the first half of the target pattern; the activity plateau generated by the network gradually shortens and by the end of training is reduced to a brief population burst at the onset of the early activity plateau demanded by the target pattern (Fig. 8B, bottom). Because this loss of activity would cause many of the output neurons to fire at average rates significantly <1 Hz, the homeostatic plasticity mechanism is engaged and instigates a compensatory increase in baseline firing rate (Fig. 8B, bottom). The process shown here is repeated in the other simulations using the STDP rule without spike suppression: sustained bouts of activity specified in the target patterns are gradually shortened, accompanied by an increase in baseline firing.
This effect is caused by the fact that our performance-modulated version of the STDP rule is biased toward LTD relative to the traditional unmodulated form of the rule because its anti-Hebbian regimen (applied whenever ΔR < 0) includes only LTD, whereas both LTP and LTD are possible when ΔR > 0. This issue becomes pertinent when the network generates a sustained bout of activity. Most of the spikes in presynaptic bursts that are responsible for maintaining this activity occur before the postsynaptic spikes they trigger, but when the presynaptic activity patterns consist of high-frequency bursts (as in the 1-burst scripts used here), one or two of the later spikes in the burst can occasionally occur after the postsynaptic spike triggered by earlier spikes. If ΔR > 0 at this time on a given trial, these final presynaptic spikes will induce LTD at the relevant synapse. On such trials, this LTD is more than counterbalanced by LTP induced by the majority of spikes in the burst occurring before the postsynaptic spike, but on trials in which ΔR < 0 (i.e., performance is worse than the recent average performance), the presynaptic spikes occurring before the postsynaptic spike trigger LTD without any compensatory LTP induced by the few presynaptic spikes that may appear after the postsynaptic spike. Thus once the network has reached a plateau in performance when ΔR is just as likely to be negative as positive, the changes induced at these synapses averaged over several trials will be slightly depressing, gradually eroding the sustained bout of activity that the network is supposed to generate. The inclusion of the spike suppression mechanism avoids this by suppressing the contributions of later spikes in presynaptic bursts, the only spikes that can trigger LTD when ΔR > 0, since the average ISI in these bursts (6.3 ms) is considerably shorter than the recovery time constant for presynaptic spike suppression (28 ms). With the spike suppression mechanism in place, a plateau in performance (⟨ΔR⟩ = 0) now produces an approximate balance between LTP and LTD.
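A quick calculation shows how strongly later spikes in a burst are suppressed at these time scales. The sketch assumes the exponential-recovery form of the spike suppression model of Froemke and Dan (2002), in which a spike's efficacy is 1 − exp(−ISI/τ) with ISI measured from the preceding spike in the same neuron; whether Eq. 4 uses exactly this form cannot be confirmed from this section, and the function name is hypothetical.

```python
import math

def spike_efficacy(isi_ms, tau_ms):
    """Efficacy of a spike given the interval since the previous spike
    (assumed exponential recovery with time constant tau_ms)."""
    return 1.0 - math.exp(-isi_ms / tau_ms)

# Later spike in a burst: average ISI of 6.3 ms, presynaptic recovery of 28 ms.
within_burst = spike_efficacy(6.3, 28.0)

# First spike of a burst, arriving after roughly a second of silence.
burst_onset = spike_efficacy(1000.0, 28.0)
```

Under this assumption a spike arriving 6.3 ms after its predecessor retains only about 20% of its efficacy, whereas the first spike of a burst is essentially unsuppressed, consistent with the statement that mainly the first spike in a burst "counts."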
If the main advantage of spike suppression is to counter the depressing portion of the basic STDP rule, networks lacking both spike suppression and post-before-pre-LTD should perform at least as well as control networks incorporating both features. We tested this by running simulations in which the presynaptic and postsynaptic "spike efficacy" factors are omitted, as above, and the parameter governing the size of post-before-pre LTD (A–, Eq. 1) is set to zero. Now learning is quite rapid (Fig. 8C, filled circles), and performance achieves an asymptotic level well above the values attained by control networks (P = 0.008, Mann-Whitney test; 2 examples shown in Fig. 8D). If spike suppression is used while A– = 0 (Fig. 8C, filled triangles), performance is significantly worse (P = 0.008, Mann-Whitney test) and is not significantly different from control performance (P = 0.22). In summary, one aspect of the spike history dependence of the STDP rule of Froemke and Dan (2002), presynaptic spike suppression, does in fact assist reinforcement learning under the conditions prevailing here (postsynaptic spike suppression is rarely engaged because the average firing rate among output neurons is roughly 1 Hz). However, it does so by mending an imbalance between LTP and LTD caused by combining anti-Hebbian plasticity with a conjunction of post-before-pre LTD and burst firing. The solution, to make only the first spike in a high-frequency presynaptic burst "count" in the induction of synaptic plasticity, is not specific to this particular form of history dependence. In the absence of post-before-pre LTD, this history dependence actually slightly impedes performance. In that sense, the particular form of history dependence used is not integral to the success of our model.
|
|
DISCUSSION |
|---|
|
These accomplishments do come with a list of requirements and restrictions. First, a form of activity-regulating homeostasis is needed to guarantee the presence of postsynaptic spikes, because STDP alone can do nothing without activity in both the presynaptic and postsynaptic cells. This can be achieved through the inclusion of known physiological processes: homeostatic regulation of either intrinsic excitability or, the choice we favored here, synaptic strength. Another relatively minor requirement is the exclusion of the form of multiplicative STDP examined here. Although this form of strength-dependent synaptic modification is a staple of the STDP modeling literature, it has relatively little experimental support, and it poses a dilemma between two unrealistic alternatives: either synaptic changes must be strongly biased toward LTP (gij << gmax) or the maximum achievable strength can be only about twice the starting strength (gij ≈ gmax). In addition, our model requires anti-Hebbian synaptic plasticity to ensure that unwanted spikes fired by the postsynaptic cell can always be removed. A more challenging requirement, the need for reward prediction, is discussed in the context of possible biological implementations of the model.
Although our mechanism for reinforcement learning works using the full STDP rule, it is disconcerting that the LTD portion of this rule contributes nothing to the model's success; indeed, it actually impedes performance and would do so disastrously were it not for the spike suppression mechanism built into the STDP rule we used. On the other hand, there is no reason why the two halves of the STDP rule (pre–post LTP and post–pre LTD) should serve the same functions. LTD induced by the recurrence of postsynaptic spikes preceding EPSPs may serve wholly distinct functions that are not represented in our model. There is growing evidence that the LTP and LTD portions of the STDP rule are mechanistically quite distinct, at least at some cortical synapses, with post–pre LTD using a different method for detecting coincident pre- and postsynaptic activity (involving postsynaptic endocannabinoid release and presynaptic NMDARs), possibly using different calcium sources for induction (internal stores instead of extracellular calcium admitted through postsynaptic NMDARs), and perhaps with a different site of expression (Bender et al. 2006; Nevian and Sakmann 2006; Sjöström et al. 2003, 2004). This makes it more likely that they can be regulated independently, and a recent study of hippocampal STDP was able to identify induction protocols that could engage these two forms of synaptic plasticity separately (Wittenberg and Wang 2006). We suggest that post–pre LTD might be suppressed in vivo during reinforcement learning.
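The distinction between the two halves of the rule can be made concrete with a minimal sketch of a pair-based update. It rests on assumptions made purely for illustration: a scalar reinforcement signal `reward` multiplies only the pre-before-post term (so a negative value yields anti-Hebbian depression), the post-before-pre term is left unmodulated, and a hypothetical flag `suppress_post_pre_ltd` switches that term off. The amplitudes and time constant are placeholder values, and this is not the rule defined in Methods.

```python
import math


def stdp_update(dt_ms, reward, a_plus=1.0, a_minus=0.5, tau_ms=20.0,
                suppress_post_pre_ltd=False):
    """Weight change for one spike pair; dt_ms = t_post - t_pre (assumed form)."""
    if dt_ms > 0:
        # Pre-before-post: the sign follows the reinforcement signal, so
        # reward < 0 turns potentiation into anti-Hebbian depression.
        return reward * a_plus * math.exp(-dt_ms / tau_ms)
    if dt_ms < 0 and not suppress_post_pre_ltd:
        # Post-before-pre: conventional LTD, unmodulated in this sketch.
        return -a_minus * math.exp(dt_ms / tau_ms)
    return 0.0


hebbian = stdp_update(10.0, reward=1.0)
anti_hebbian = stdp_update(10.0, reward=-1.0)
ltd_active = stdp_update(-10.0, reward=1.0)
ltd_suppressed = stdp_update(-10.0, reward=1.0, suppress_post_pre_ltd=True)
```

Treating the post-before-pre branch as a separately switchable term mirrors the suggestion that the two forms of plasticity could be regulated independently.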
One of the strengths of our model is the fact that it is based on an experimentally defined form of synaptic plasticity, but it does require additional conjectures concerning the modulation of that plasticity that are not experimentally established. A more parsimonious model that avoided these conjectures while retaining the ability to train a neural population to map its synaptic inputs into a wide range of possible outputs would be preferable. Legenstein et al. (2005) studied the learning capabilities of the unmodulated STDP rule and described a method whereby a network using this rule can learn a wide range of mappings from input patterns to output activity. A recently published model by Davison and Frégnac (2006) implements a version of this method to model the learning of coordinate transformations between different frames of reference, where the neural population being trained receives all-to-all inputs from an input layer, encoding untransformed coordinates, and topographic inputs from a "training layer" encoding the desired output. Although this model offers a plausible way to learn a coordinate transformation, it cannot supplant our model in the full range of learning tasks we consider. The method Legenstein et al. (2005) describe for learning arbitrary mappings and the Davison and Frégnac (2006) model are both effectively "instructive," since inputs from the training layer directly bias the output layer toward generating the desired output, whereas our training signal is based merely on the similarity between desired output and actual output. Furthermore, the fact that the projections from the training layer are topographic in the Davison and Frégnac (2006) model means that the evaluation signal is not global; local populations in their output layer receive individually tailored training signals. These conditions are reasonable for learning coordinate transformations, but are probably too demanding to hold for all forms of cortical learning.
Two other models have been published recently that are concerned with the marriage of STDP and reinforcement learning. One, proposed by Izhikevich (2007), posits the modulation of STDP by a reward signal mediated by dopamine. In this model, the relative timing of pre- and postsynaptic spikes generates a synaptic "eligibility trace" governed by the STDP rule, but synaptic changes are implemented only if dopamine is delivered before the eligibility trace decays. The Izhikevich (2007) model offers a solution to the problem of delayed reward, whereas we assume that this problem is solved elsewhere by a system that provides a reward prediction to the network in advance of the actual reward. On the other hand, Izhikevich (2007) considers a much more limited set of potential input patterns and desired output patterns. Because Izhikevich (2007) did not use input patterns with strong, long-range temporal correlations, he did not encounter the problems that required the use of anti-Hebbian plasticity coupled to a temporal difference learning signal. A second study, by Pfister et al. (2006), calculates the synaptic changes that increase the likelihood of obtaining a set of target output spike trains given the set of input spike trains, thereby deriving STDP-like learning rules. Pfister et al. (2006) note that if the problem is instead cast in the form of maximizing reward, a similar rule can be derived. However, the STDP rules derived by Pfister et al. (2006) are functions of the "desired" spike times of postsynaptic neurons, not their actual spike times. Although Pfister et al. (2006) provided considerable insight into why the STDP rule might take the form it does, they rely on more abstracted (and more analytically tractable) neural models than we do and do not explore the specific issue of reinforcement learning in great detail. Both Izhikevich (2007) and Pfister et al. (2006) offer valuable approaches to the problem of STDP and reinforcement learning, and both are complementary to our model.
Biological implementation of the model
The most challenging characteristic of our model with regard to credible implementation is probably the need for "reward prediction," i.e., the reinforcement signal must arrive at roughly the same time the activity it evaluates is being generated. This problem is not unique to our model and can be viewed as one specialized facet of the general "temporal credit assignment problem" all models of reinforcement learning face (Sutton and Barto 1998), but nonetheless is a major obstacle to the implementation of our model by a real neural system. The problem might be solved by giving the evaluation system that calculates the reinforcement signal access to the input activity that drives variations in output activity. The evaluation system could, in principle, use the pattern of input activity generated on a particular trial to predict whether the output activity on that trial will be a better or worse match to the target pattern than average, permitting the timely arrival of appropriate reinforcement. This would be an extraordinary feat of neural computation, but there is a neural population known to do something rather like it: the midbrain dopaminergic neurons. These neurons fire bursts in response to unexpected reward and to stimuli that predict reward; these neurons can also signal the absence of predicted reward through pauses in their spontaneous firing (reviewed in Schultz 1998). If these neurons predict reward based on internal factors, like an efference copy of noisy motor commands, as well as external stimuli, then they could potentially provide the kind of reward prediction required by our model.
Dopamine released by midbrain neurons could provide the reinforcement signal for our model, but dopaminergic innervation of the telencephalon is quite heterogeneous, and is most prominent outside of the isocortex, namely in the striatum, the input structure of the basal ganglia. Within the striatum, dopamine does modulate synaptic plasticity, and although the effects of dopamine are still poorly understood and vigorously debated, it may do so in a way consistent with the role of R in our model, with increased dopamine promoting LTP at corticostriatal synapses and decreased dopamine promoting LTD (Reynolds and Wickens 2002). This might make our model plausible within the striatum, but how could it apply to the isocortex, which receives far less dopaminergic input? We begin by asking how midbrain dopaminergic cells generate their reward-predicting responses. It seems unlikely that this complex calculation could be performed entirely by these neurons themselves, and of the various potential sources for this information, one of the best candidates is itself the major target of dopaminergic innervation: the basal ganglia. The basal ganglia receive input from virtually the entire cortex, and thus have access to the primary information needed to predict rewards. Many factors affect the activity of basal ganglia neurons, including of course sensory stimuli and motor plans, but these responses are often modulated by reward expectation (Arkadir et al. 2004; Hikosaka et al. 2006). It is not unreasonable to hypothesize that reward-predicting information can be found not just in midbrain dopaminergic neurons, but also in basal ganglia outputs that are relayed to the isocortex. In this way, almost the entire cortex could receive the reward-predicting information demanded by our model.
Basal ganglia output could conceivably reach the cortex via the GABAergic and cholinergic projections of the basal forebrain (Gritti et al. 1997), and acetylcholine has been reported to modulate cortical synaptic plasticity (Rasmusson 2000). However, the most obvious conduit of basal ganglia output to the cortex is the thalamus. That raises the question of how a glutamatergic thalamocortical projection could modulate corticocortical synaptic plasticity. In rats, the primary thalamic relay from basal ganglia to cortex is the ventromedial nucleus (Gerfen 1992; Gerfen et al. 1982; Kha et al. 2001), which projects to almost the entire cortical mantle, but specifically to layer 1 (Herkenham 1979). This is intriguing from the point of view of our model because layer 1 inputs to the dendritic tufts of pyramidal neurons can trigger dendritic calcium spikes, accompanied by bursts of sodium spikes, when combined with action potentials initiated in the soma (Larkum et al. 1999), and such calcium spikes could influence plastic changes induced at the corticocortical synapses that helped initiate the somatic spike. As we noted in our Methods section, and as emphasized in a recent review (Lisman and Spruston 2005), a number of studies report that low-frequency pairing of individual pre- and postsynaptic spikes does not suffice to induce synaptic plasticity. High-frequency pairing is evidently necessary to induce LTP at some cortical synapses (Markram et al. 1997), and this may reflect a requirement for sustained depolarization (Sjöström et al. 2001). Dendritic spikes triggered by layer 1 excitation may help meet this requirement, and a recent report indicates that the requirement for high-frequency pairing is waived when the postsynaptic neuron fires bursts rather than individual spikes (Nevian and Sakmann 2006). Another study of layer 5 pyramidal neurons found that EPSPs followed by single APs induced anti-Hebbian LTD, whereas EPSPs followed by high-frequency bursts (triggering large dendritic spikes) induced LTP (Letzkus et al. 2006). In our view, reports that the induction of synaptic plasticity requires more than is accounted for by the basic STDP rule do not necessarily undermine the STDP concept per se; rather, they indicate that STDP is modulated, that this modulation may even encompass the possibility of anti-Hebbian STDP, and that this modulation may be accomplished by a system that is capable of providing the reward-predicting reinforcement signal we require.
Reward-modulated STDP as a model for song learning in oscine birds
This speculative hypothesis would be more plausible if we could identify a specific example featuring a learned behavior with a known neural substrate to which our model might be applied. There is as yet no good example in mammals of a learned behavior whose specific cortical and subpallial substrates have been identified and characterized, but such an example does exist in songbirds. These birds must learn the songs they sing, and the neural substrate for this behavior consists of two well-described forebrain pathways: a "motor pathway" from HVC to the robust nucleus of the arcopallium (RA) required for singing per se, and an "anterior forebrain pathway" (AFP) that is required for song learning (for reviews, see Brainard 2004; Farries 2004; Fee et al. 2004). The AFP is hypothesized to evaluate the bird's vocal performance and transmit information to RA that enables learning, and it contains basal ganglia circuitry very similar to that of mammals (Farries and Perkel 2002; Farries et al. 2005). Thus the AFP could play the role of the "evaluation system" in our model, while HVC and RA correspond to the input and output layers, respectively. Furthermore, HVC projects to the AFP and supplies both auditory and premotor information (e.g., Doupe 1997; Hessler and Doupe 1999), giving the AFP the information it would need to predict performance from premotor activity. HVC neurons projecting to RA even fire in the one-burst pattern that works best for our model; these neurons fire a single high-frequency burst during a song motif (Hahnloser et al. 2002). For these reasons, the song system could be an ideal testing ground for our STDP-based model of reinforcement learning and its implementation via the basal ganglia.
Conversely, the song system does differ in certain critical ways from the basal ganglia-thalamocortical system we propose for mammals. First, feedback from the AFP reaches the motor pathway via a pallial (cortex-like) nucleus, the lateral magnocellular nucleus of the medial nidopallium (LMAN), rather than directly from the thalamus. Furthermore, the avian pallium is not organized into laminae; thus there is no "layer 1" to receive modulatory inputs. Even so, the LMAN-RA projection has an unusual property that could help it play the same functional role as the one we propose for VM's innervation of layer 1: the postsynaptic receptors at LMAN-RA synapses are almost exclusively NMDARs (Mooney and Konishi 1991; Stark and Perkel 1999). This fact has long been touted as a possible link between behavioral plasticity (dependent on LMAN) and synaptic plasticity, which in other systems depends on calcium influx through NMDARs. However, NMDARs are not just conduits for calcium; they are also dendritic voltage-gated ion channels whose availability is controlled extrinsically, by glutamate. As voltage-gated channels, NMDARs might help generate dendritic spikes in RA neurons, as they are known to do in mammalian cortical neurons (Schiller et al. 2000). We suggest that activity in LMAN, controlled by basal ganglia circuitry upstream in the AFP, could influence the occurrence of dendritic spikes in RA neurons, and thereby control the magnitude and polarity of plasticity induced at HVC-RA and intrinsic RA-RA synapses.
This perspective, wherein the AFP's primary role is to evaluate performance and modulate plasticity but not to directly influence behavior, is an old one in the songbird literature, supported by early lesion studies demonstrating that while the AFP is required for song learning, it is not required for singing in birds that have already learned their song (Bottjer et al. 1984; Scharff and Nottebohm 1991; Sohrabji et al. 1990). However, this view has been challenged recently by two observations. First, the AFP does in fact influence behavior; specifically, activity in LMAN (the output station of the AFP) contributes to song variability (Kao et al. 2005; Ölveczky et al. 2005). Second, LMAN activity recorded during singing does not appear to be influenced by auditory feedback (Leonardo 2004), as it should if LMAN is transmitting a signal derived from comparing the actual song to an auditory representation of the target song. But our model posits a reinforcement signal that is derived from a prediction of performance based on premotor activity; a direct auditory comparison of actual song to target song would arrive too late to be of service in our model. Thus our model is perfectly consistent with Leonardo's (2004) results. Of course, auditory feedback is necessary in the long run to establish and maintain the putative mapping between premotor activity and performance prediction, consistent with the known effects of deafening on the acquisition and maintenance of song (Konishi 1965; Nordeen and Nordeen 1992). As for the behavioral variability, Ölveczky et al. (2005) note that this is just as important for reinforcement learning as the evaluation of the variants, and suggest that the generation of variability may be the prime function of the AFP, with the evaluation performed elsewhere. Although AFP output undeniably enhances behavioral variability, it is possible that this is simply an epiphenomenon, a side effect that occurs as the AFP performs its primary task of modulating plasticity. On the other hand, there is no reason why the AFP could not serve both functions, helping to generate variants and evaluating them. Indeed, if the AFP is able to "predict" which variants will better match the tutor song, then it may well bias variation in a way that accelerates learning, a possibility also raised by Ölveczky et al. (2005). This may prove to be a line of convergence between the roles traditionally ascribed to the songbird AFP (evaluation of behavior) and to the mammalian basal ganglia (control of behavior).
Our model can claim two accomplishments hitherto rare in the modeling literature: 1) it proposes a mechanism for reinforcement learning employing known physiological phenomena with relatively modest modifications, and 2) it uses STDP, albeit in modified form, to achieve a general-purpose form of learning. Along the way, we identified a number of requirements that can also be regarded as predictions, i.e., things that must be true of any system that implements the model. The most prominent of these are 1) STDP can be modulated, 2) this modulation includes the possibility of anti-Hebbian STDP, 3) this modulation is "predictive" in the sense discussed above, and 4) STDP is not multiplicative in the sense of Rubin et al. (2001). This list is necessarily incomplete; there are many things that could impact this model's performance that were not examined. Future studies will have to evaluate the effects of such things as recurrent excitatory connections, nonrandom patterns of connectivity, inhibitory networks, and intrinsic physiological properties. Even with the model as it stands, some important questions remain unanswered, including the number of input-output pairs that can be "stored" in these networks, the factors that control this capacity, and the extent to which these networks can learn continuous mappings between input and output (as opposed to a list of discrete input–output pattern pairs). Independent of particular details of implementation, we hope that this model can serve as a starting point from which we can understand how neural systems learn to generate appropriate responses to the inputs they receive.
|
|
FOOTNOTES |
|---|
1 The online version of this article contains supplemental data.
Address for reprint requests and other correspondence: M. A. Farries, Dept. of Biology, Univ. of Texas at San Antonio, One UTSA Circle, San Antonio, TX 78249 (E-mail: michael.farries{at}utsa.edu)
|
|
REFERENCES |
|---|
|
Andersen P, Sundberg SH, Sveen O, Wigström H. Specific long-lasting potentiation of synaptic transmission in hippocampal slices. Nature 266: 736–737, 1977.
Arkadir D, Morris G, Vaadia E, Bergman H. Independent coding of movement direction and reward prediction by single pallidal neurons. J Neurosci 24: 10047–10056, 2004.
Bell CC, Han VZ, Sugawara Y, Grant K. Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature 387: 278–281, 1997.
Bender VA, Bender KJ, Brasier DJ, Feldman DE. Two coincidence detectors for spike timing-dependent plasticity in somatosensory cortex. J Neurosci 26: 4166–4177, 2006.
Bi G-q, Poo M-m. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J Neurosci 18: 10464–10472, 1998.
Blum KI, Abbott LF. A model of spatial map formation in the hippocampus of the rat. Neural Comput 8: 85–93, 1996.
Bottjer SW, Miesner EA, Arnold AP. Forebrain lesions disrupt development but not maintenance of song in passerine birds. Science 224: 901–903, 1984.
Brainard MS. Contributions of the anterior forebrain pathway to vocal plasticity. Ann NY Acad Sci 1016: 377–394, 2004.
Cho K, Aggleton JP, Brown MW, Bashir ZI. An experimental test of the role of postsynaptic calcium levels in determining synaptic strength using perirhinal cortex of rat. J Physiol 532: 459–466, 2001.
Cormier RJ, Greenwood AC, Connor JA. Bidirectional synaptic plasticity correlated with the magnitude of dendritic calcium transients above a threshold. J Neurophysiol 85: 399–406, 2001.
Cummings JA, Mulkey RM, Nicoll RA, Malenka RC. Ca2+ signaling requirements for long-term depression in the hippocampus. Neuron 16: 825–833, 1996.
Davison AP, Frégnac Y. Learning cross-modal spatial transformations through spike timing-dependent plasticity. J Neurosci 26: 5604–5615, 2006.
de Ruyter van Steveninck RR, Bialek W. Real-time performance of a movement-sensitive neuron in the blowfly visual system: coding and information transfer in short spike sequences. Proc R Soc Lond B 234: 379–414, 1988.
Debanne D, Gähwiler BH, Thompson SM. Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J Physiol 507: 237–247, 1998.
Desai NS, Rutherford LC, Turrigiano GG. Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nat Neurosci 2: 515–520, 1999.
Doupe AJ. Song- and order-selective neurons in the songbird anterior forebrain and their emergence during vocal development. J Neurosci 17: 1147–1167, 1997.
Farries MA. The avian song system in comparative perspective. Ann NY Acad Sci 1016: 61–76, 2004.
Farries MA, Ding L, Perkel DJ. Evidence for "direct" and "indirect" pathways through the song system basal ganglia. J Comp Neurol 484: 93–104, 2005.
Farries MA, Perkel DJ. A telencephalic nucleus essential for song learning contains neurons with physiological characteristics of both striatum and globus pallidus. J Neurosci 22: 3776–3787, 2002.
Fee MS, Kozhevnikov AA, Hahnloser RHR. Neural mechanisms of vocal sequence generation in the songbird. Ann NY Acad Sci 1016: 153–170, 2004.
Feldman DE. Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron 27: 45–56, 2000.
Froemke RC, Dan Y. Spike-timing-dependent synaptic modification induced by natural spike trains. Nature 416: 433–438, 2002.
Froemke RC, Poo M-m, Dan Y. Spike-timing-dependent synaptic plasticity depends on dendritic location. Nature 434: 221–225, 2005.
Froemke RC, Tsay IA, Raad M, Long JD, Dan Y. Contribution of individual spikes in burst-induced long-term synaptic modification. J Neurophysiol 95: 1620–1629, 2006.
Gerfen CR. The neostriatal mosaic: multiple levels of compartmental organization in the basal ganglia. Annu Rev Neurosci 15: 285–320, 1992.
Gerfen CR, Staines WA, Arbuthnott GW, Fibiger HC. Crossed connections of the substantia nigra in the rat. J Comp Neurol 207: 283–303, 1982.
Gerstner W, Kempter R, van Hemmen JL, Wagner H. A neuronal learning rule for sub-millisecond temporal coding. Nature 383: 76–78, 1996.
Gritti I, Mainville L, Mancia M, Jones BE. GABAergic and other noncholinergic basal forebrain neurons, together with cholinergic neurons, project to the mesocortex and isocortex in the rat. J Comp Neurol 383: 163–177, 1997.
Gütig R, Aharonov R, Rotter S, Sompolinsky H. Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J Neurosci 23: 3697–3714, 2003.
Hahnloser RHR, Kozhevnikov AA, Fee MS. An ultra-sparse code underlies the generation of neural sequences in a songbird. Nature 419: 65–69, 2002.
Herkenham M. The afferent and efferent connections of the ventromedial thalamic nucleus in the rat. J Comp Neurol 183: 487–518, 1979.
Hessler NA, Doupe AJ. Singing-related neural activity in a dorsal forebrain-basal ganglia circuit of adult zebra finches. J Neurosci 19: 10461–10481, 1999.
Hikosaka O, Nakamura K, Nakahara H. Basal ganglia orient eyes to reward. J Neurophysiol 95: 567–584, 2006.
Ismailov I, Kalikulov D, Inoue T, Friedlander MJ. The kinetic profile of intracellular calcium predicts long-term potentiation and long-term depression. J Neurosci 24: 9847–9861, 2004.
Izhikevich EM. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb Cortex 17: 2443–2452, 2007.
Kao MH, Doupe AJ, Brainard MS. Contributions of an avian basal ganglia-forebrain circuit to real-time modulation of song. Nature 433: 638–643, 2005.
Kempter R, Gerstner W, van Hemmen JL. Hebbian learning and spiking neurons. Phys Rev E 59: 4498–4514, 1999.
Kempter R, Gerstner W, van Hemmen JL. Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput 13: 2709–2741, 2001.
Kepecs A, van Rossum MCW, Song S, Tegnér J. Spike-timing-dependent plasticity: common themes and divergent vistas. Biol Cybern 87: 446–458, 2002.
Kha HT, Finkelstein DI, Tomas D, Drago J, Pow DV, Horne MK. Projections from the substantia nigra pars reticulata to the motor thalamus of the rat: single axon reconstructions and immunohistochemical study. J Comp Neurol 440: 20–30, 2001.
Köles L, Wirkner K, Illes P. Modulation of ionotropic glutamate receptor channels. Neurochem Res 26: 925–932, 2001.
Konishi M. The role of auditory feedback in the control of vocalization in the white-crowned sparrow. Z Tierpsychol 22: 770–783, 1965.
Larkum ME, Zhu JJ, Sakmann B. A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature 398: 338–341, 1999.
Legenstein R, Naeger C, Maass W. What can a neuron learn with spike-timing-dependent plasticity? Neural Comput 17: 2337–2382, 2005.
Leonardo A. Experimental test of the birdsong error-correction model. Proc Natl Acad Sci USA 101: 16935–16940, 2004.
Letzkus JJ, Kampa BM, Stuart GJ. Learning rules for spike timing-dependent plasticity depend on dendritic synapse location. J Neurosci 26: 10420–10429, 2006.
Lisman JE, Spruston N. Postsynaptic depolarization requirements for LTP and LTD: critique of spike timing-dependent plasticity. Nat Neurosci 8: 839–841, 2005.
MacDonald JF, Xiong X-G, Lu W-Y, Raouf R, Orser BA. Modulation of NMDA receptors. Prog Brain Res 116: 191–208, 1998.
Magee JC, Johnston D. A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science 275: 209–212, 1997.
Markram H, Lübke J, Frotscher M, Sakmann B. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275: 213–215, 1997.
Mooney R, Konishi M. Two distinct inputs to an avian song nucleus activate different glutamate receptor subtypes on individual neurons. Proc Natl Acad Sci USA 88: 4075–4079, 1991.
Nevian T, Sakmann B. Spine Ca2+ signaling in spike-timing-dependent plasticity. J Neurosci 26: 11001–11013, 2006.
Nishiyama M, Hong K, Mikoshiba K, Poo M-m, Kato K. Calcium stores regulate the polarity and input specificity of synaptic modification. Nature 408: 584–588, 2000.
Nordeen KW, Nordeen EJ. Auditory feedback is necessary for the maintenance of stereotyped song in adult zebra finches. Behav Neural Biol 57: 58–66, 1992.
Ölveczky BP, Andalman AS, Fee MS. Vocal experimentation in the juvenile songbird requires a basal ganglia circuit. PLoS Biol 3: 902–909, 2005.
Pfister J-P, Toyoizumi T, Barber D, Gerstner W. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Comput 18: 1318–1348, 2006.
Pike FG, Meredith RM, Olding AWA, Paulsen O. Postsynaptic bursting is essential for Hebbian induction of associative long-term potentiation at excitatory synapses in rat hippocampus. J Physiol 518: 571–576, 1999.
Rao RPN, Sejnowski TJ. Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comput 13: 2221–2237, 2001.
Rasmusson DD. The role of acetylcholine in cortical synaptic plasticity. Behav Brain Res 115: 205–218, 2000.
Reinagel P, Reid RC. Temporal coding of visual information in the thalamus. J Neurosci 20: 5392–5400, 2000.
Reynolds JNJ, Wickens JR. Dopamine-dependent plasticity of corticostriatal synapses. Neural Netw 15: 507–521, 2002.
Roberts PD. Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J Comput Neurosci 7: 235–246, 1999.
Rubin J, Lee DD, Sompolinsky H. Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys Rev Lett 86: 364–367, 2001.
Scharff C, Nottebohm F. A comparative study of the behavior deficits following lesions of various parts of the zebra finch song system: implications for vocal learning. J Neurosci 11: 2896–2913, 1991.
Schiller J, Major G, Koester HJ, Schiller Y. NMDA spikes in basal dendrites of cortical pyramidal neurons. Nature 404: 285–289, 2000.
Schultz W. Predictive reward signal of dopamine neurons. J Neurophysiol 80: 1–27, 1998.
Seung HS. Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron 40: 1063–1073, 2003.
Sjöström PJ, Häusser MA. A cooperative switch determines the sign of synaptic plasticity in distal dendrites of neocortical pyramidal neurons. Neuron 51: 227–238, 2006.
Sjöström PJ, Turrigiano GG, Nelson SB. Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron 32: 1149–1164, 2001.
Sjöström PJ, Turrigiano GG, Nelson SB. Neocortical LTD via coincident activation of presynaptic NMDA and cannabinoid receptors. Neuron 39: 641–654, 2003.
Sjöström PJ, Turrigiano GG, Nelson SB. Endocannabinoid-dependent neocortical layer-5 LTD in the absence of postsynaptic spiking. J Neurophysiol 92: 3338–3343, 2004.
Sohrabji F, Nordeen EJ, Nordeen KW. Selective impairment of song learning following lesions of a forebrain nucleus in juvenile zebra finches. Behav Neural Biol 53: 51–63, 1990.
Song S, Abbott LF. Cortical development and remapping through spike timing-dependent plasticity. Neuron 32: 339–350, 2001.
Song S, Miller KD, Abbott LF. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat Neurosci 3: 919–926, 2000.
Stark LL, Perkel DJ. Two-stage, input-specific synaptic maturation in a nucleus essential for vocal production in the zebra finch. J Neurosci 19: 9107–9116, 1999.
Suri RE, Sejnowski TJ. Spike propagation synchronized by temporally asymmetric Hebbian learning. Biol Cybern 87: 440–445, 2002.
Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
Tegnér J, Kepecs A. An adaptive spike-timing-dependent plasticity rule. Neurocomputing 44–46: 189–194, 2002.
Turrigiano GG, Leslie KR, Desai NS, Rutherford LC, Nelson SB. Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature 391: 892–896, 1998.
Tzounopoulos T, Kim Y, Oertel D, Trussell LO. Cell-specific, spike timing-dependent plasticities in the dorsal cochlear nucleus. Nat Neurosci 7: 719–725, 2004.
van Rossum MCW, Bi G-q, Turrigiano GG. Stable Hebbian learning from spike timing-dependent plasticity. J Neurosci 20: 8812–8821, 2000.
Wang H-X, Gerkin RC, Nauen DW, Bi G-q. Coactivation and timing-dependent integration of synaptic potentiation and depression. Nat Neurosci 8: 187–193, 2005.
Wittenberg GM, Wang SS-H. Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. J Neurosci 26: 6610–6617, 2006.
Xie X, Seung HS. Learning in neural networks by reinforcement of irregular spiking. Phys Rev E 69: 041909, 2004.
Yang S-N, Tang Y-G, Zucker RS. Selective induction of LTP and LTD by postsynaptic [Ca2+]i elevation. J Neurophysiol 81: 781–787, 1999.