|
|
||||||||
INVITED REVIEW
1Department of Statistics and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, Pennsylvania; and 2Neuroscience Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, and the Division of Health Sciences and Technology, Harvard Medical School, Massachusetts Institute of Technology, Boston, Massachusetts
Submitted 25 June 2004; accepted in final form 19 February 2005
ABSTRACT
Analysis of data from neurophysiological investigations can be challenging. Particularly when experiments involve dynamics of neuronal response, scientific inference can become subtle and some statistical methods may make much more efficient use of the data than others. This article reviews well-established statistical principles, which provide useful guidance, and argues that good statistical practice can substantially enhance results. Recent work on estimation of firing rate, population coding, and time-varying correlation provides improvements in experimental sensitivity equivalent to large increases in the number of neurons examined. Modern nonparametric methods are applicable to data from repeated trials. Many within-trial analyses based on a Poisson assumption can be extended to non-Poisson data. New methods have made it possible to track changes in receptive fields, and to study trial-to-trial variation, with modest amounts of data.
INTRODUCTION
Technical advances have made available new methods for collecting, storing, and manipulating electrophysiological data. Investigations may now not only characterize neuronal activity in anatomically well defined regions, but they can also examine dynamics of neuronal response and their relationship to behavior. Although elementary methods of data analysis [such as t-tests or visual examination of the peristimulus time histogram (PSTH)] remain useful for many purposes, the growing complexity of neuroscientific experiments, often examining subtle changes on a comparatively fine timescale, requires careful attention to statistical methods for data analysis. In this overview we discuss some of the fundamental data analytical issues that face researchers in neurophysiology, illustrating the general points with the problems of describing the evolution of a neuron's firing rate across time, finding accurate population codes, and assessing time-varying correlation between 2 neurons. In each case recent work has provided a statistical technique that outperforms previous methodology, boosting the scientific information as effectively as if the number of experimental trials, or the number of neurons, had been increased by a substantial factor. We also indicate some of the ways modern statistical procedures can accommodate important complexities, such as dynamic changes in temporal and spatial aspects of hippocampal place cell firing and trial-to-trial variability in cortical neurons recorded from behaving animals. Our review supplements the brief and general guidance offered by Curran-Everett and Benos (2004)
, and may be regarded as an update to the early work of Perkel et al. (1967a
,b
).
The new field of computational neuroscience uses detailed biophysical models and artificial neural networks to study emergent behavior of neural systems and the way neural systems represent and transmit information (e.g., Dayan and Abbott 2001
). Statistical methods have become an essential complement: in addition to their ubiquitous role in summarizing experimental data, they provide estimates of biologically relevant parameters, assessments of uncertainty about them, and a formalism for evaluating the fit of theoretical predictions to observed data. The statistical paradigm begins with informal investigation of the data, a process that has been named exploratory analysis (Tukey 1977
). Exploratory results, together with judgment based on experience, help guide construction of an initial probability model to represent variability in observed data. Typically, this model posits some underlying, and unobserved, regularity that is coupled with known or unknown sources of irregularity. Every such model, and every statistical method, makes some assumptions. These lead to a reduction of the data to some typically small number of interpretable quantities. The data may be used, again, to check the probabilistic assumptions, and to consider ramifications of departures from them. Should serious departures from the assumptions be found, a new model may be formed. Figure 1 attempts to highlight the iterative nature of probability modeling and model assessment, followed by statistical inference, all embedded into the production of scientific conclusions from experimental results (Box et al. 1978
). We demonstrate this process by analyzing data from a hippocampal "place" cell, using a collection of techniques reviewed in subsequent sections.
|
INFORMATION AND STATISTICAL1 EFFICIENCY
Spike trains recorded in vivo tend to be irregular both within and across trials. To capture the available information about stimulus or behavior it is helpful to formulate a reasonably accurate probability model for the noise. In probability theory stochastic sequences of event times are called point processes, and the simplest point process is the Poisson process. A basic property is that counts of Poisson process events follow Poisson distributions. Thus, the simplest assumption one might make about a spike train is that it may be described as a Poisson process, and then the resulting spike counts (the number of spikes in particular windows of time) would follow Poisson distributions. In this section we use Poisson spike counts for pedagogical purposes to discuss some fundamental statistical concepts.
We do not want to give the impression, however, that Poisson models are to be trusted without critical examination. Indeed, there are reasons to think it unlikely in principle that spike trains should follow Poisson processes. Assuming large numbers of excitatory and inhibitory inputs, in the time-homogeneous case (meaning that the inputs do not vary systematically across time, as they would with a time-varying stimulus), Gerstein and Mandlebrot (1964)
showed that a simple model of voltage variation with a fixed spiking threshold leads to interspike intervals that follow an inverse Gaussian distribution. (See also Tuckwell 1988
.) Spike trains that follow this model would be non-Poisson. There has been considerable documentation and discussion of the variability of spike trains and its sources (an early study is Smith and Smith 1965
; see Shadlen and Newsome 1998
, and the references therein). However, whether a particular set of spike train data should be considered Poisson is an empirical matter, subject to statistical examination. Later we discuss methods that either remove the Poisson assumption in treating counts or substitute a more general conception of an eventtime process.
A parametric statistical model is a probability model with a fixed set of parameters that may be estimated from the data
For illustrative purposes we consider the analysis of count data that are assumed to follow a Poisson distribution. If µ is the mean number of times that a neuron fires during some specified time interval (a, b), then the probability of getting y spikes during that interval is
![]() | (1) |
of spikes per trial during the interval (a, b). We call
an estimator of µ. It is not only intuitive, but is also an example of a very general and generally very good method of estimation called maximum likelihood (ML), to which we will return shortly. We will, throughout, follow the standard statistical convention of using
to denote a generic estimator of µ, but often this estimator is obtained by ML. Evaluation of an estimation procedure has four components
Although the sample mean is an intuitive estimator of the Poisson mean µ, one might dream up alternatives. For example, a property of the Poisson distribution is that its variance is also equal to µ; therefore, the sample variance s2 = 1/(n 1)
i (yi
)2 could also be used to estimate the population variance µ. This may seem odd, and potentially inferior, on intuitive grounds because the whole point is to estimate the mean firing rate, not its variance. On the other hand, once we take the Poisson model seriously the population mean and variance become equal and, from a statistical perspective, it is reasonable to ask whether it is better to estimate one rather than the other from their sample analogues. Our purpose here is to present a simple analysis that demonstrates the inferiority of the sample variance to the sample mean as an estimator of the Poisson mean µ. We are going through this exercise so that we can draw an analogy to it later on.
A statistical estimator is itself subject to random variation: it is computed from a sample of data, and a new sample (a new set of n trials) would produce different data and therefore a different value of the estimator. To study the variation of an estimator we may calculate its expectation and variance, both obtained, theoretically, across repeated sets of n trials. The expectation of both
and s2 is µ; on average, in repeated sets of trials, the positive errors tend to cancel the negative errors (they are both unbiased). But how accurate is s2 compared to
? Analytical calculation of the variance of each estimator gives2
![]() |
![]() |
and it is, in this sense, less accurate: s2 tends to be further from the correct value of µ than
. For example, if we take n = 100 trials and µ = 10, we find V (
) = 0.10, whereas V(s2) = 2.12. The estimator s2 has about 21 times the variability of
, so that estimating µ using s2 would require about 2,100 trials of data to gain the same accuracy as using
with 100 trials. See Figure 2.
|
versus s2 should consider their behavior also under alternative assumptions. In this regard, the sample mean remains a reasonably good estimator of the population mean in large samples regardless of the probability distribution of the spike counts: the sample mean gets arbitrarily close to the population mean for sufficiently large samples (it is a consistent estimator). The sample variance, on the other hand, does so only if the population variance is truly equal to the population mean; otherwise, as the sample size increases it will converge to the wrong value (it is, in that case, an inconsistent estimator). We return to this issue below.
We now consider the theoretical framework that allows us to generalize these kinds of results to more complicated situations. We can list 4 components of formal statistical evaluation of an estimator: 1) a quantity to be estimated, which is a numerical characteristic (in mathematical jargon, a functional) of a probability distribution; 2) an estimator; 3) a criterion according to which the accuracy of the estimator will be judged; and 4) a set of assumptions about the (unknown) probability distribution, which allow us to carry out the evaluation of the estimator. In the Poisson case, 1) µ, the average firing rate in our interval of time, was the quantity being estimated; 2)
and s2 were the 2 estimators being evaluated; 3) we considered the variance as our criterion; and 4) we assumed a Poisson distribution, with the trials being probabilistically identical and independent.
Concerning item 3), in general terms, a simple and illuminating criterion is mean squared error (MSE), which is computed by squaring the error and then averaging across infinitely many hypothetical replications of the data. Formally, the MSE of an estimator
is written as
![]() |
![]() |
) = E(
) µ represents the average amount by which the estimator differs from µ. This formula is the basis for what is called the "BiasVariance trade-off" in comparing alternative methods. It often happens that one method has comparatively high bias, whereas the other has comparatively high variance; MSE takes both into consideration. In the case of
and s2, the bias of both is zero, so the MSE is actually equal to the variance. Thus, the criterion we were applying was the MSE. MSE is commonly used in studies of alternative estimation methods. Fisher information measures the optimal precision with which a parameter may be estimated from data
In a 1922 paper that laid the groundwork for much of the statistical theory developed in the 20th century, R. A. Fisher analyzed the general form of the estimation problem. Fisher observed that estimators often may be approximated by the sample mean of some suitably defined random variables so that, according to the central limit theorem, they will tend to be approximately normally distributed (Gaussian) for large sample sizes. Figure 2 shows a pair of histograms of
and s2 values calculated from 1,000 randomly generated samples of size n = 100 when the true Poisson mean was µ = 10. The asymptotic normality of the 2 estimators is indicated by the approximately normal shape of their histograms. (Much of statistical theory is asymptotic in the sense that it considers behavior when the sample size becomes arbitrarily large.) In the case of
and s2, both estimators are also centered at the true value of µ. In general, for large samples, the bias of any reasonable estimator will be small, so that it is at least centered close to the true value of the quantity it is estimating. Fisher considered the class of all estimators that were asymptotically normal, with the correct value as the asymptotic mean and asked what was the smallest possible variance. This is equivalent to asking what is the minimal MSE in large samples.
The answer is that the smallest possible variance is the reciprocal of what is now called Fisher information. Put differently, the Fisher information gives the best possible precision of any asymptotically "good" estimator (where precision is the reciprocal of variance). The statistical efficiency of an estimator is judged by its variance, or asymptotic variance, relative to the bound determined by Fisher information. If an estimator attains the bound it is said to be efficient. Fisher described efficient estimators by saying they contain the maximal amount of information supplied by the data about the value of a parameter. That is, the information in the data pertaining to the parameter value may be used well (or poorly) to make an estimator more (or less) accurate; in using as much information about the parameter as is possible, an efficient estimator uses the data most efficiently and reduces to a minimum the uncertainty attached to it.
For the Poisson model, the Fisher information about µ in n observations is
![]() | (2) |
so it is an efficient estimator. On the other hand, use of s2 rather than
to estimate µ effectively discards 95% of the information in the data. In general, the Fisher information can be computed from the probability model, or more precisely, the likelihood, which we introduce in the next subsection. Maximum likelihood estimators are efficient
In addition to deriving the information bound, Fisher showed that the method of maximum likelihood (ML) produces efficient estimators, i.e., as explained above, this means that an ML estimator (MLE) automatically has the smallest possible uncertainty, for large samples.
The MLE is the value of the parameter that maximizes the likelihood function. To explain what the likelihood function is, and why one might want to maximize it, let us consider the Poisson distribution. For any particular value of µ, Eq. 1 gives the probability of observing y spikes during the interval (a, b) for a single trial. Table 1 displays the value of p(y|µ) for several values of y and µ. The argument of the probability density, that is, the quantity that is varying in Eq. 1, is the spike count y. Because Eq. 1 is a probability density, the sum over all possible values of y is 1. In Table 1 this corresponds to summing across rows. The likelihood function reverses what is held fixed and what is varying: it applies p(y|µ) with the data y held fixed and the unknown parameter µ allowed to vary. Reasoning based on the likelihood function continues along this reversed, or "inverse," path: having observed a spike count y, we would find it implausible to think that the correct value of µ was one that would make y an improbable event. Instead, the MLE maximizes the likelihood function, producing the parameter value µ that makes y the most probable to have occurred.
|
![]() | (3) |
![]() |
In the Poisson case, an easy application of calculus shows that the MLE of µ is
. Earlier we showed that, for estimating the Poisson parameter µ, the mean is much better than the variance, and then we stated that the mean is efficient in the sense of having the smallest possible MSE in large samples. We see here that the efficiency of
may be considered a special case of the general result that the MLE is efficient.
Bayes estimators are also optimal in large samples
Closely related to ML estimation, and more powerful in some circumstances, is Bayesian inference. Bayesian inference is based on Bayes' theorem, which is an elementary formula for computing conditional probabilities of the form P(A|B) from probabilities of the form P(B|A). The profound implication of Bayes' theorem for statistical inference becomes apparent when A signifies an unknown parameter and B signifies the available data: having observed some data y, Bayes' theorem allows us to express our uncertain knowledge about the parameter µ quantitatively, in the form of a probability distribution p(µ|y). The inputs to Bayes' theorem are the likelihood function p(y|µ) and a prior distribution p(µ) that represents a priori knowledge about µ. To contrast with the prior distribution, which represents knowledge that logically precedes the data, the distribution p(µ|y) produced using the data is called the posterior. The usual Bayes estimator is the mean of this posterior distribution.
In the Poisson case, Bayes' theorem provides p(µ|y) in terms of the formula for p(y|µ) given by Eq. 1. Specifically, according to Bayes' theorem, the probability of any set of values of µ is determined by the equation
![]() | (4) |
It is easy to obtain standard errors for maximum likelihood and Bayesian estimation
Earlier we noted that Fisher information is reciprocal of the minimal possible variance of any "good" estimator, and that ML and Bayesian estimators achieve this bound. It follows that Fisher information may be used to obtain SEs for ML and Bayesian estimators. In practice, it is convenient to substitute the observed information instead. In the Poisson case we replace µ with
=
in Eq. 2 to obtain
![]() | (5) |
![]() | (6) |
![]() | (7) |
Likelihood ratio tests have good statistical power
To illustrate statistical considerations concerning estimation procedures, earlier we discussed estimation of the Poisson mean µ based on n repeated trials. Now suppose, under the same circumstances, we wish to test the null hypothesis H0 : µ = µ0. For example, µ0 could be the baseline firing rate of a neuron and we may wish to demonstrate that some stimulus increases (or decreases) this rate, so that the alternative hypothesis HA : µ
µ0 would hold. In analyzing this situation we will follow a standard statistical convention, which we have ignored up to this point: capital and lowercase versions of a letter are used to distinguish a random variable (
) from a particular value (
) taken by the random variable.
The obvious procedure to assess whether H0 holds would be to examine
and if it is sufficiently far from µ0, reject H0. That is, we would define a quantity c and reject H0 whenever |
µ0| > c. To determine c we usually require the probability of falsely rejecting H0 to be small: under H0 (that is, assuming H0 were true), we would obtain c such that Prob (reject H0|H0 true) = Prob (|
= µ0| > c) =
, for a suitably small
, assuming each trial's spike count Y has a Poisson distribution with µ = µ0. The usual criterion is to take
= 0.05. This is called the size of the hypothesis test, or the probability of a type I error. When using data to assess H0 a good practice is to report the P-value, which is the smallest value of
according to which the observed data would reject H0. For example, one might report P = 0.02 rather than simply P < 0.05.
To compute Prob (|
µ0| > c) one could rely on the approximate normality of the sample mean, which is likely to be adequately accurate unless µ0 and n are both small. Specifically, under H0, because each Y has mean µ0 and variance µ0, it follows that
has mean µ0 and variance µ0/n; therefore the calculation of Prob (|
µ0| > c) may be based on a normal distribution with mean 0 and variance µ0/n, so that
when
= 0.05.
Alternatively, if there are doubts about the adequacy of this approximation, Prob (|
µ0| > c) may be obtained (even for small n and µ0) by computer simulation. We simulate samples U1, U2, ..., Un, where each U1 has a Poisson distribution with mean µ0 and for each sample compute the sample mean
. If we repeat this procedure a large number of times to obtain a large number of samples (say, 10,000) we can then approximate Prob (|
µ0| > c) accurately by the proportion of samples for which |
u0| > c. This is an example of a parametric Bootstrap test. We discuss the nonparametric Bootstrap in the next section. Simulation-based calculation of P-values is an important technique because it applies to situations where normal approximations (or other, similar approximations) are either unavailable or of dubious validity.
In complicated settings there may not be an obvious statistical test, or there may be several (as in the hypothetical example of alternative estimators for the Poisson mean µ used above), or one may want some reassurance that, as with ML estimation, the method is a good one. The general approach based on the likelihood function is to define the ratio of the likelihood under H0 to that under HA and to reject H0 whenever this ratio is sufficiently small. This procedure is called the likelihood ratio test. Thus, to test H0 : µ = µ0 versus HA : µ = µ1 we would use LR = L(µ0)/L(µ1) and reject H0 when LR < c, where c is chosen so that Prob (LR < c) =
when H0 is true. Here, µ1 would be some prespecified alternative value of the Poisson mean. Usually, however, no such specific alternative to µ0 is apparent and the more generic alternative µ
µ0 is used. In this case the likelihood ratio L(µ0)/L(
) is used, where
is the MLE. For the Poisson case H0 : µ = µ0, it turns out that the likelihood ratio test yields the "obvious" procedure based on
discussed above. The likelihood ratio test produces, similarly, many familiar statistical tests (such as t-tests and F-tests).
When 2 distinct procedures are available for testing the same null hypothesis they may be compared by first aligning them to have the same size (i.e., the same value of
) and then computing their power, which is the probability of correctly rejecting H0 for particular values of the parameter that satisfy HA. Extensive theory and simulation studies have shown that the likelihood ratio test tends to have good power, often nearly or exactly the best possible power, in a wide variety of situations, at least when sample sizes are reasonably large (e.g., van der Vaart 1998
). It is therefore the most commonly applied statistical testing procedure when there are explicit parametric probability models under both hypotheses. Bayesian methods of hypothesis testing are beyond the scope of our review here, but they have received considerable attention in recent years and the interested reader should consult Kass and Raftery (1995)
; an application to neuronal data analysis is in Behseta and Kass (2005)
.
Applicability of a probability model to a data set should be assessed
Previous subsections have been concerned with fundamental principles of statistical inference. In our introductory discussion, however, we emphasized the inductive process diagrammed in Fig. 1 and, according to that figure, statistical inferences should occur only after determination that the probability model on which the inferences will be based is adequate. A variety of procedures have been developed to assess the adequacy of a probability model in representing the regularity and variability in a particular set of data. One method is to consider a more elaborate model and conduct a likelihood ratio test to choose between the simpler model and the more elaborate model. A second approach is to compare suitable features of the data to predictions made by the model, and a general procedure for doing so is to examine a set of functions of the data that would, if the probability model were correct, constitute a sample from a particular probability distribution, say F(y); when the function values are ordered and plotted against theoretical quantiles of the putative probability distribution F(y), the result, called a QQ (for quantilequantile) plot, should be roughly linear (e.g., Hogg and Tanis 2001
). Both of these methods are discussed later in POISSON AND NON-POISSON MODELS. In addition, chi-squared goodness-of-fit statistics may be used to evaluate fit when applicable (e.g., Sokal and Rohlf 1995
).
NONPARAMETRIC METHODS
In the previous section we discussed parametric statistical methods, first considering estimation and then describing hypothesis testing based on the likelihood ratio test. In the following subsection we review very briefly the purpose of nonparametric methods, and a bit of the terminology, so that readers will be equipped to tackle the literature. We next describe a very general approach to assessing uncertainty, called the Bootstrap, illustrating it by revisiting the simple hypothesis testing problem posed earlier.
Modern nonparametric methods allow greater flexibility in probability models at the cost of some loss of efficiency
The Poisson distribution uses a single parameter µ. Other distributions or models, such as linear regression models, typically depend on a small number of parameters that must be estimated from (or "fitted to") the data at hand. It is important to consider carefully what might happen if the probability model were to be incorrect in some plausible way. For example, as we already mentioned, in the case of estimating µ, the sample mean
remains consistent (and the MSE becomes arbitrarily small with sufficiently much data) for non-Poisson data, whereas s2 will estimate the population variance, which may differ from the population mean. Often a restrictive model, such as the Poisson, is replaced with a more flexible model. Highly flexible models are often called "nonparametric."
Classical nonparametric methods substitute transformed versions of the data (such as ranks), to obtain analogues of standard elementary methods (such as ANOVA). Modern nonparametric methods use models with large numbers of parameters, and their theoretical analysis often assumes there are infinitely many parameters. In some contexts, the parameters are used to characterize a probability distribution, as in the Bootstrap, which we discuss next. Sometimes, large numbers of parameters are used to replace linear relationships with nonlinear relationships whose form is not picked a priori but rather is determined from the data. For example, curve fitting is often accomplished with very flexible models (having large numbers of adjustable parameters), a process statisticians usually call nonparametric regression. We discuss the application of generalized nonparametric regression to neuronal data in the next section, where we view the PSTH as a nonparametric estimator of the firing rate function and contrast it with more efficient alternatives.
Nonparametric methods apply more generally than parametric methods. This generality comes at a cost: when the assumptions of a parametric model hold (or hold to a close approximation), they will be more efficient than nonparametric competitors. In deciding whether to apply a parametric or a nonparametric method, therefore, one must consider the likelihood of a substantial departure from the parametric assumptions. The notion of "substantial departure" will depend on the context, and some parametric methods are more likely than others to perform poorly. Numerical simulations may be used to determine performance of nonparametric methods in various situations (compared, say, to ML, which would be fully efficient). Some modern nonparametric methods can perform very well, with relatively little loss of efficiency (e.g., van der Vaart 1998
). An example concerning the fitting of directional tuning curves is discussed in the next section.
The Bootstrap is a general nonparametric method of assessing uncertainty
Earlier we pointed to the use of simulation in computing P-values. There we were discussing the parametric case in which, under H0, the probability distribution of each observation Yi was Poisson with parameter µ = µ0. However, because neuronal spike counts are known to exhibit non-Poisson behavior, it may be desirable to use a procedure that does not assume Poisson counts. For this purpose, the translated observed spike counts Y1 (
µ0), Y2 (
µ0), ..., Yn (
µ0), with Yi being the count for the ith trial, may be repeatedly resampled. That is, we draw observations U1, U2, ..., Un where each Ui takes on the value of one of the observed values of Yi (
µ0), and all such values occur with probability 1/n. This is known as sampling with replacement. The reason for the translation is explained below. Having thereby obtained a large number of samples we then (as with the Poisson Ui values in the previous section) compute the proportion of samples for which |
µ0| > c, which, again, estimates the desired Prob (|
µ0| > c). This method is known as the Bootstrap. It is nonparametric because it does not require the assumption of a specific parametric family of distributions (here, Poisson). Instead, it requires the data values Yi to be independent replications from the same probability distribution. It works because it implicitly uses the data Y1, Y2, ..., Yn to estimate the probability distribution from which they are assumed to be drawn, and then computes the desired probability from this estimate. With a large simulation, the Poisson-based parametric boot strap method can compute Prob (|
µ0| > c) with arbitrarily good accuracy. The nonparametric Bootstrap P-value, however, is accurate only for large sample sizes n. The great value of the Bootstrap is that it may be applied in many complicated situations. We describe its use in testing for correlated activity between 2 neurons in CORRELATED PAIRS OF NEURONS. Furthermore, a substantial literature has demonstrated both theoretically and in numerical studies that it is widely effective (Davison and Hinkley 1997
; Efron and Tibshirani 1993
). In addition to treating the testing problem, the Bootstrap may be used to obtain SEs and confidence intervals.
The simplicity of the Bootstrap conceals an important point: its properties depend on the precise manner in which the resampling is conducted; arbitrary shuffles of the data do not necessarily accomplish desired statistical goals. In hypothesis testing, the P-value must be obtained under the hypothetical reality imposed by H0. For example, in the problem discussed above, we wished to compute the hypothetical probability Prob (|
µ0| > c) under the supposition that the observations had mean µ = µ0. We cannot merely sample the Yi values because their mean µ may not equal µ0. (Indeed, this is precisely the possibility we are examining when we perform the statistical hypothesis test.) Instead, the calculation may be based on new data that resemble the Yi values except that their mean is shifted to equal µ0. To produce a version of Yi that has mean µ0 rather than µ we would, in principle, want to subtract from Yi the quantity µ µ0. Because we do not know the value of µ we instead subtract (
µ0). Thus, in the procedure outlined above, the observed values of Yi (
µ0) were sampled.
STATISTICAL SUMMARIES OF FIRING RATE
The PSTH communicates firing rate and its evolution across time
Within an experiment, each trial's set of recordings from a single neuron produces a different spike train. By recording many repeated trials the regularity in the neuron's response to a stimulus, or the production of some behavior, may be obtained. A raster plot displays the complete set of spike times for all trials for a single neuron in a particular experimental condition, whereas the peristimulus time histogram (PSTH) accumulates these to show both the overall activity of the neuron and the way its firing rate varies across time (Gerstein and Kiang 1960
).
One reason the PSTH works well as a visual summary of the data is that our eye is able to pick up trends, smoothing the PSTH so that we see the temporal evolution of the firing rate. In Fig. 3 we have overlayed a smooth curve on a PSTH, based on data recorded from a locust antennal lobe neuron. The PSTH jumps somewhat erratically from bin to bin because of noise and does not track the kind of continously varying firing rate one would expect. For this reason, the curve is closer to what we understand from looking at the PSTH than is the PSTH itself.
|
(t) and are interested in estimating the entire curve described by
(t) for t in some interval, which we write as [A, B]. In statistics, one usually writes an estimate obtained from data with a hat, so an estimate of the firing-rate function would be written
(t).
Conceptually,
(t) is the trial-averaged firing-rate function, as opposed to the within-trial firing-rate function. As we note later, the latter may include effects resulting from "membrane memory" (the refractory period) or erratic bursting (bursting that is not time-locked to experimental stimulus or behavior), so that at time t it may depend on the precise timing of spikes that occur before time t. The function
(t) represents the tendency to fire at time t that a neuron would have if such memory or erratic bursting effects were removed, as they would be by averaging (hypothetically) over infinitely many trials. It is often, implicitly, considered to be a representation of the firing of large numbers of independently acting neurons similar to the one being recorded (each trial effectively representing the activity of one such similar neuron). Our focus, in the remainder of this section, on estimation of
(t) is not meant to imply that every analysis should begin with a time-domain representation of firing rate. In many situations, especially when oscillatory stimuli are used, frequency-domain analyses can produce valuable interpretations of the data (Brillinger 1992
; Mechler et al. 1998
; Pesaran et al. 2002
).
Some neuroscience questions involve instantaneous firing rate
Sometimes, questions of scientific interest are posed naturally in terms of instantaneous firing rate. For example, Olson et al. (2000)
examined neurons in the supplementary eye field (SEF) when a monkey moved his eyes in response to either an explicit external cue (the point to which the eyes were to move was illuminated) or an internally generated translation of a complex cue (a particular pattern at a fixation point determined the location to which the monkey was to move his eyes). In one part of their study, they were interested in the time at which maximal firing rate was achieved, and the delay of this maximum for the internally generated cue compared to the external cue. Formally, the problem is to determine the value of t that maximizes
(t). Similarly, when the maximal firing rate, or the difference between the maximal rate and a baseline rate, is of interest the problem involves estimation of
(t).
Probability models in terms of instantaneous firing rate make efficient use of the data
A second reason for estimating the instantaneous firing rate
(t) is that it appears in probability models and, as we have already noted, when coupled with maximum likelihood or Bayesian methods of estimation, probability models reduce the data efficiently. For example, if we assume a set of n spike times s1, ..., sn follow a time-varying Poisson process then their probability density is
![]() | (8) |
(t) for various values of t. For this purpose the function
(t) would have to be estimated from the data. As an estimate of firing rate the PSTH can be improved by smoothing
It would be possible to use the PSTH as an estimate of
(t). However, the PSTH is relatively noisy and, under the assumption that
(t) is itself smooth, it is possible to produce much better estimates
(t) by smoothing ("filtering"). Kass et al. (2003)
provided an illustration in which smoothing increased the efficiency of estimating
(t), compared to the PSTH, by a factor of 14. This situation is analogous to the estimation of the Poisson mean µ we considered earlier. Using the PSTH to estimate
(t) is like using s2 to estimate µ: it would take about 140 trials to get the same accuracy with the PSTH that may be achieved by only 10 trials with a smoothed version of the PSTH. Not all cases are as dramatic as a 14-fold improvement, but in our experience smoothing is likely to provide gains of at least severalfold.
There are many ways to produce a smooth firing-rate function; spline-based methods work very well
One simple way to reduce the noisiness of the PSTH is to use a moving average: one may pick a suitable time window
(such as 30 ms) and at each time t, average the values of the PSTH between t
and t +
. A variant of this, often called a Gaussian filter (or, in the statistics literature, a Gaussian kernel density estimator), uses a weighted average, putting weights (defined by a Gaussian probability density) on each PSTH value that decrease as the values to be averaged get further away from time t.
When
(t) varies slowly, Gaussian filters do a good job of estimating it. When the firing rate varies quickly, however, Gaussian filters are unable to capture the variation without introducing artificial high-frequency fluctuations. An illustration is given in Fig. 3, which uses data from an experiment on olfactory coding in the locust antennal lobe (Stopfer et al. 2003
). Gaussian filters can capture a rapid jump in firing rate only by allowing noise in time periods where the firing rate is varying slowly. When the bandwidth is increased so that the noise is filtered out, the rapid jump in firing rate is underestimated. In other words, to filter high-frequency noise, the Gaussian filter must remove the high-frequency signal. The problem here cannot be solved by a fixed-bandwidth filter. A better estimate of
(t), produced by a method called BARS, is given in the third panel of Fig. 3. It has the desirable characteristic of strongly smoothing the firing rate function, while allowing sudden increases or decreasesBARS effectively filters high-frequency noise while retaining high-frequency signal. BARS stands for Bayesian adaptive regression splines (DiMatteo et al. 2001
). A spline is a collection of polynomial curves that are joined at selected points (here, time points) called "knots." BARS uses a Bayesian Monte Carlo method to allocate knots optimally, to adapt to sharp variations in the intensity function. An application of BARS to an analysis of many single-unit firing rate functions is provided by Behseta et al. (2005)
.
An alternative approach to estimating
(t) is to define a plausible parametric form for it, with a small number of parameters, and then estimate these parameters (by ML). For example, Olson et al. (2000)
used 6 parameters to characterize firing rate intensity functions in 84 single units from the supplementary eye field. However, nonparametric approaches using BARS or related methods (cf. Hansen and Kooperberg 2002
; Loader 1999
) are often easier to implement, remain accurate when the parametric form is incorrect, and suffer relatively little loss of efficiency even when the parametric form holds. For example, in a closely related context, Kaufman et al. (2005)
considered the problem of fitting tuning curves to spike count data collected during wrist movement in 8 2-dimensional directions, with the goal of relaxing the usual assumption of cosine tuning. Kaufman et al. modified BARS to make it applicable to functions defined on a circle, and they showed in simulation studies that this nonparametric method was almost as efficient as cosine tuning when the true tuning function was exactly of cosine form, while having the advantage of being able to fit departures from cosine tuning. It is also possible to use partially parametric fitting methods. Barbieri et al. (2001)
showed how Zernike polynomials may be used to characterize hippocampal place fields. These involve both Gaussian and non-Gaussian components, and are able to capture departures from Gaussian place field tuning.
An important subtlety is that all smoothing methods require a choice of degree of smoothness. For example, a histogram requires a choice of bin width and a Gaussian filter requires a choice of bandwidth. These can be determined by statistical methods, but they are often selected based on visual appearance of results, possibly with some past experience in mind. To make a sensible choice, the data analyst must consider limits on the rate at which
(t) can change, based on physiology and properties of the stimulus or behavior. For example, one might ask whether the firing rate is likely to jump substantially within 10 ms, throughout the time interval during which the recording is made. If so (if, for instance, the stimulus is itself rapidly fluctuating), then a less smooth estimate of
(t) is desirable than if slow variations of
(t) are to be expected. More recently developed methods, including BARS, are able to use the data to determine smoothness across the time domain, but they always involve assumptions about how rapid the fluctuation of
(t) might be.
POPULATION CODING AND DECODING
Population coding refers to the information contained in the combined activity of multiple neurons. The general challenge of decoding, meaning to extract this information, reverses the process: it is to determine from a set of spike trains (obtained from multiple neurons) the experimental condition or stimulus that produced them. In the simplest case, some small set of conditions is used, and the problem is to infer which condition led to the observed response. Decoding problems arise in several contexts. Examples include position representation by ensembles of rat hippocampal neurons (Brown et al. 1998
; Zhang et al. 1998
), velocity encoding by fly H1 neurons (Bialek et al. 1991
), velocity and position encoding by M1 neurons (Georgopoulos et al. 1986
; Moran and Schwartz 1999
; Serruya et al. 2002
), and natural scene representations by cat lateral geniculate nucleus neurons (Stanley et al. 1999
). Aside from their use to study how populations of neurons represent information, decoding algorithms are being studied for their use in the brain-controlled neural prosthetic devices (Schwartz et al. 2004
; Serruya et al. 2002
, Wessberg et al. 2000
).
Statistical analysis proceeds in 2 stages. First, the neural firing that results from the relevant stimulus or behavioral condition must be estimated from some history of data. Second, this relationship must somehow be "inverted" to provide a prediction of stimulus or behavior based on neuronal activity. These 2 stages may be intertwined, but are conceptually and, usually, analytically separable. The first is sometimes called "learning," "estimation," or "encoding." The second is often called "prediction" or "decoding." In many situations the encoding step takes place in a steady-state environment. This is the most common framework for "supervised learning." With brain-controlled ("closed loop") movement, however, the evolution of the system's state is important because the subject will continue to adapt over time (Taylor et al. 2002
; Wessberg et al. 2000
) and part of the goal is to enable the subject to learn.
After reviewing some simple, linear decoding methods we discuss Bayesian decoding. The Bayesian decoding framework is based on "state-space" modeling, where data (spike counts) are assumed to depend on an underlying internal state (such as planned velocity of hand movement), which in turn is assumed to vary according to some specified dynamic model. We find the framework appealing because it facilitates incorporation of useful particulars, such as spike count distributions and movement smoothness constraints. In addition it is widely applicable because it can accommodate internal states that are multidimensional and relationships of data to state that are highly nonlinear.
The stimulus may be reverse-predicted from spiking activity using linear regression
Reverse regression (also called reverse correlation) is a very simple and widely used decoding method (e.g., Stanley et al. 1999
; Warland et al. 1997
). The "reverse" part of the terminology comes from the reversal of roles played by the stimulus and spike-activity response: the spike count data are treated as if they were the inputs, i.e., the fixed explanatory variables (the x values in the usual regression notation), whereas the stimulus is considered the output, i.e., the response variable (the Y variable, which in the regression formulation is assumed to be subject to random error).
Let us first describe the procedure as it might be applied to data from a single neuron. The explanatory variables are defined by forming a series of successive bins of spike counts at some suitable resolution, such as 30 ms. Thus, (x1, x2, ..., xT) would represent the vector of spike counts in T successive bins after the stimulus. Given a training set of many stimulus and firing-rate combinations one computes the usual least-squares coefficients
![]() | (9) |
![]() | (10) |
The population vector algorithm is simple and often effective
One widely successful decoding method is the population vector algorithm (PVA; Georgopoulos et al. 1982
, 1986
) and its modifications (Taylor et al. 2002
). The PVA has enabled investigation of cortical control of arm movement (e.g., Moran and Schwartz 1999
) including phenomena such as illusion perception (Schwartz et al. 2004
) and has also been used in quite different contexts, such as the representation of moving tactile stimuli in sensory cortex (Ruiz et al. 1995
). The method originated from the observation that motor cortex neurons are directionally tuned, with broad tuning curves that may be characterized reasonably well by 2 parameters, average firing rate and preferred direction. The preferred direction
is the direction in which the neuron's firing rate is highest. The preferred directions are obtained by fitting tuning curves to observed training data. Once one knows a neuron's preferred direction and its average firing rate, its firing rate for an arbitrarily specified direction may be determined, subject to some error. The PVA reverses the process, predicting direction from firing rate, by combining the observed activity from a large number of neurons. Specifically, the movement velocity
at a suitable time lag
after spiking activity is predicted by the "population vector"
according to the equation
![]() | (11) |
i is the ith neuron's preferred direction and the "weight" wi(t) is the neuron's firing rate at time t (after being normalized in some fashion). The intuitive interpretation of the population vector is that it represents the population when each neuron sends in its "vote" for its preferred direction, which is then weighted by its activity. The population vector algorithm is a special case of linear regression
We may connect the PVA to least-squares regression by thinking of the preferred direction
i as an explanatory variable, which, for a given velocity
, predicts spiking activity wi according to the linear regression model
![]() |
i ·
is the dot product. If we now collect the firing rates together into a vector Y = (w1, ..., wn) and the preferred directions (along with the intercepts) into a corresponding matrix X, once we take
=
we obtain the usual matrix form of the linear regression model
![]() | (12) |
![]() | (13) |
=
given by Eq. 9 reduces to the PVA in Eq. 13. Having made this observation, one may immediately generalize to the case in which the preferred directions are non-uniform by replacing the PVA estimator XTY in Eq. 13 with the least-squares estimator (XTX)1XTY in Eq. 9. In addition, it should be noted that ordinary least squares assumes the responses, in this case the firing rates of the various neurons, have equal variances and are independent. Otherwise, classical statistical analysis shows that weighted least squares (WLS) should be used
![]() |
is the variance matrix of Y (Kutner et al. 2004Bayesian decoding methods are efficient and flexible
We indicated earlier that Bayesian methods are efficient and we also mentioned that they can effectively incorporate a priori information. To implement Bayesian decoding for velocity, as an alternative to the PVA outlined above, we must apply Bayes' theorem (analogously to Eq. 4). This requires 1) a probability model for the data and 2) a prior probability assumption for the unknown velocities (which are to be predicted).
A simple version of Bayesian decoding obtains the predicted velocity vector at time t +
, which we now write as
t+
, from a pair of equations. (One simplification in this exposition is that we are assuming the time lag
between neural firing and movement remains fixed throughout the experiment and is the same for all neurons.) The first equation specifies the posterior distribution given by Bayes' theorem (analogously to Eq. 4)
![]() | (14) |
t+
) of a firing rate yt when a movement
t+
will be made at time t +
. The second probability density on the right-hand side of Eq. 14 plays the role of the prior for velocity at time t +
based on the spiking activity that has preceded time t. This density is determined, recursively, from the equation
![]() | (15) |
t1+
|y1, ...,| yt1) from the previous step of Eq. 14. The factor p(
t+
|
t1+
) is where the prior information (information separate from, or "prior" to, any spiking activity) is inserted: we assume the velocity
t+
will tend to resemble
t1+
, but will deviate from it by a small amount (governed by a smoothing parameter, which may also be obtained from the data); as a result the velocity will be smoothed across time. The details we are skipping may be found in Brockwell et al. (2004)
Equations 14 and 15 are very general. Alternative evolving behavioral parameters may take the place of velocity. If the likelihood function [the probability p(yt|
t+
) in Eq. 14] describes reasonably well the dependency of spiking activity on these parameters, there are sufficient data, and the prior smoothness condition [the probability p(
t+
|
t1+
) in Eq. 14] is appropriate, then Bayesian decoding will predict their evolution accurately.
Brockwell et al. (2004)
compared the PVA to optimal linear estimation (OLE) and Bayesian decoding for a computer-simulated hand movement using 200 neurons with tuning that was similar to directional tuning observed in motor cortex data. In their simulation study PVA was less efficient than OLE by a factor of 2 and less efficient than Bayesian decoding by a factor of 10. Thus, for example, Bayesian decoding of 25 neurons would be roughly as accurate as PVA decoding of 250 neurons. Brockwell et al. (2004)
also obtained roughly 7-fold improvement in MSE for a set of motor cortical data. Large gains in accuracy with Bayesian decoding have also been reported by Gao et al. (2002)
.
To illustrate the flexibility of the approach, we provide a second example of Bayesian decoding in the context of reconstructing the path of a foraging rat based on the activity of pyramidal neurons in the CA1 region of the hippocampus. It is well known that when a rat moves through its environment these neurons fire only in certain regions of space (O'Keefe and Dostrovsky 1971
; Wilson and McNaughton 1993
) and, as a result, they are called place cells and regions of space in which they fire are termed place receptive fields. Brown et al. (1998)
analyzed place cell spike trains and position data from a rat freely foraging in a circular environment. Here, we contrast Bayesian decoding with reverse correlation, as in Eq. 9 and 10. We analyze similar data in An example of iterative probability modeling.
To display results we show the X and Y components of the animal's position separately. Figure 4 displays the fit of the reverse correlation model (green line) and the Bayesian decoding algorithm (red line) to the X and Y components of position (blue line) for the decoding stage. Reverse correlation, which makes no assumption of smoothness in the animal's path, provides a very noisy prediction, and strays substantially from the correct path during several epochs. The proportion of variability explained by reverse correlation is R2 = 0.23. Compared with the reverse correlation fit, Bayesian decoding predicts the path of the animal in the separate directions very closely. Its deviations from the model fit, which are much smaller than those of the reverse correlation technique, occur at the extremes of the individual position components. The proportion of variability in the path data explained by Bayesian decoding is R2 = 0.87.
|
At the beginning of INFORMATION AND STATISTICAL EFFICIENCY we noted that non-Poisson variation in spike trains is to be expected, and has been documented, under particular conditions (see also Barbieri et al. 2001
; Kass and Ventura 2001
; Reich et al. 1998
; and the references therein). One phenomenon leading to non-Poisson behavior is the refractory period: immediately after a spike there is a short interval of time during which another spike is impossible and a longer interval of time during which the probability of a spike is reduced. For high firing rates refractory effects become detectable and may be important for some analyses. Another point, mentioned earlier, is that theoretical integrate-and-firetype models produce non-Poisson behavior. In addition, either bursting or excess trial-to-trial variation (arising from changes in stimuli or internal state of the subject) may be present. These may require specialized techniques that either complement or extend Poisson-based statistical methods.
In the first subsection we briefly describe the Poisson process and emphasize its usefulness in analyzing data pooled across trials. In the next subsection we indicate ways in which probability models may be extended so that they may be applied to non-Poisson data to produce within-trial analyses. Finally, we illustrate assessment of fit for Poisson and non-Poisson models. These methods are then used to demonstrate the presence of non-Poisson bursting in a hippocampal place cell in the next section on iterative probability modeling.
Time-varying Poisson processes are suitable for analyzing data pooled across many trials
The simplest conception of neuronal firing is that it follows a Poisson process with a time-invariant firing rate
. In this case, the spike-generating process is said to be stationary and Eq. 8 specializes to
![]() | (16) |
(t) is replaced by a single number
. When Eq. 16 holds, the probability that a spike will occur in the infinitesimal interval (t, t + dt) is
· dt, regardless of the time t (i.e., there is a constant firing rate over time). Because spike trains are very rarely stationary, Eq. 8 which allows for a time-varying firing rate, is an important generalization.
Equation 8 still involves a strong simplifying assumption: the probability of a spike at time t [i.e., in the infinitesimal interval (t, t + dt)] is given by
![]() | (17) |
One of the reasons the general Poisson model Eq. (8) is important is that it holds, at least approximately, when data are pooled across tria