JN Fuel your research with LabChart
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


J Neurophysiol 94: 8-25, 2005; doi:10.1152/jn.00648.2004
0022-3077/05 $8.00
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via ISI Web of Science (26)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Kass, R. E.
Right arrow Articles by Brown, E. N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kass, R. E.
Right arrow Articles by Brown, E. N.

INVITED REVIEW

Statistical Issues in the Analysis of Neuronal Data

Robert E. Kass1, Valérie Ventura1 and Emery N. Brown2

1Department of Statistics and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, Pennsylvania; and 2Neuroscience Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, and the Division of Health Sciences and Technology, Harvard Medical School, Massachusetts Institute of Technology, Boston, Massachusetts

Submitted 25 June 2004; accepted in final form 19 February 2005

ABSTRACT

Analysis of data from neurophysiological investigations can be challenging. Particularly when experiments involve dynamics of neuronal response, scientific inference can become subtle and some statistical methods may make much more efficient use of the data than others. This article reviews well-established statistical principles, which provide useful guidance, and argues that good statistical practice can substantially enhance results. Recent work on estimation of firing rate, population coding, and time-varying correlation provides improvements in experimental sensitivity equivalent to large increases in the number of neurons examined. Modern nonparametric methods are applicable to data from repeated trials. Many within-trial analyses based on a Poisson assumption can be extended to non-Poisson data. New methods have made it possible to track changes in receptive fields, and to study trial-to-trial variation, with modest amounts of data.

INTRODUCTION

Technical advances have made available new methods for collecting, storing, and manipulating electrophysiological data. Investigations may now not only characterize neuronal activity in anatomically well defined regions, but they can also examine dynamics of neuronal response and their relationship to behavior. Although elementary methods of data analysis [such as t-tests or visual examination of the peristimulus time histogram (PSTH)] remain useful for many purposes, the growing complexity of neuroscientific experiments, often examining subtle changes on a comparatively fine timescale, requires careful attention to statistical methods for data analysis. In this overview we discuss some of the fundamental data analytical issues that face researchers in neurophysiology, illustrating the general points with the problems of describing the evolution of a neuron's firing rate across time, finding accurate population codes, and assessing time-varying correlation between 2 neurons. In each case recent work has provided a statistical technique that outperforms previous methodology, boosting the scientific information as effectively as if the number of experimental trials, or the number of neurons, had been increased by a substantial factor. We also indicate some of the ways modern statistical procedures can accommodate important complexities, such as dynamic changes in temporal and spatial aspects of hippocampal place cell firing and trial-to-trial variability in cortical neurons recorded from behaving animals. Our review supplements the brief and general guidance offered by Curran-Everett and Benos (2004)Go, and may be regarded as an update to the early work of Perkel et al. (1967aGo,bGo).

The new field of computational neuroscience uses detailed biophysical models and artificial neural networks to study emergent behavior of neural systems and the way neural systems represent and transmit information (e.g., Dayan and Abbott 2001Go). Statistical methods have become an essential complement: in addition to their ubiquitous role in summarizing experimental data, they provide estimates of biologically relevant parameters, assessments of uncertainty about them, and a formalism for evaluating the fit of theoretical predictions to observed data. The statistical paradigm begins with informal investigation of the data, a process that has been named exploratory analysis (Tukey 1977Go). Exploratory results, together with judgment based on experience, help guide construction of an initial probability model to represent variability in observed data. Typically, this model posits some underlying, and unobserved, regularity that is coupled with known or unknown sources of irregularity. Every such model, and every statistical method, makes some assumptions. These lead to a reduction of the data to some typically small number of interpretable quantities. The data may be used, again, to check the probabilistic assumptions, and to consider ramifications of departures from them. Should serious departures from the assumptions be found, a new model may be formed. Figure 1 attempts to highlight the iterative nature of probability modeling and model assessment, followed by statistical inference, all embedded into the production of scientific conclusions from experimental results (Box et al. 1978Go). We demonstrate this process by analyzing data from a hippocampal "place" cell, using a collection of techniques reviewed in subsequent sections.



View larger version (10K):
[in this window]
[in a new window]
 
FIG. 1. Formal statistical inference within the process of drawing scientific conclusions. Probability model building is a prerequisite to formal inference procedures. Model building is iterative in the sense that tentative models must be assessed and, if necessary, improved or abandoned.

 
Within the field of statistics there are standards for evaluating alternative data-reduction procedures. Sometimes methods that seem intuitive may be shown to work well. In particular, simple data summaries and graphical displays are often sufficient for demonstrating striking experimental findings, when noise is small relative to signal, for example, or when additional sources of variation (beyond those summarized) need not be made explicit. In more subtle situations, however, intuitive data summaries may be inconclusive, possibly because they may make inefficient use of the data. We will illustrate by considering first an elementary framework for spike count analysis, where a peculiar yet plausible method of estimating firing rate will be shown to make inefficient use of the data. We then work by analogy, showing how, in important problems, intuitive methods used by neurophysiologists may be equivalent to discarding most of the available data. Our overview focuses on 3 main points: 1) maximum likelihood and Bayesian methods are able to use all of the available information to estimate an unknown parameter (such as a firing rate), 2) modern nonparametric methods, including the Bootstrap, are often applicable to neuronal data, and 3) many analyses based on a Poisson assumption can be extended to non-Poisson data.

INFORMATION AND STATISTICAL1 EFFICIENCY

Spike trains recorded in vivo tend to be irregular both within and across trials. To capture the available information about stimulus or behavior it is helpful to formulate a reasonably accurate probability model for the noise. In probability theory stochastic sequences of event times are called point processes, and the simplest point process is the Poisson process. A basic property is that counts of Poisson process events follow Poisson distributions. Thus, the simplest assumption one might make about a spike train is that it may be described as a Poisson process, and then the resulting spike counts (the number of spikes in particular windows of time) would follow Poisson distributions. In this section we use Poisson spike counts for pedagogical purposes to discuss some fundamental statistical concepts.

We do not want to give the impression, however, that Poisson models are to be trusted without critical examination. Indeed, there are reasons to think it unlikely in principle that spike trains should follow Poisson processes. Assuming large numbers of excitatory and inhibitory inputs, in the time-homogeneous case (meaning that the inputs do not vary systematically across time, as they would with a time-varying stimulus), Gerstein and Mandlebrot (1964)Go showed that a simple model of voltage variation with a fixed spiking threshold leads to interspike intervals that follow an inverse Gaussian distribution. (See also Tuckwell 1988Go.) Spike trains that follow this model would be non-Poisson. There has been considerable documentation and discussion of the variability of spike trains and its sources (an early study is Smith and Smith 1965Go; see Shadlen and Newsome 1998Go, and the references therein). However, whether a particular set of spike train data should be considered Poisson is an empirical matter, subject to statistical examination. Later we discuss methods that either remove the Poisson assumption in treating counts or substitute a more general conception of an event–time process.

A parametric statistical model is a probability model with a fixed set of parameters that may be estimated from the data

For illustrative purposes we consider the analysis of count data that are assumed to follow a Poisson distribution. If µ is the mean number of times that a neuron fires during some specified time interval (a, b), then the probability of getting y spikes during that interval is

(1)
If we assume a particular value for µ we may use Eq. 1 to calculate the probability of observing any specified number of spikes during the interval (a, b). In this situation µ is called a parameter. Having observed n trials of data y1, y2, ..., yn the obvious estimate of µ would be the average number of spikes per trial during the interval (a, b). We call an estimator of µ. It is not only intuitive, but is also an example of a very general and generally very good method of estimation called maximum likelihood (ML), to which we will return shortly. We will, throughout, follow the standard statistical convention of using to denote a generic estimator of µ, but often this estimator is obtained by ML.

Evaluation of an estimation procedure has four components

Although the sample mean is an intuitive estimator of the Poisson mean µ, one might dream up alternatives. For example, a property of the Poisson distribution is that its variance is also equal to µ; therefore, the sample variance s2 = 1/(n – 1) {Sigma}i (yi)2 could also be used to estimate the population variance µ. This may seem odd, and potentially inferior, on intuitive grounds because the whole point is to estimate the mean firing rate, not its variance. On the other hand, once we take the Poisson model seriously the population mean and variance become equal and, from a statistical perspective, it is reasonable to ask whether it is better to estimate one rather than the other from their sample analogues. Our purpose here is to present a simple analysis that demonstrates the inferiority of the sample variance to the sample mean as an estimator of the Poisson mean µ. We are going through this exercise so that we can draw an analogy to it later on.

A statistical estimator is itself subject to random variation: it is computed from a sample of data, and a new sample (a new set of n trials) would produce different data and therefore a different value of the estimator. To study the variation of an estimator we may calculate its expectation and variance, both obtained, theoretically, across repeated sets of n trials. The expectation of both and s2 is µ; on average, in repeated sets of trials, the positive errors tend to cancel the negative errors (they are both unbiased). But how accurate is s2 compared to ? Analytical calculation of the variance of each estimator gives2


where n is the number of repeated trials. Therefore, the variance of s2 is always larger than that of and it is, in this sense, less accurate: s2 tends to be further from the correct value of µ than . For example, if we take n = 100 trials and µ = 10, we find V () = 0.10, whereas V(s2) = 2.12. The estimator s2 has about 21 times the variability of , so that estimating µ using s2 would require about 2,100 trials of data to gain the same accuracy as using with 100 trials. See Figure 2.



View larger version (5K):
[in this window]
[in a new window]
 
FIG. 2. Histograms displaying distributions of and s2 based on 1,000 randomly generated samples of size n = 100 from a Poisson distribution with mean parameter µ = 10. In these repeated samples both and s2 have distributions that are approximately Normal (represented by the overlaid curves). Both distributions are centered at 10 but the values of s2 fluctuate much more than do the values of .

 
This simple analysis is compelling, as long as we are sure that the data follow a Poisson distribution. Because the distribution of real spike counts may well depart from Poisson, a realistic comparison of versus s2 should consider their behavior also under alternative assumptions. In this regard, the sample mean remains a reasonably good estimator of the population mean in large samples regardless of the probability distribution of the spike counts: the sample mean gets arbitrarily close to the population mean for sufficiently large samples (it is a consistent estimator). The sample variance, on the other hand, does so only if the population variance is truly equal to the population mean; otherwise, as the sample size increases it will converge to the wrong value (it is, in that case, an inconsistent estimator). We return to this issue below.

We now consider the theoretical framework that allows us to generalize these kinds of results to more complicated situations. We can list 4 components of formal statistical evaluation of an estimator: 1) a quantity to be estimated, which is a numerical characteristic (in mathematical jargon, a functional) of a probability distribution; 2) an estimator; 3) a criterion according to which the accuracy of the estimator will be judged; and 4) a set of assumptions about the (unknown) probability distribution, which allow us to carry out the evaluation of the estimator. In the Poisson case, 1) µ, the average firing rate in our interval of time, was the quantity being estimated; 2) and s2 were the 2 estimators being evaluated; 3) we considered the variance as our criterion; and 4) we assumed a Poisson distribution, with the trials being probabilistically identical and independent.

Concerning item 3), in general terms, a simple and illuminating criterion is mean squared error (MSE), which is computed by squaring the error and then averaging across infinitely many hypothetical replications of the data. Formally, the MSE of an estimator is written as

where E(·) represents the expectation (or expected value). By the linearity of the expectation we also obtain

where Bias () = E() – µ represents the average amount by which the estimator differs from µ. This formula is the basis for what is called the "Bias–Variance trade-off" in comparing alternative methods. It often happens that one method has comparatively high bias, whereas the other has comparatively high variance; MSE takes both into consideration. In the case of and s2, the bias of both is zero, so the MSE is actually equal to the variance. Thus, the criterion we were applying was the MSE. MSE is commonly used in studies of alternative estimation methods.

Fisher information measures the optimal precision with which a parameter may be estimated from data

In a 1922 paper that laid the groundwork for much of the statistical theory developed in the 20th century, R. A. Fisher analyzed the general form of the estimation problem. Fisher observed that estimators often may be approximated by the sample mean of some suitably defined random variables so that, according to the central limit theorem, they will tend to be approximately normally distributed (Gaussian) for large sample sizes. Figure 2 shows a pair of histograms of and s2 values calculated from 1,000 randomly generated samples of size n = 100 when the true Poisson mean was µ = 10. The asymptotic normality of the 2 estimators is indicated by the approximately normal shape of their histograms. (Much of statistical theory is asymptotic in the sense that it considers behavior when the sample size becomes arbitrarily large.) In the case of and s2, both estimators are also centered at the true value of µ. In general, for large samples, the bias of any reasonable estimator will be small, so that it is at least centered close to the true value of the quantity it is estimating. Fisher considered the class of all estimators that were asymptotically normal, with the correct value as the asymptotic mean and asked what was the smallest possible variance. This is equivalent to asking what is the minimal MSE in large samples.

The answer is that the smallest possible variance is the reciprocal of what is now called Fisher information. Put differently, the Fisher information gives the best possible precision of any asymptotically "good" estimator (where precision is the reciprocal of variance). The statistical efficiency of an estimator is judged by its variance, or asymptotic variance, relative to the bound determined by Fisher information. If an estimator attains the bound it is said to be efficient. Fisher described efficient estimators by saying they contain the maximal amount of information supplied by the data about the value of a parameter. That is, the information in the data pertaining to the parameter value may be used well (or poorly) to make an estimator more (or less) accurate; in using as much information about the parameter as is possible, an efficient estimator uses the data most efficiently and reduces to a minimum the uncertainty attached to it.

For the Poisson model, the Fisher information about µ in n observations is

(2)
and the large-sample variance bound µ/n is obtained by so it is an efficient estimator. On the other hand, use of s2 rather than to estimate µ effectively discards 95% of the information in the data. In general, the Fisher information can be computed from the probability model, or more precisely, the likelihood, which we introduce in the next subsection.

Maximum likelihood estimators are efficient

In addition to deriving the information bound, Fisher showed that the method of maximum likelihood (ML) produces efficient estimators, i.e., as explained above, this means that an ML estimator (MLE) automatically has the smallest possible uncertainty, for large samples.

The MLE is the value of the parameter that maximizes the likelihood function. To explain what the likelihood function is, and why one might want to maximize it, let us consider the Poisson distribution. For any particular value of µ, Eq. 1 gives the probability of observing y spikes during the interval (a, b) for a single trial. Table 1 displays the value of p(y|µ) for several values of y and µ. The argument of the probability density, that is, the quantity that is varying in Eq. 1, is the spike count y. Because Eq. 1 is a probability density, the sum over all possible values of y is 1. In Table 1 this corresponds to summing across rows. The likelihood function reverses what is held fixed and what is varying: it applies p(y|µ) with the data y held fixed and the unknown parameter µ allowed to vary. Reasoning based on the likelihood function continues along this reversed, or "inverse," path: having observed a spike count y, we would find it implausible to think that the correct value of µ was one that would make y an improbable event. Instead, the MLE maximizes the likelihood function, producing the parameter value µ that makes y the most probable to have occurred.


View this table:
[in this window]
[in a new window]
 
TABLE 1. Equation 1 evaluated at y = 0, 1, ... for several values of µ, where µ is the theoretical mean spike count in the interval (a, b)

 
Suppose, now, that we have spike counts y1, ..., yn for n trials. Let us denote the entire set of spike counts, now taken to be a vector, as y = (y1, ..., yn). The likelihood function is then based on the joint distribution of y

(3)
and the MLE of µ is the value that maximizes L(µ), that is the value of µ that gives the observed data its maximal probability of occurring under the assumed probability model. Here, as is usually the case, the parameter µ is allowed to vary continuously, and the MLE is usually obtained numerically by maximizing the loglikelihood function

The MLE is the value of the parameter that provides the best fit of the model to the data, where "fit" is defined in terms of the probability assigned to the data by the model. There are other possible ways to define fit, but in his 1922 paper, Fisher pointed out that the MLE is efficient, in the sense stated earlier. Generally, the great virtues of ML estimation are 1) for large samples it uses the maximal information (in Fisher's sense) available in the data, 2) there are explicit algorithms to compute it even in complicated settings, and 3) there is also an explicit method of assessing the uncertainty of MLEs.

In the Poisson case, an easy application of calculus shows that the MLE of µ is . Earlier we showed that, for estimating the Poisson parameter µ, the mean is much better than the variance, and then we stated that the mean is efficient in the sense of having the smallest possible MSE in large samples. We see here that the efficiency of may be considered a special case of the general result that the MLE is efficient.

Bayes estimators are also optimal in large samples

Closely related to ML estimation, and more powerful in some circumstances, is Bayesian inference. Bayesian inference is based on Bayes' theorem, which is an elementary formula for computing conditional probabilities of the form P(A|B) from probabilities of the form P(B|A). The profound implication of Bayes' theorem for statistical inference becomes apparent when A signifies an unknown parameter and B signifies the available data: having observed some data y, Bayes' theorem allows us to express our uncertain knowledge about the parameter µ quantitatively, in the form of a probability distribution p(µ|y). The inputs to Bayes' theorem are the likelihood function p(y|µ) and a prior distribution p(µ) that represents a priori knowledge about µ. To contrast with the prior distribution, which represents knowledge that logically precedes the data, the distribution p(µ|y) produced using the data is called the posterior. The usual Bayes estimator is the mean of this posterior distribution.

In the Poisson case, Bayes' theorem provides p(µ|y) in terms of the formula for p(y|µ) given by Eq. 1. Specifically, according to Bayes' theorem, the probability of any set of values of µ is determined by the equation

(4)
In words, Eq. 4 says that after the data have been observed, the posterior probability density of µ is the normalized product of the likelihood function and the prior. The prior density of µ is often taken to be slowly varying across relevant parameter values, indicating that little is to be assumed a priori about the potential value of the parameter. For example, in estimating a Poisson mean firing rate, the prior probability density might be taken to be slowly varying across all conceivably realistic values of the firing rate. In such cases the posterior is determined almost entirely by the likelihood function and Bayes estimates become very close to MLEs. For this reason, Bayes estimates share with MLEs the 3 desirable properties listed above. Furthermore, when there is good a priori information, as in many decoding problems where one may assume the stimulus to be varying smoothly over time (see section on Bayesian decoding methods), Bayesian methods can incorporate this information to produce better procedures. In addition, Bayesian methods sometimes offer computational advantages because Monte Carlo simulation methods can be used to implement them.

It is easy to obtain standard errors for maximum likelihood and Bayesian estimation

Earlier we noted that Fisher information is reciprocal of the minimal possible variance of any "good" estimator, and that ML and Bayesian estimators achieve this bound. It follows that Fisher information may be used to obtain SEs for ML and Bayesian estimators. In practice, it is convenient to substitute the observed information instead. In the Poisson case we replace µ with = in Eq. 2 to obtain

(5)
If the Poisson assumption is correct, an approximate 95% confidence interval can be obtained by inserting Eq. 5 into the general formula

(6)
Note that the SE formula in Eq. 5 is not the same as the usual formula for the SE of the sample mean

(7)
Equation 7 does not require the Poisson assumption, but it applies only to the sample mean, whereas Eq. 5 is an instance of a widely applicable formula for the SE of a ML or Bayesian estimator.3 This greatly enhances their practical utility. A method that may be used with virtually any estimator, and does not require a specific distributional assumption, is discussed in NONPARAMETRIC METHODS.

Likelihood ratio tests have good statistical power

To illustrate statistical considerations concerning estimation procedures, earlier we discussed estimation of the Poisson mean µ based on n repeated trials. Now suppose, under the same circumstances, we wish to test the null hypothesis H0 : µ = µ0. For example, µ0 could be the baseline firing rate of a neuron and we may wish to demonstrate that some stimulus increases (or decreases) this rate, so that the alternative hypothesis HA : µ != µ0 would hold. In analyzing this situation we will follow a standard statistical convention, which we have ignored up to this point: capital and lowercase versions of a letter are used to distinguish a random variable () from a particular value () taken by the random variable.

The obvious procedure to assess whether H0 holds would be to examine and if it is sufficiently far from µ0, reject H0. That is, we would define a quantity c and reject H0 whenever | – µ0| > c. To determine c we usually require the probability of falsely rejecting H0 to be small: under H0 (that is, assuming H0 were true), we would obtain c such that Prob (reject H0|H0 true) = Prob (| = µ0| > c) = {alpha}, for a suitably small {alpha}, assuming each trial's spike count Y has a Poisson distribution with µ = µ0. The usual criterion is to take {alpha} = 0.05. This is called the size of the hypothesis test, or the probability of a type I error. When using data to assess H0 a good practice is to report the P-value, which is the smallest value of {alpha} according to which the observed data would reject H0. For example, one might report P = 0.02 rather than simply P < 0.05.

To compute Prob (| – µ0| > c) one could rely on the approximate normality of the sample mean, which is likely to be adequately accurate unless µ0 and n are both small. Specifically, under H0, because each Y has mean µ0 and variance µ0, it follows that has mean µ0 and variance µ0/n; therefore the calculation of Prob (| – µ0| > c) may be based on a normal distribution with mean 0 and variance µ0/n, so that when {alpha} = 0.05.

Alternatively, if there are doubts about the adequacy of this approximation, Prob (| – µ0| > c) may be obtained (even for small n and µ0) by computer simulation. We simulate samples U1, U2, ..., Un, where each U1 has a Poisson distribution with mean µ0 and for each sample compute the sample mean . If we repeat this procedure a large number of times to obtain a large number of samples (say, 10,000) we can then approximate Prob (| – µ0| > c) accurately by the proportion of samples for which |u0| > c. This is an example of a parametric Bootstrap test. We discuss the nonparametric Bootstrap in the next section. Simulation-based calculation of P-values is an important technique because it applies to situations where normal approximations (or other, similar approximations) are either unavailable or of dubious validity.

In complicated settings there may not be an obvious statistical test, or there may be several (as in the hypothetical example of alternative estimators for the Poisson mean µ used above), or one may want some reassurance that, as with ML estimation, the method is a good one. The general approach based on the likelihood function is to define the ratio of the likelihood under H0 to that under HA and to reject H0 whenever this ratio is sufficiently small. This procedure is called the likelihood ratio test. Thus, to test H0 : µ = µ0 versus HA : µ = µ1 we would use LR = L0)/L1) and reject H0 when LR < c, where c is chosen so that Prob (LR < c) = {alpha} when H0 is true. Here, µ1 would be some prespecified alternative value of the Poisson mean. Usually, however, no such specific alternative to µ0 is apparent and the more generic alternative µ != µ0 is used. In this case the likelihood ratio L0)/L() is used, where is the MLE. For the Poisson case H0 : µ = µ0, it turns out that the likelihood ratio test yields the "obvious" procedure based on discussed above. The likelihood ratio test produces, similarly, many familiar statistical tests (such as t-tests and F-tests).

When 2 distinct procedures are available for testing the same null hypothesis they may be compared by first aligning them to have the same size (i.e., the same value of {alpha}) and then computing their power, which is the probability of correctly rejecting H0 for particular values of the parameter that satisfy HA. Extensive theory and simulation studies have shown that the likelihood ratio test tends to have good power, often nearly or exactly the best possible power, in a wide variety of situations, at least when sample sizes are reasonably large (e.g., van der Vaart 1998Go). It is therefore the most commonly applied statistical testing procedure when there are explicit parametric probability models under both hypotheses. Bayesian methods of hypothesis testing are beyond the scope of our review here, but they have received considerable attention in recent years and the interested reader should consult Kass and Raftery (1995)Go; an application to neuronal data analysis is in Behseta and Kass (2005)Go.

Applicability of a probability model to a data set should be assessed

Previous subsections have been concerned with fundamental principles of statistical inference. In our introductory discussion, however, we emphasized the inductive process diagrammed in Fig. 1 and, according to that figure, statistical inferences should occur only after determination that the probability model on which the inferences will be based is adequate. A variety of procedures have been developed to assess the adequacy of a probability model in representing the regularity and variability in a particular set of data. One method is to consider a more elaborate model and conduct a likelihood ratio test to choose between the simpler model and the more elaborate model. A second approach is to compare suitable features of the data to predictions made by the model, and a general procedure for doing so is to examine a set of functions of the data that would, if the probability model were correct, constitute a sample from a particular probability distribution, say F(y); when the function values are ordered and plotted against theoretical quantiles of the putative probability distribution F(y), the result, called a Q–Q (for quantile–quantile) plot, should be roughly linear (e.g., Hogg and Tanis 2001Go). Both of these methods are discussed later in POISSON AND NON-POISSON MODELS. In addition, chi-squared goodness-of-fit statistics may be used to evaluate fit when applicable (e.g., Sokal and Rohlf 1995Go).

NONPARAMETRIC METHODS

In the previous section we discussed parametric statistical methods, first considering estimation and then describing hypothesis testing based on the likelihood ratio test. In the following subsection we review very briefly the purpose of nonparametric methods, and a bit of the terminology, so that readers will be equipped to tackle the literature. We next describe a very general approach to assessing uncertainty, called the Bootstrap, illustrating it by revisiting the simple hypothesis testing problem posed earlier.

Modern nonparametric methods allow greater flexibility in probability models at the cost of some loss of efficiency

The Poisson distribution uses a single parameter µ. Other distributions or models, such as linear regression models, typically depend on a small number of parameters that must be estimated from (or "fitted to") the data at hand. It is important to consider carefully what might happen if the probability model were to be incorrect in some plausible way. For example, as we already mentioned, in the case of estimating µ, the sample mean remains consistent (and the MSE becomes arbitrarily small with sufficiently much data) for non-Poisson data, whereas s2 will estimate the population variance, which may differ from the population mean. Often a restrictive model, such as the Poisson, is replaced with a more flexible model. Highly flexible models are often called "nonparametric."

Classical nonparametric methods substitute transformed versions of the data (such as ranks), to obtain analogues of standard elementary methods (such as ANOVA). Modern nonparametric methods use models with large numbers of parameters, and their theoretical analysis often assumes there are infinitely many parameters. In some contexts, the parameters are used to characterize a probability distribution, as in the Bootstrap, which we discuss next. Sometimes, large numbers of parameters are used to replace linear relationships with nonlinear relationships whose form is not picked a priori but rather is determined from the data. For example, curve fitting is often accomplished with very flexible models (having large numbers of adjustable parameters), a process statisticians usually call nonparametric regression. We discuss the application of generalized nonparametric regression to neuronal data in the next section, where we view the PSTH as a nonparametric estimator of the firing rate function and contrast it with more efficient alternatives.

Nonparametric methods apply more generally than parametric methods. This generality comes at a cost: when the assumptions of a parametric model hold (or hold to a close approximation), they will be more efficient than nonparametric competitors. In deciding whether to apply a parametric or a nonparametric method, therefore, one must consider the likelihood of a substantial departure from the parametric assumptions. The notion of "substantial departure" will depend on the context, and some parametric methods are more likely than others to perform poorly. Numerical simulations may be used to determine performance of nonparametric methods in various situations (compared, say, to ML, which would be fully efficient). Some modern nonparametric methods can perform very well, with relatively little loss of efficiency (e.g., van der Vaart 1998Go). An example concerning the fitting of directional tuning curves is discussed in the next section.

The Bootstrap is a general nonparametric method of assessing uncertainty

Earlier we pointed to the use of simulation in computing P-values. There we were discussing the parametric case in which, under H0, the probability distribution of each observation Yi was Poisson with parameter µ = µ0. However, because neuronal spike counts are known to exhibit non-Poisson behavior, it may be desirable to use a procedure that does not assume Poisson counts. For this purpose, the translated observed spike counts Y1 – ( – µ0), Y2 – ( – µ0), ..., Yn ( – µ0), with Yi being the count for the ith trial, may be repeatedly resampled. That is, we draw observations U1, U2, ..., Un where each Ui takes on the value of one of the observed values of Yi – ( – µ0), and all such values occur with probability 1/n. This is known as sampling with replacement. The reason for the translation is explained below. Having thereby obtained a large number of samples we then (as with the Poisson Ui values in the previous section) compute the proportion of samples for which | – µ0| > c, which, again, estimates the desired Prob (| µ0| > c). This method is known as the Bootstrap. It is nonparametric because it does not require the assumption of a specific parametric family of distributions (here, Poisson). Instead, it requires the data values Yi to be independent replications from the same probability distribution. It works because it implicitly uses the data Y1, Y2, ..., Yn to estimate the probability distribution from which they are assumed to be drawn, and then computes the desired probability from this estimate. With a large simulation, the Poisson-based parametric boot strap method can compute Prob (| – µ0| > c) with arbitrarily good accuracy. The nonparametric Bootstrap P-value, however, is accurate only for large sample sizes n. The great value of the Bootstrap is that it may be applied in many complicated situations. We describe its use in testing for correlated activity between 2 neurons in CORRELATED PAIRS OF NEURONS. Furthermore, a substantial literature has demonstrated both theoretically and in numerical studies that it is widely effective (Davison and Hinkley 1997Go; Efron and Tibshirani 1993Go). In addition to treating the testing problem, the Bootstrap may be used to obtain SEs and confidence intervals.

The simplicity of the Bootstrap conceals an important point: its properties depend on the precise manner in which the resampling is conducted; arbitrary shuffles of the data do not necessarily accomplish desired statistical goals. In hypothesis testing, the P-value must be obtained under the hypothetical reality imposed by H0. For example, in the problem discussed above, we wished to compute the hypothetical probability Prob (| – µ0| > c) under the supposition that the observations had mean µ = µ0. We cannot merely sample the Yi values because their mean µ may not equal µ0. (Indeed, this is precisely the possibility we are examining when we perform the statistical hypothesis test.) Instead, the calculation may be based on new data that resemble the Yi values except that their mean is shifted to equal µ0. To produce a version of Yi that has mean µ0 rather than µ we would, in principle, want to subtract from Yi the quantity µ – µ0. Because we do not know the value of µ we instead subtract ( µ0). Thus, in the procedure outlined above, the observed values of Yi – ( – µ0) were sampled.

STATISTICAL SUMMARIES OF FIRING RATE

The PSTH communicates firing rate and its evolution across time

Within an experiment, each trial's set of recordings from a single neuron produces a different spike train. By recording many repeated trials the regularity in the neuron's response to a stimulus, or the production of some behavior, may be obtained. A raster plot displays the complete set of spike times for all trials for a single neuron in a particular experimental condition, whereas the peristimulus time histogram (PSTH) accumulates these to show both the overall activity of the neuron and the way its firing rate varies across time (Gerstein and Kiang 1960Go).

One reason the PSTH works well as a visual summary of the data is that our eye is able to pick up trends, smoothing the PSTH so that we see the temporal evolution of the firing rate. In Fig. 3 we have overlayed a smooth curve on a PSTH, based on data recorded from a locust antennal lobe neuron. The PSTH jumps somewhat erratically from bin to bin because of noise and does not track the kind of continously varying firing rate one would expect. For this reason, the curve is closer to what we understand from looking at the PSTH than is the PSTH itself.



View larger version (12K):
[in this window]
[in a new window]
 
FIG. 3. Estimates of {lambda}(t) based on smoothed versions of the peristimulus time histogram (PSTH), for data from a locust antennal lobe neuron in response to a stimulating odor. Left: raster plot of the 15 trials. Center: PSTH with 2 Gaussian filter fits, one having small bandwidth and one having larger bandwidth, the former of which undersmooths where the firing rate varies slowly, whereas the latter oversmooths where the firing rate varies rapidly. Right: PSTH and Bayesian adaptive regression splines (BARS) fit.

 
We will later discuss the method we used to generate the fitted curve in Fig. 3. We should emphasize our use of the modifier "fitted." A fundamental conceptualization in statistical theory distinguishes the unknown "true" firing rate curve (which would result from the hypothetical possibility of running infinitely many trials) from an estimate of it obtained from actual data, in the sense described in Information and statistical efficiency. Furthermore, when we speak of estimating the firing rate we mean that we will use the data to produce an estimate of the instantaneous firing rate at each time t, where t varies across the whole range of experimentally interesting values. We write the instantaneous firing rate as {lambda}(t) and are interested in estimating the entire curve described by {lambda}(t) for t in some interval, which we write as [A, B]. In statistics, one usually writes an estimate obtained from data with a hat, so an estimate of the firing-rate function would be written (t).

Conceptually, {lambda}(t) is the trial-averaged firing-rate function, as opposed to the within-trial firing-rate function. As we note later, the latter may include effects resulting from "membrane memory" (the refractory period) or erratic bursting (bursting that is not time-locked to experimental stimulus or behavior), so that at time t it may depend on the precise timing of spikes that occur before time t. The function {lambda}(t) represents the tendency to fire at time t that a neuron would have if such memory or erratic bursting effects were removed, as they would be by averaging (hypothetically) over infinitely many trials. It is often, implicitly, considered to be a representation of the firing of large numbers of independently acting neurons similar to the one being recorded (each trial effectively representing the activity of one such similar neuron). Our focus, in the remainder of this section, on estimation of {lambda}(t) is not meant to imply that every analysis should begin with a time-domain representation of firing rate. In many situations, especially when oscillatory stimuli are used, frequency-domain analyses can produce valuable interpretations of the data (Brillinger 1992Go; Mechler et al. 1998Go; Pesaran et al. 2002Go).

Some neuroscience questions involve instantaneous firing rate

Sometimes, questions of scientific interest are posed naturally in terms of instantaneous firing rate. For example, Olson et al. (2000)Go examined neurons in the supplementary eye field (SEF) when a monkey moved his eyes in response to either an explicit external cue (the point to which the eyes were to move was illuminated) or an internally generated translation of a complex cue (a particular pattern at a fixation point determined the location to which the monkey was to move his eyes). In one part of their study, they were interested in the time at which maximal firing rate was achieved, and the delay of this maximum for the internally generated cue compared to the external cue. Formally, the problem is to determine the value of t that maximizes {lambda}(t). Similarly, when the maximal firing rate, or the difference between the maximal rate and a baseline rate, is of interest the problem involves estimation of {lambda}(t).

Probability models in terms of instantaneous firing rate make efficient use of the data

A second reason for estimating the instantaneous firing rate {lambda}(t) is that it appears in probability models and, as we have already noted, when coupled with maximum likelihood or Bayesian methods of estimation, probability models reduce the data efficiently. For example, if we assume a set of n spike times s1, ..., sn follow a time-varying Poisson process then their probability density is

(8)
We discuss Poisson processes in POISSON AND NON-POISSON MODELS. Here, our point is that to apply the formula in Eq. 8, which is used in many theoretical and data-analytic calculations, we must be able to obtain values of {lambda}(t) for various values of t. For this purpose the function {lambda}(t) would have to be estimated from the data.

As an estimate of firing rate the PSTH can be improved by smoothing

It would be possible to use the PSTH as an estimate of {lambda}(t). However, the PSTH is relatively noisy and, under the assumption that {lambda}(t) is itself smooth, it is possible to produce much better estimates (t) by smoothing ("filtering"). Kass et al. (2003)Go provided an illustration in which smoothing increased the efficiency of estimating {lambda}(t), compared to the PSTH, by a factor of 14. This situation is analogous to the estimation of the Poisson mean µ we considered earlier. Using the PSTH to estimate {lambda}(t) is like using s2 to estimate µ: it would take about 140 trials to get the same accuracy with the PSTH that may be achieved by only 10 trials with a smoothed version of the PSTH. Not all cases are as dramatic as a 14-fold improvement, but in our experience smoothing is likely to provide gains of at least severalfold.

There are many ways to produce a smooth firing-rate function; spline-based methods work very well

One simple way to reduce the noisiness of the PSTH is to use a moving average: one may pick a suitable time window {delta} (such as 30 ms) and at each time t, average the values of the PSTH between t{delta} and t + {delta}. A variant of this, often called a Gaussian filter (or, in the statistics literature, a Gaussian kernel density estimator), uses a weighted average, putting weights (defined by a Gaussian probability density) on each PSTH value that decrease as the values to be averaged get further away from time t.

When {lambda}(t) varies slowly, Gaussian filters do a good job of estimating it. When the firing rate varies quickly, however, Gaussian filters are unable to capture the variation without introducing artificial high-frequency fluctuations. An illustration is given in Fig. 3, which uses data from an experiment on olfactory coding in the locust antennal lobe (Stopfer et al. 2003Go). Gaussian filters can capture a rapid jump in firing rate only by allowing noise in time periods where the firing rate is varying slowly. When the bandwidth is increased so that the noise is filtered out, the rapid jump in firing rate is underestimated. In other words, to filter high-frequency noise, the Gaussian filter must remove the high-frequency signal. The problem here cannot be solved by a fixed-bandwidth filter. A better estimate of {lambda}(t), produced by a method called BARS, is given in the third panel of Fig. 3. It has the desirable characteristic of strongly smoothing the firing rate function, while allowing sudden increases or decreases—BARS effectively filters high-frequency noise while retaining high-frequency signal. BARS stands for Bayesian adaptive regression splines (DiMatteo et al. 2001Go). A spline is a collection of polynomial curves that are joined at selected points (here, time points) called "knots." BARS uses a Bayesian Monte Carlo method to allocate knots optimally, to adapt to sharp variations in the intensity function. An application of BARS to an analysis of many single-unit firing rate functions is provided by Behseta et al. (2005)Go.

An alternative approach to estimating {lambda}(t) is to define a plausible parametric form for it, with a small number of parameters, and then estimate these parameters (by ML). For example, Olson et al. (2000)Go used 6 parameters to characterize firing rate intensity functions in 84 single units from the supplementary eye field. However, nonparametric approaches using BARS or related methods (cf. Hansen and Kooperberg 2002Go; Loader 1999Go) are often easier to implement, remain accurate when the parametric form is incorrect, and suffer relatively little loss of efficiency even when the parametric form holds. For example, in a closely related context, Kaufman et al. (2005)Go considered the problem of fitting tuning curves to spike count data collected during wrist movement in 8 2-dimensional directions, with the goal of relaxing the usual assumption of cosine tuning. Kaufman et al. modified BARS to make it applicable to functions defined on a circle, and they showed in simulation studies that this nonparametric method was almost as efficient as cosine tuning when the true tuning function was exactly of cosine form, while having the advantage of being able to fit departures from cosine tuning. It is also possible to use partially parametric fitting methods. Barbieri et al. (2001)Go showed how Zernike polynomials may be used to characterize hippocampal place fields. These involve both Gaussian and non-Gaussian components, and are able to capture departures from Gaussian place field tuning.

An important subtlety is that all smoothing methods require a choice of degree of smoothness. For example, a histogram requires a choice of bin width and a Gaussian filter requires a choice of bandwidth. These can be determined by statistical methods, but they are often selected based on visual appearance of results, possibly with some past experience in mind. To make a sensible choice, the data analyst must consider limits on the rate at which {lambda}(t) can change, based on physiology and properties of the stimulus or behavior. For example, one might ask whether the firing rate is likely to jump substantially within 10 ms, throughout the time interval during which the recording is made. If so (if, for instance, the stimulus is itself rapidly fluctuating), then a less smooth estimate of {lambda}(t) is desirable than if slow variations of {lambda}(t) are to be expected. More recently developed methods, including BARS, are able to use the data to determine smoothness across the time domain, but they always involve assumptions about how rapid the fluctuation of {lambda}(t) might be.

POPULATION CODING AND DECODING

Population coding refers to the information contained in the combined activity of multiple neurons. The general challenge of decoding, meaning to extract this information, reverses the process: it is to determine from a set of spike trains (obtained from multiple neurons) the experimental condition or stimulus that produced them. In the simplest case, some small set of conditions is used, and the problem is to infer which condition led to the observed response. Decoding problems arise in several contexts. Examples include position representation by ensembles of rat hippocampal neurons (Brown et al. 1998Go; Zhang et al. 1998Go), velocity encoding by fly H1 neurons (Bialek et al. 1991Go), velocity and position encoding by M1 neurons (Georgopoulos et al. 1986Go; Moran and Schwartz 1999Go; Serruya et al. 2002Go), and natural scene representations by cat lateral geniculate nucleus neurons (Stanley et al. 1999Go). Aside from their use to study how populations of neurons represent information, decoding algorithms are being studied for their use in the brain-controlled neural prosthetic devices (Schwartz et al. 2004Go; Serruya et al. 2002Go, Wessberg et al. 2000Go).

Statistical analysis proceeds in 2 stages. First, the neural firing that results from the relevant stimulus or behavioral condition must be estimated from some history of data. Second, this relationship must somehow be "inverted" to provide a prediction of stimulus or behavior based on neuronal activity. These 2 stages may be intertwined, but are conceptually and, usually, analytically separable. The first is sometimes called "learning," "estimation," or "encoding." The second is often called "prediction" or "decoding." In many situations the encoding step takes place in a steady-state environment. This is the most common framework for "supervised learning." With brain-controlled ("closed loop") movement, however, the evolution of the system's state is important because the subject will continue to adapt over time (Taylor et al. 2002Go; Wessberg et al. 2000Go) and part of the goal is to enable the subject to learn.

After reviewing some simple, linear decoding methods we discuss Bayesian decoding. The Bayesian decoding framework is based on "state-space" modeling, where data (spike counts) are assumed to depend on an underlying internal state (such as planned velocity of hand movement), which in turn is assumed to vary according to some specified dynamic model. We find the framework appealing because it facilitates incorporation of useful particulars, such as spike count distributions and movement smoothness constraints. In addition it is widely applicable because it can accommodate internal states that are multidimensional and relationships of data to state that are highly nonlinear.

The stimulus may be reverse-predicted from spiking activity using linear regression

Reverse regression (also called reverse correlation) is a very simple and widely used decoding method (e.g., Stanley et al. 1999Go; Warland et al. 1997Go). The "reverse" part of the terminology comes from the reversal of roles played by the stimulus and spike-activity response: the spike count data are treated as if they were the inputs, i.e., the fixed explanatory variables (the x values in the usual regression notation), whereas the stimulus is considered the output, i.e., the response variable (the Y variable, which in the regression formulation is assumed to be subject to random error).

Let us first describe the procedure as it might be applied to data from a single neuron. The explanatory variables are defined by forming a series of successive bins of spike counts at some suitable resolution, such as 30 ms. Thus, (x1, x2, ..., xT) would represent the vector of spike counts in T successive bins after the stimulus. Given a training set of many stimulus and firing-rate combinations one computes the usual least-squares coefficients

(9)
where Y is the vector of observed stimulus values and the ith row of the matrix X is the spike count vector corresponding to the ith stimulus value. The predictor of a new, unobserved stimulus y* given a spike count vector x* is then

(10)
In the case of N neurons all NT spike counts are used as explanatory variables and then Eqs. 9 and 10 are applied. In some applications nonparametric regression methods are used in place of linear regression (Warland et al. 1997Go).

The population vector algorithm is simple and often effective

One widely successful decoding method is the population vector algorithm (PVA; Georgopoulos et al. 1982Go, 1986Go) and its modifications (Taylor et al. 2002Go). The PVA has enabled investigation of cortical control of arm movement (e.g., Moran and Schwartz 1999Go) including phenomena such as illusion perception (Schwartz et al. 2004Go) and has also been used in quite different contexts, such as the representation of moving tactile stimuli in sensory cortex (Ruiz et al. 1995Go). The method originated from the observation that motor cortex neurons are directionally tuned, with broad tuning curves that may be characterized reasonably well by 2 parameters, average firing rate and preferred direction. The preferred direction is the direction in which the neuron's firing rate is highest. The preferred directions are obtained by fitting tuning curves to observed training data. Once one knows a neuron's preferred direction and its average firing rate, its firing rate for an arbitrarily specified direction may be determined, subject to some error. The PVA reverses the process, predicting direction from firing rate, by combining the observed activity from a large number of neurons. Specifically, the movement velocity at a suitable time lag {tau} after spiking activity is predicted by the "population vector" according to the equation

(11)
where i is the ith neuron's preferred direction and the "weight" wi(t) is the neuron's firing rate at time t (after being normalized in some fashion). The intuitive interpretation of the population vector is that it represents the population when each neuron sends in its "vote" for its preferred direction, which is then weighted by its activity.

The population vector algorithm is a special case of linear regression

We may connect the PVA to least-squares regression by thinking of the preferred direction i as an explanatory variable, which, for a given velocity , predicts spiking activity wi according to the linear regression model

where the coefficients bi0 are intercepts representing baseline firing rate and i · is the dot product. If we now collect the firing rates together into a vector Y = (w1, ..., wn) and the preferred directions (along with the intercepts) into a corresponding matrix X, once we take {beta} = we obtain the usual matrix form of the linear regression model

(12)
as a description of the dependency of spiking activity on preferred direction.4 Furthermore, we also have

(13)
which is the PVA prediction in Eq. 11. Now, if we assume that the preferred directions are uniformly distributed, then it may be shown that the matrix XTX reduces to the identity matrix. In this case, the least-squares fit for the velocity vector {beta} = given by Eq. 9 reduces to the PVA in Eq. 13. Having made this observation, one may immediately generalize to the case in which the preferred directions are non-uniform by replacing the PVA estimator XTY in Eq. 13 with the least-squares estimator (XTX)–1XTY in Eq. 9. In addition, it should be noted that ordinary least squares assumes the responses, in this case the firing rates of the various neurons, have equal variances and are independent. Otherwise, classical statistical analysis shows that weighted least squares (WLS) should be used

where {Sigma} is the variance matrix of Y (Kutner et al. 2004Go). Salinas and Abbott (1994)Go refer to the WLS estimator as the optimal linear estimator.

Bayesian decoding methods are efficient and flexible

We indicated earlier that Bayesian methods are efficient and we also mentioned that they can effectively incorporate a priori information. To implement Bayesian decoding for velocity, as an alternative to the PVA outlined above, we must apply Bayes' theorem (analogously to Eq. 4). This requires 1) a probability model for the data and 2) a prior probability assumption for the unknown velocities (which are to be predicted).

A simple version of Bayesian decoding obtains the predicted velocity vector at time t + {tau}, which we now write as {nu}t+{tau}, from a pair of equations. (One simplification in this exposition is that we are assuming the time lag {tau} between neural firing and movement remains fixed throughout the experiment and is the same for all neurons.) The first equation specifies the posterior distribution given by Bayes' theorem (analogously to Eq. 4)

(14)
In Eq. 14 the likelihood function is the probability p(yt|{nu}t+{tau}) of a firing rate yt when a movement {nu}t+{tau} will be made at time t + {tau}. The second probability density on the right-hand side of Eq. 14 plays the role of the prior for velocity at time t + {tau} based on the spiking activity that has preceded time t. This density is determined, recursively, from the equation

(15)
In Eq. 15 we obtain the integrand p({nu}t–1+{tau}|y1, ...,| yt–1) from the previous step of Eq. 14. The factor p({nu}t+{tau}|{nu}t–1+{tau}) is where the prior information (information separate from, or "prior" to, any spiking activity) is inserted: we assume the velocity {nu}t+{tau} will tend to resemble {nu}t–1+{tau}, but will deviate from it by a small amount (governed by a smoothing parameter, which may also be obtained from the data); as a result the velocity will be smoothed across time. The details we are skipping may be found in Brockwell et al. (2004)Go and the references therein. Equation 15 is sometimes called the Chapman–Kolmogorov equation (Brown et al. 1998Go).

Equations 14 and 15 are very general. Alternative evolving behavioral parameters may take the place of velocity. If the likelihood function [the probability p(yt|{nu}t+{tau}) in Eq. 14] describes reasonably well the dependency of spiking activity on these parameters, there are sufficient data, and the prior smoothness condition [the probability p({nu}t+{tau}|{nu}t–1+{tau}) in Eq. 14] is appropriate, then Bayesian decoding will predict their evolution accurately.

Brockwell et al. (2004)Go compared the PVA to optimal linear estimation (OLE) and Bayesian decoding for a computer-simulated hand movement using 200 neurons with tuning that was similar to directional tuning observed in motor cortex data. In their simulation study PVA was less efficient than OLE by a factor of 2 and less efficient than Bayesian decoding by a factor of 10. Thus, for example, Bayesian decoding of 25 neurons would be roughly as accurate as PVA decoding of 250 neurons. Brockwell et al. (2004)Go also obtained roughly 7-fold improvement in MSE for a set of motor cortical data. Large gains in accuracy with Bayesian decoding have also been reported by Gao et al. (2002)Go.

To illustrate the flexibility of the approach, we provide a second example of Bayesian decoding in the context of reconstructing the path of a foraging rat based on the activity of pyramidal neurons in the CA1 region of the hippocampus. It is well known that when a rat moves through its environment these neurons fire only in certain regions of space (O'Keefe and Dostrovsky 1971Go; Wilson and McNaughton 1993Go) and, as a result, they are called place cells and regions of space in which they fire are termed place receptive fields. Brown et al. (1998)Go analyzed place cell spike trains and position data from a rat freely foraging in a circular environment. Here, we contrast Bayesian decoding with reverse correlation, as in Eq. 9 and 10. We analyze similar data in An example of iterative probability modeling.

To display results we show the X and Y components of the animal's position separately. Figure 4 displays the fit of the reverse correlation model (green line) and the Bayesian decoding algorithm (red line) to the X and Y components of position (blue line) for the decoding stage. Reverse correlation, which makes no assumption of smoothness in the animal's path, provides a very noisy prediction, and strays substantially from the correct path during several epochs. The proportion of variability explained by reverse correlation is R2 = 0.23. Compared with the reverse correlation fit, Bayesian decoding predicts the path of the animal in the separate directions very closely. Its deviations from the model fit, which are much smaller than those of the reverse correlation technique, occur at the extremes of the individual position components. The proportion of variability in the path data explained by Bayesian decoding is R2 = 0.87.



View larger version (39K):
[in this window]
[in a new window]
 
FIG. 4. Decoding of the position of a rat. Simultaneous activity of 34 place cells was recorded while the animal foraged for chocolate pellets for 23 min. Electroencephalogram (EEG) data were recorded from the same electrodes and bandpass filtered from 6 to 14 Hz to extract the theta rhythm, which is known to be an important determinant of place cell firing. Position of each animal was measured at 30 Hz by a camera that tracked the location of 2 infrared diodes mounted on the animal's headstage. Brown et al. (1998)Go defined the encoding stage as the first 13 min of spike train, path, and theta rhythm data and estimated the parameters of an inhomogeneous Poisson process model for each place cell, together with those in a spatial random-walk model for the prior probabilities. They defined the decoding stage as the last 10 min of the experiment for each animal. Position estimates for the decoding analysis were updated every 33 ms, the frame rate of the tracking camera. In total 18,000 (30 estimates/s x 60 s/min x 10 min) decoded position estimates were computed. X and Y directions are plotted separately. Bayesian decoding provides an accurate reconstruction of the path, whereas reverse regression is noisy and subject to sustained, substantial error in some parts of the experiment (e.g., around 500 s).

 
POISSON AND NON-POISSON MODELS

At the beginning of INFORMATION AND STATISTICAL EFFICIENCY we noted that non-Poisson variation in spike trains is to be expected, and has been documented, under particular conditions (see also Barbieri et al. 2001Go; Kass and Ventura 2001Go; Reich et al. 1998Go; and the references therein). One phenomenon leading to non-Poisson behavior is the refractory period: immediately after a spike there is a short interval of time during which another spike is impossible and a longer interval of time during which the probability of a spike is reduced. For high firing rates refractory effects become detectable and may be important for some analyses. Another point, mentioned earlier, is that theoretical integrate-and-fire–type models produce non-Poisson behavior. In addition, either bursting or excess trial-to-trial variation (arising from changes in stimuli or internal state of the subject) may be present. These may require specialized techniques that either complement or extend Poisson-based statistical methods.

In the first subsection we briefly describe the Poisson process and emphasize its usefulness in analyzing data pooled across trials. In the next subsection we indicate ways in which probability models may be extended so that they may be applied to non-Poisson data to produce within-trial analyses. Finally, we illustrate assessment of fit for Poisson and non-Poisson models. These methods are then used to demonstrate the presence of non-Poisson bursting in a hippocampal place cell in the next section on iterative probability modeling.

Time-varying Poisson processes are suitable for analyzing data pooled across many trials

The simplest conception of neuronal firing is that it follows a Poisson process with a time-invariant firing rate {lambda}. In this case, the spike-generating process is said to be stationary and Eq. 8 specializes to

(16)
where T = BA is the total observation time. The process is also called homogeneous (as opposed to time-varying or inhomogeneous). The simplicity of the homogeneous case is that the function {lambda}(t) is replaced by a single number {lambda}. When Eq. 16 holds, the probability that a spike will occur in the infinitesimal interval (t, t + dt) is {lambda} · dt, regardless of the time t (i.e., there is a constant firing rate over time). Because spike trains are very rarely stationary, Eq. 8 which allows for a time-varying firing rate, is an important generalization.

Equation 8 still involves a strong simplifying assumption: the probability of a spike at time t [i.e., in the infinitesimal interval (t, t + dt)] is given by

(17)
that is, the firing rate depends only on time. For both homogeneous and time-varying Poisson processes the probability of a spike at time t is independent of the number and timing of the spikes that have occurred before time t. Thus, effects stemming from a refractory period or bursting are being ignored.

One of the reasons the general Poisson model Eq. (8) is important is that it holds, at least approximately, when data are pooled across tria