|
|
||||||||
Innovative Methodology
1Laboratoire de Physiologie Cérébrale, Centre National de la Recherche Scientifique (CNRS) Unité Mixte de Recherche (UMR) 8118, Université René Descartes, 75006, Paris; 2Laboratoire de Physique Théorique des Liquides, CNRS UMR 7600, Université Pierre et Marie Curie, 75252 Paris Cedex 05; and 3Laboratoire d'Analyse et de Mathématiques Appliquées, CNRS UMR 8050, Université de Marne la Vallée, Batiment Copernic, Cité Descartes, 77454 Marne la Vallée Cedex, France
Submitted 10 March 2003; accepted in final form 14 January 2004
|
|
ABSTRACT |
|---|
|
|
|
INTRODUCTION |
|---|
|
1) First, the sequence of spike occurrence times emitted by a neuron has well-known features like the presence of a refractory period (e.g., after a spike has been emitted no other spike will be emitted by the same neuron for a period whose duration varies from 2 to 10 ms). Other features include an interspike interval (ISI) probability density, which is often unimodal and skewed (it rises "fast" and decays "slowly"). Although some methods make use of the presence of a refractory period (Fee et al. 1996a
; Harris et al. 2000
), none makes use of the full ISI density. This amounts to throwing useful information away, for if event 101 has been attributed to neuron 3 we not only know that no other spike can arise from neuron 3 for, say, the next 5 ms, but we also know that the next spike of this neuron is likely to occur with an ISI between 10 and 20 ms. As pointed out by Lewicki (1994
), any spike-sorting method that would include an estimate of the ISI density of the different neurons should produce better classification performances.
2) Second, the spike waveform generated by a given neuron often depends on the time elapsed since its last spike (Quirk et al. 1999). This nonstationarity of spike waveform will worsen the classification reliability of methods that assume stationary waveform (Lewicki 1994
; Pouzat et al. 2002
). Other methods (e.g., Gray et al. 1995
for a "manual" method and Fee et al. 1996a
for an automatic one) assume a continuity of the cluster of events originating from a given neuron when represented in a "feature space" [the peak amplitude, valley(s) amplitude(s), half-width are examples of such features] and make use of this property to classify nonstationary spikes. Although it is clear that the latter methods will in general outperform the former, we think that some cases could arise where none of them would work. In particular, doublet-firing cells could generate well-separated clusters in feature space and these clusters would not be put together by presently available methods.
These considerations motivated us to look for a new method that would have a built-in capability to take into account both the ISI density of the neurons and their spike waveform dynamics. However, there are two additional issues with spike-sorting that, we think, are worth considering.
1) "Hard" versus "soft" spike-sorting. Most spike-sorting methods, including the ones where users actually perform the clustering and some automatic methods like the one of Fee et al. (1996a
), generate "hard" classification. That is, each recorded event is attributed "fully" to one of the K neurons contributing to the data. Methods based on probabilistic models (Lewicki 1994
; Nguyen et al. 2003
; Pouzat et al. 2002
; Sahani 1999) deal with this issue differently: each event is given a probability to have been generated by each of the K neurons (e.g., one gets results like: event 100 was generated by neuron 1 with probability 0.75 and by neuron 2 with probability 0.25). We will refer to this kind of classification as "soft" classification. Users of the latter methods are often not aware of their soft classification aspect because what one does for subsequent processing like estimating the ISI density of a given neuron or computing the cross-correlogram of spike trains from 2 different neurons, is to force the soft classification into the most likely one (in the example above we would force the classification of event 100 into neuron 1). By doing so, however, the analyst introduces a bias in the estimates. It seems therefore interesting to find a way to keep the soft classification aspect of the model-based methods when computing estimates.
2) Confidence intervals on model parameters. Methods based on an explicit model for data generation could in principle generate confidence intervals for their model parameters, although this was never done. However, the values of model parameters do influence the classification and a source of bias could be reduced by including information about uncertainty of model parameters values.
Based on these considerations we have developed a semiautomatic method that takes spike waveform dynamics and ISI densities into account, which produces soft classification and allows the user to make full use of it, and which generates a full posterior density for the model parameters and confidence intervals. Our method uses a probabilistic model for data generation where the label (i.e., neuron of origin) of a given spike at a given time depends on the label of other spikes occurring just before or just after. We will call configuration the specification of a label for each spike in the train. It will become clear that with this model the spike-sorting problem is analogous to an image restoration problem where a picture (i.e., a set of pixel values) that has been generated by a real object and corrupted by some noise is given. The problem is to find out what the actual pixel value was, given the noise properties and the known correlation properties of noise free images. The case of spike sorting is equivalent to a one-dimensional "pixel sequence" where the pixel value is replaced by the label of the spike. More generally it is a special case of a Potts spin-glass model encountered in statistical physics (Newman and Barkema 1999
; Wu 1982
). Thanks to this analogy solutions developed to study Potts models (Landau and Binder 2000
; Newman and Barkema 1999
) and to solve the image restoration problem (Geman and Geman 1984
) can be tailored to the spike-sorting problem. These solutions rely on a Markov chain that has the posterior density of model parameters and configurations given the observed data as its unique stationary density. This Markov Chain is stochastically simulated on the computer. This general class of methods has been used for 50 years by physicists (Metropolis et al. 1953
) who call it Dynamic Monte Carlo and for 20 years by statisticians (Fishman 1996
; Geman and Geman 1984
; Liu 2001
; Robert and Cassela 1999
) who call it Markov Chain Monte Carlo (MCMC).
Our purpose in this paper is 2-fold: First to demonstrate that a particular MCMC method allows incorporation of more realistic data generation models and by doing so to perform reliable spike-sorting even in difficult cases (e.g., in the presence of neurons generating 2 well-separated clusters in feature space). Second, to explain how and why this method works, thereby allowing its users to judge the quality of the produced classification, to improve the algorithm implementing it, and to adapt it to their specific situations. Our method does not yet provide an automatic estimate of the number of neurons present in the data, although we illustrate how this critical problem can be addressed by the user.
|
|
METHODS |
|---|
|
In the present manuscript we will make the following assumptions about data generation.
These assumptions have been chosen to show as simply as possible the implementation of our method and should not be taken as intrinsic limitations of this method. They constitute, moreover, a fairly good first approximation of real data in our experience. More sophisticated models where the next ISI of a neuron depends on the value of the former one (Johnson et al. 1986
) could easily be included. In the same vein, models where the amplitude of a spike depends on several of the former ISIs could be considered. The Gaussian white noise assumption means that the analyst has "whitened" the noise before starting the spike-sorting procedure (Pouzat et al. 2002
).
INTERSPIKE INTERVAL DENSITY. We will use in this paper a log-Normal density for the ISIs. This density looks like a good next guess, after the exponential density, when one tries to fit empirical ISI densities. It is unimodal, exhibits a refractory period, rises "fast," and decays "slowly." The version of the log-Normal density we are using depends on 2 parameters: a dimensionless shape parameter
and a scale parameter s measured in seconds. The probability density for a realization of a log-Normal random variable I with parameters
and s to have the value i is
![]() | (1) |
In what follows we will use a shorter notation
![]() |
and s. A discussion of the properties of the log-Normal distribution can be found in the e-Handbook of Statistical Methods.
SPIKE AMPLITUDE DYNAMICS. We will consider events described by their occurrence time and their peak amplitude measured on one or several recording sites. This makes the equations more compact, but full waveforms can be accommodated if necessary. Following Fee et al. (1996b
) we will describe the dependence of the amplitude on the ISI by an exponential relaxation
![]() | (2) |
is the inverse of the relaxation time constant (measured in 1/s), P is the vector of the maximal amplitude of the event on each recording site (i.e., this is the amplitude observed when i >>
-1), and
is the maximal modulation [i.e., for i <<
-1 we have A(i)
P(1 -
)]. It is clear from this equation that the amplitudes of events generated by a given neuron are modulated by the same relative amount on each recording site. This is one of the key features of modulation observed experimentally (Gray et al. 1995
To keep equations as compact as possible we will assume that the amplitudes of the recorded events on each recording site have been normalized by the SD of the noise on the corresponding site. A and P in Eq. 2 are therefore dimensionless and the noise SD equals 1. Combining Eq. 2 with our third model hypothesis (independent Gaussian noise corruption) we get for the density of the amplitude vector a of a given neuron conditioned on the ISI value i
![]() | (3) |
![]() | (4) |
We get the density of the couple (i, a) by combining Eq. 1 with Eq. 3
![]() | (5) |
NOTATIONS FOR THE COMPLETE MODEL AND THE DATA. We now have, for each neuron in the model, 2 parameters (
and s) to specify the ISI density and 2 + number of recording sites parameters (e.g., for tetrode recordings: P1, P2, P3, P4,
,
) to specify the amplitude density conditioned on the ISI value (Eq. 3). That is, in the case of tetrode recording: 8 parameters per neuron. In the sequel we will use the symbol
to refer to the complete set of parameters specifying our model. That is for a model with K neurons applied to tetrode recordings
![]() | (6) |
![]() | (7) |
Data augmentation, configurations, and likelihood function
DATA AUGMENTATION AND CONFIGURATIONS. Until now we have specified our model and our data representation but what we are really interested in is to find the neuron of origin of each individual spike, or more precisely, for a model with K neurons, the probability for each spike to have been generated by each of the K neurons (remember that we are doing soft classification). To do that we will associate with each spike j a latent variable, cj
{1, 2,..., K}. When cj has value 2, that means that spike j has been "attributed" to neuron 2 of the model. We will use C to refer to the set of N variables c
![]() | (8) |
LIKELIHOOD FUNCTION. Because C has been introduced it is pretty straightforward to compute the likelihood function of the "augmented" data, given specific values of
: L(Y, C|
). We remind the reader here that the likelihood function is proportional to the probability density of the (augmented) data given the parameters. Figure 1 illustrates how this is done on a simple case with 2 units in the model (K = 2) and a single recording site.
|
![]() | (9) |
) themselves are product of terms like Eq. 5 In the case illustrated on Fig. 1 we will have for the first train
![]() | (10) |
). For the second neuron of Fig. 1 we obtain
![]() | (11) |
Posterior and prior densities, a combinatorial explosion
POSTERIOR DENSITY. Now that we know how to compute the likelihood function we can write the expression of the posterior density of model parameters and configuration given the data using Bayes formula
![]() | (12) |
prior(
) is the prior density of the parameters and
![]() | (13) |
post(
, C|Y) we will have done our job, for then the soft classification is given by the marginal density of C
![]() | (14) |
, that is, the uncertainty on the parameters is automatically taken into account.
PRIOR DENSITY. We will assume here that several thousand spikes have been collected and that we perform the analysis by taking chunks of data, say the first 5,000 spikes, then spikes 5,001 to 1,000 and so on. For the first chunk we will assume we know "little" a priori and that the joint prior density
prior(
) can be written as a product of the densities for each component of
, that is
![]() |
(Pq,1· · · 4) are null below 0 and above +20 (remember we are working with normalized amplitudes). We will reflect our absence of prior knowledge about the amplitudes by taking a uniform distribution between 0 and +20. The
value reported by Fee et al. (1996b
must, moreover, be smaller than
, so we took a prior density uniform between 10 and 200 s-1.
must be
1 (the amplitude of a spike from a given neuron on a given recording site does not change sign) and
0 (spikes do not become larger upon short ISI), so we used a uniform density between 0.1 and 0.9 for
. An inspection of the effect of the shape parameter
on the ISI density is enough to convince an experienced neurophysiologist that empirical unimodal ISI densities from well-isolated neurons will have
[0.1, 2]. We therefore took a prior density uniform between 0.1 and 2 for
. The same considerations led us to take a uniform prior density between 0.005 and 0.5 for s.
When one analyzes the following data chunks, say spikes 5,001 to 1,000, the best strategy is to take as a prior density for
the posterior density from the previous chunk
![]() |
A COMBINATORIAL EXPLOSION. We have so far avoided considering the normalizing constant Z of Eq. 12, whose expression is given by Eq. 13. A close examination of this equation shows that it involves a multidimensional integration in a rather highly dimensional space (e.g., for a model with 10 neurons and data from tetrode recordings, we have 80 components in
) and a summation over every possible configuration, that is a sum of KN terms. That is really a lot, to say the least, in any realistic case (say, N = 1,000 and K = 10)! One can wonder how we will manage to compute Z and therefore implement our approach. Fortunately, we do not really need to compute it, as will now be explained.
The Markov Chain Monte Carlo method
AN ANALOGY WITH THE POTTS MODEL IN STATISTICAL PHYSICS. Let us define
![]() | (15) |
, C|Y) is an unnormalized version of
post(
, C|Y), the posterior density, and
![]() | (16) |
![]() | (17) |
= (1/kT) = 1. If one interprets E as an energy,
as an "inverse temperature," Z as a partition function, one sees that the posterior density,
post(
, C|Y), is the canonical density (or canonical distribution) used in statistical physics (Landau and Binder 2000
is given, then Eq. 17 gives the probability to find the "crystal" in a specific configuration C.
The point of this analogy will now become clear. Physicists are interested in estimating expected values (which are what they can experimentally measure) of functions that can be easily computed for each configuration. For instance, they want to compute the expected energy of the "crystal"
![]() | (18) |
post(
, C|Y). Then an estimate of
is the empirical average
![]() | (19) |
A SOLUTION BASED ON A MARKOV CHAIN. For reasons discussed in detail by Newman and Barkema (1999
) and Liu (2001
) the draws cannot be generated in practice by "direct" methods like rejection sampling (Fishman 1996
; Liu 2001
; Robert and Casella 1999
). By "direct" we mean here that the C(t) are independent and identically distributed draws from
post(
, C|Y) for a given
value. Instead we must have recourse to an artificial dynamics generating a sequence of correlated draws. This sequence of draws is in practice the realization of a Markov chain. That means that in the general case, where we are interested in getting both
and C, we will use a transition kernel: T[
(t), C(t)|
(t-1), C(t-1)], and simulate the chain starting from [
(0), C(0)], chosen such that
post[
(0), C(0)|Y] > 0. Moreover, we will build T such that the chain converges to a unique stationary distribution given by
post(
, C|Y). The estimator Eq. 19 of the expected value Eq. 18 is then still correct even though the successive [
(t), C(t)] are correlated; we just need to be cautious when we compute the variance of the estimator (Janke 2002
; Sokal 1989; Empirical averages and SDs). By still correct we formally mean that the following equality holds
![]() | (20) |
Empirical averages and SDs
As explained at the beginning of Slow relaxation and REM, once our Markov chain has reached equilibrium, or at least once its behavior is compatible with the equilibrium regime, we can estimate values of parameters of interest as well as errors on these estimates. The estimator of the probability for a given spike, say the 100th, to originate from a given neuron, say the second, is straightforward to obtain. If we assume that we discard the first ND = 15,000 on a total of NT = 20,000 iterations we have
![]() | (21) |
![]() | (22) |
![]() |
To obtain the error on such estimators we have to keep in mind that the successive states [
(t), C(t)] generated by the algorithm are correlated. Thus, we cannot use the empirical variance divided by the number of generated states to estimate the error (Janke 2002
; Sokal 1989). As explained in detail by Janke (2002
) we have to compute for each parameter
i of the model the normalized autocorrelation function (ACF),
norm(l;
i), defined by
![]() | (23) |
autoco(
i)
![]() | (24) |
starts oscillating around 0. Using an empirical variance,
2(
i) of parameter
i, defined in the usual way
![]() | (25) |
of
becomes
![]() | (26) |
The first consequence of the autocorrelation of the states of the chain is therefore to reduce the effective sample size by a factor of 2
autoco(
i). This gives us a first quantitative element on which different algorithms can be compared (as explained in The MetropolisHastings algorithm of the APPENDIX, the MH algorithm does in fact give us a lot of freedom on the choice of proposal transition kernels). It is clear that the faster the autocorrelation functions of the parameters fall to zero, the greater the statistical efficiency of the algorithm. The other quantitative element we want to consider is the computational time,
cpu, required to perform one MC step of the algorithm. One could for instance imagine that a new sophisticated proposal transition kernel allows us to reduce the largest
autoco of our standard algorithm by a factor of 10, but at the expense of an increase of
cpu by a factor of 100. Globally the new algorithm would be 10 times less efficient than the original one. What we want to keep as small as possible is therefore the product
autoco
cpu. With this efficiency criterion in mind, the replica exchange method described in Slow relaxation and REM becomes even more attractive.
Initial guess
MCMC methods are iterative methods that must be started from an initial guess. We chose randomly with a uniform probability 1/N as many actual events as neurons in the model (K). That gave us our initial guesses for the Pq,i.
was set to
min for each neuron. All the other parameters were randomly drawn from their prior distribution. The initial configuration was generated by labeling each individual spike with one of the K possible labels with a probability 1/K for each label (this is the
= 0 initial condition used in statistical physics).
Data simulation
We have tried to make the simulated tetrode data set used in this paper a challenging one. First the noise corruption was generated from a Student's t density with 4 degrees of freedom instead of a Gaussian density as assumed by our model (Eq. 3). Second, among the 6 simulated neurons only 3 had an actual log-Normal ISI density. Among the 3 others, one had a gamma density
![]() | (27) |
is a shape parameter. One neuron was a doublet-generating neuron and the third one had a truncated log-Normal ISI density. The truncation of the latter came from the fact that 18% of the events it generated were below the "detection threshold," which would have been at 3 noise SD. The doublet-generating neuron was obtained with 2 log-Normal densities, one giving rise to short ISIs (10 ms) the other one to "long" ones (60 ms); the neuron was moreover switching from the short, respectively the long, ISI generating state to the long, respectively the short, ISI generating state after each spike, giving rise to a strongly bimodal ISI density (Fig. 9C). The mean ISI of these 6 neurons ranged from 10 to 210 ms (Table 1). A "noise" neuron was moreover added to mimic the collective effect of many neurons located far away from the recording sites, which would therefore not be detected most of the time. This "noise" neuron was obtained by generating first a sequence of time points exponentially distributed. A random amplitude given by the background noise density (t density with 4 degrees of freedom; see above) was associated to each point and the point was kept in the "data" sample only if its amplitude exceeded the detection threshold (3 noise SD) on at least one of the 4 recording sites. The initial exponential density for this "noise" neuron was set such that a preset fraction of events in the final data sample would arise from it (in that case 15%, or 758 events on a total of 5,058 events). Details on the pseudorandom number generators required to simulate the data can be found in Random number generators for the simulated data of the APPENDIX.
|
|
Codes were written in C and are freely available under the Gnu Public License at our web site: http://www.biomedicale.univ-paris5.fr/physcerv/Spike-0-Matic.html. The free software Scilab (http://www-rocq.inria.fr/scilab/) was used to generate output plots as well as the graphical user interface, which comes with our release of the routines. The GNU Scientific Library (GSL: http://sources.redhat.com/gs1/) was used for vector and matrix manipulation routines and (pseudo)random number generators (Uniform, Normal, log-Normal, Gamma). The GSL implementation of the MT19937 generator of Matsumoto and Nishimura (1998
) was used. This generator has a period of 219,937 - 1. Because the code requires a lot of exponential and logarithm evaluations, the exponential and logarithm functions were tabulated and stored in memory, which reduced the computation time by 30%. Codes were compiled and run on a PC laptop computer (Pentium IV with CPU speed of 1.6 GHz, 256 MB RAM) running Linux. The gcc compiler (http://www.gnu.org/software/gcc/gcc.htm) version 3.2 was used.
|
|
RESULTS |
|---|
|
The performances of our algorithm will be illustrated in this paper with a simulated data set. The data are supposed to come from tetrode recordings. Six neurons are present in the data together with a "noise" neuron supposed to mimic events rarely detected from neurons far away from the tetrode's recording sites. The data set has been made challenging for the algorithm by producing systematic deviations with respect to our model assumptions (Data simulation). This data set is supposed to come from 15 s of recording during which 5,058 events were detected. Wilson plots as they would appear to the user are shown on Fig. 2A. Figure 2B shows the same data with colors corresponding to the neuron of origin. The parameters used to simulate each individual neuron are given in Table 1. The reader can easily see that, whereas one cluster has very few points (purple cluster on Fig. 2B), others have a lot (brown, green, and blue clusters). Indeed, the smallest cluster contains 73 events, whereas the biggest contains 1,474 events. Some clusters are very elongated (e.g., green and red clusters) and one neuron, the doublet-generating neuron, even gives rise to 2 well-separated clusters (red). The "noise" neuron cluster (black) is much more spread out than the others. Two cluster pairs always exhibit an overlap (the yellow-black and the green-red pairs).
|
|
We want to illustrate here some basic features of our algorithm dynamics as well as how model comparison can be done. In its present form, our approach requires the user to decide what is the proper model (i.e., the proper number of neurons). This decision is based on the examination of "empirical" features associated with the different models. To obtain these empirical features, we have to perform the same computation on the different models, although the computation time increases with the number of events considered. It is therefore a good idea to start with a reduced sample that contains enough events for the model comparison to be done and little enough for the computation to run quickly. In this section we will work with the first 3 s of data (1,017 events), which represent approximately 20% of the total sample (5,058 events). We will focus here on a model with 7 neurons, but a similar study was performed with models containing from 4 to 10 neurons.
EVIDENCE FOR META-STABLE STATES. To explore a model with a given number of neurons we start by simulating 10 different and independent realizations of a Markov chain. The way random initial guesses for these different realizations are obtained is explained in Initial guess. For each realization, one Monte Carlo (MC) step consists in an update of the label of each of the 1,017 spikes (SPIKE LABEL TRANSITION MATRIX), an update of the amplitude parameters (P,
,
) (AMPLITUDE PARAMETERS TRANSITION KERNELS) and of the 2 parameters of the ISI density, s and
(SCALE PARAMETER TRANSITION KERNEL & SHAPE PARAMETER TRANSITION KERNEL) for each neuron. As explained in AMPLITUDE PARAMETERS TRANSITION KERNELS, the amplitude parameters are generated with a piecewise linear approximation of the corresponding posterior conditional. This piecewise linear approximation requires a number of discrete sampling points to be specified. Because when we start our model exploration, we know little about the location of the corresponding posterior densities, we used during these first 2,000 MC steps, 100 sampling points regularly spaced on the corresponding parameter domain defined by the prior (PRIOR DENSITY). Figure 4A illustrates the energy evolution of the 10 different trials (realizations). The reader not familiar with our notion of energy (Eq. 16) can think of it as an indication of the quality of fit. The lower it is, the better the fit. Indeed, if we were working with a template-matching paradigm, the sum of the squared errors would be proportional to our energy. The striking feature on Fig. 4A is the presence of "meta-stable" states (a state is defined by a configuration and a value for each model parameter). The trials can spend many steps at an almost constant energy level before making discrete downward steps (like the one falling from 5,800 to 4,300 before the 500th step). If we look at the parameters values and configurations generated by the different trials (not shown), the ones that end above 5,000 "miss" the purple cluster on Fig. 2B. The time required to perform 2,000 MC steps with 7 neurons in the model was roughly 9 min, meaning that 1.5 h was required to obtain the data of Fig. 4A. This time (for 2,000 MC steps) grew from 8' for a model with 4 neurons to 10' for a model with 10 neurons.
|
) combined with exchanges between replicas states. These inverse temperatures are purely artificial and are defined by analogy with systems studied in statistical physics. The idea is that the system with a small
value will experience little difficulty in crossing the energy barriers separating local minima (meta-stable states) and will therefore not suffer from the slow relaxation problem. The replica exchange dynamics will allow "cold" replicas (replicas with a larger
value) to take profit of the "fast" state space exploration performed by the "hot" replica (small
). The efficiency of the method is demonstrated by the black trace on Fig. 4B. Here we again took the last state of the best trial of Fig. 4A to restart the Markov chain. We used 8 different
values: 1, 0.8, 0.6, 0.5, 0.45, 0.4, 0.35, and 0.3. Five different runs with 4,000 steps were performed and the energy trace at
= 1 is shown in its integrity; this was the best of the 5. The final mean energy of the 5 trials is shown on the right side of the figure (triangles). The Markov chains simulated with the REM relax clearly faster than the ones simulated with the single replica method. The number of MC steps has been chosen so that the same computational time (32') was required for both runs, the long one (32,000 steps) with a single replica at
= 1 (gray trace) and the short one (4,000 steps) with 8 replicas. The reader can remark that for these trials the time required to perform 2,000 MC steps for a single replica is 2', whereas it was previously much larger (9'). This is attributed to the reduction in the number of sampling points used for the posterior conditional, 13 instead of 100 (AMPLITUDE PARAMETERS TRANSITION KERNELS). A significant amount of the computational time is therefore spent generating new amplitude parameters values. MODEL CHOICE. Figure 5 illustrates what we meant by "empirical features associated with different models" at the beginning of this section. We have now performed the same series of runs with 7 different models having between 4 and 10 neurons. One feature we can look at in the output of these runs is the most likely classification produced (Empirical averages and SDs, Eq. 21). By "most likely" we mean that the label of each spike has been forced to its most likely value. Figure 5A shows one of the Wilson plots with different neuron numbers, where the events have been colored according to their most likely label. The good behavior of the algorithm appears clearly. When asked to account for the data with 6 neurons, it mainly lumps together the "small" neuron (yellow on Fig. 2B, neuron 6 in Table 1) with the "noise" neuron (black on Fig. 2B, neuron 7 in Table 1). With 7 neurons, the actual number of neurons, it splits the previously lumped clusters into 2 (although it attributes too many spikes to the small neuron and too few to the noise one; see Long runs with full data set). With 8 neurons, it further splits the noise cluster. In fact it creates a "noise" cluster (black on the figure) whose peak amplitude on the fourth recording site is roughly 3 times the recording noise SD and another one (clear blue) whose peak amplitude on this same site is smaller than 3. We remind the reader that our detection threshold is here set at 3 noise SD. The algorithm does not split the doublet-firing neuron (red) into 2 neurons, although this neuron gives rise to two well-separated clusters. With 9 neurons, it splits even further the previous clear blue noise "neuron" into 2 neurons, one with a peak amplitude roughly equal to the detection threshold on the first recording site (pink cluster) and one whose peak amplitudes on sites 1 and 4 are smaller than 3. With 10 neurons (not shown) we end up with the actual noise neuron, resulting in 4 neurons with an average peak amplitude slightly above the detection threshold on one of the recording sites and very close to zero on the 3 others.
|
parameters of this neuron (10 ms and 0.2), we can compute the most likely ISI: s exp (-
2) = 9.6 ms. Then, the most likely peak amplitude is given by Eq. 2. It turns out that all the models considered gave the same and correct values for the 2 ISI density parameters (within error bars, not shown). The models with 6, 7, and 9 neurons gave for the amplitude parameters (P1,
,
) of the brown neuron: 7.8, 0.45, 114 (the amplitudes on the different recording sites were the same within error bars and the parameters were the same, within error bars, across models, not shown), whereas the model with 8 neurons gave 9.9, 0.45, and 39. Then the most likely amplitude for the models with 6, 7, or 9 neurons was 6.7, whereas it was 6.8 for the model with 8 neurons. That explains the larger frame of the brown neuron on Fig. 5B. In practice it turns out that the observation of the parameters alone is sufficient to find which neuron in a given model corresponds to which neuron in another model. Based on the considerations exposed in these last 2 paragraphs it seems reasonable to keep working with a model with 7 neurons, acknowledging the fact that the model cannot account very well for the noise neuron. In the sequel we will therefore proceed with the full data set (5,058) spikes and consider only the model with 7 neurons.
|
The analysis of the full data set (5,058 spikes or 15 s of data) was done in 2 stages. During the first 4,000 MC steps we fixed the model parameters at their mean values computed from the last 1,000 MC steps of the REM run with the reduced data set (Fig. 4B, black trace) and we updated only the configuration. The initial configuration was, as usual, randomly set. We used the last 1,000 MC steps of this first stage to take "measurements" and compare the algorithm's output with the actual values of different data statistics (Figs. 8 and 9). The "complete" algorithm was used during the 21,000 MC steps of the second stage. By complete algorithm we mean that at each MC step a new configuration was drawn as well as new values for the model parameters. The 13 sampling points of the posterior conditionals for the amplitude parameters between steps 4,001 and 6,000 were set using the last 1,000 MC steps of the REM run with the reduced data set. The last 2,000 MC steps of the second stage were used to take measurements. We use here a very long run to illustrate the behavior of our algorithm. Such long, and time-consuming, runs are not required for a daily use of our method. The time required to run the first stage (4,000 MC steps) was 3 h, whereas 26 h were required for the second (21,000 MC steps).
|
to be used (for a given range) as the number of spikes in the data set increases because the width of the energy histograms (Fig. 6B) is inversely proportional to the square root of the number N, of events in the sample (Hansmann 1997
grows therefore as
(if we had used the same 8 inverse temperatures as during our model exploration with a reduced data set, we would have only every second histogram). The computation time per replica grows, moreover, linearly with N. We therefore end up with a computation time of the REM growing like N1.5. This justifies keeping the sample size small during model exploration. Figure 6C shows that the autocorrelation function of the energy at each temperature falls to zero rapidly (within 10 MC steps) which is an indication of the good performance of the algorithm from a statistical view point (Empirical averages and SDs). A pictorial way to think of the REM is to imagine that several "boxes" at different preset inverse temperatures are used and that there is one and only one replica per box. After each MC step, the algorithm proposes to exchange the replicas located in neighboring boxes (neighboring in the sense of their inverse temperatures) and this proposition can be accepted or rejected (Eq. A28). Then if the exchange dynamics works properly one should see each replica visit all the boxes during the MC run. More precisely, each replica should perform a random walk on the available boxes. Figure 6D shows the random walk performed by the replica which starts at
= 1. The ordinate corresponds to the box index (see legend). Between steps 1 and 25,000, the replica travels several times through the entire inverse temperature range.
|
A more compact way to present the information of Fig. 7 is to summarize the distributions that are approximately Gaussian by their means and autocorrelation corrected SDs. The other distributions can be summarized by a 95% confidence interval, the left boundary being such that 2.5% of the generated parameters values were below it and the right boundary such that 2.5% of the generated values were above it, as shown in Table 2 (compare with the actual values of Table 1).
|
j. A false positive (Fig. 8B) is the reverse situation, a spike that was generated by neuron i
j and which ends up with label j. Two false positive and false negative values corresponding to the 2 measurements periods defined in Fig. 6A are given for each neuron on Fig. 8. Three features appear already clearly. First, the errors are smaller for the neurons that correspond to the model (neurons 1, 4, and 5) than for the others. Second, the difference between the 2 measurements periods is surprisingly small (given the significant difference in computation times). Third, most of the errors are attributed to noise events (from neuron 7) wrongly attributed to the small neuron (neuron 6), as expected from the Wilson plots. Figure 8C shows for each neuron, the sum of false positive and negative divided by the number of spikes actually generated by the neuron. The luxury simulation performed with the complete algorithm can be seen to decrease the fraction of misclassified spikes between the small neuron (6) and the noise neuron (7), as well as the fraction of misclassified spikes for the doublet-firing neuron (3). Overall, if we consider the full data set, we end up with 10.4% of the spikes misclassified from the first measurements period and 8.7% from the second one. If we (wisely) decide that neurons 6 and 7 cannot be safely distinguished and decide to keep only the first 5 neurons we misclassify 3.9% of the spikes during the first period and 3.5% during the second. ISI HISTOGRAMS ESTIMATES. The last statistics shown are the ISI histograms. Our algorithm generates at each MC step a new configuration, that is, a new set of labels for the spikes. An ISI histogram can therefore be computed for each neuron after each algorithm step. Then the best estimate we can get for an actual histogram (i.e., the histogram obtained using the true labels) is the average of the histograms computed after each step. This is better than getting first the most likely configuration (as we did in the previous paragraph) and then the histogram for each neuron because this latter method introduces a bias in the estimate. We have done this computation for each neuron during the 2 periods. The results together with the actual histograms are shown on Fig. 9. Again, our algorithm generates very good estimates for the 5 good neurons. In particular it generates the proper strongly bimodal ISI histogram for the doublet-firing neuron (Fig. 9C), whereas the model does use a unimodal log-Normal ISI density to describe this neuron. Here again that gain in accuracy provided by the very long run with the complete algorithm is rather modest.
|
|
DISCUSSION |
|---|
|
We have shown the presence of meta-stable states, which are, we think, an intrinsic property of models with "interacting" spikes. We have illustrated the potential problems resulting from these meta-stable states, which could introduce a bias in our algorithm's output. However, using a powerful methodology recently introduced in statistics and computational physics, the replica exchange method, we were able to circumvent the slow relaxation resulting from the meta-stable states.
We have deliberately used simulated data that did not correspond to our model assumptions, but which were hopefully realistic enough. Our algorithm with its "simple" data generation model performed very well for the neurons that generated events larger than the detection threshold, even for a neuron which generated two separated clusters on the Wilson plots. As far as we know, no other algorithm would automatically properly cluster the events of such a neuron. Clearly, such a performance is possible only if both the spike occurrence times and amplitude dynamics are taken into account. The robustness of the algorithm's output with respect to overclustering was moreover demonstrated. That being said, if a recording noise model based on a Student's t distribution turns out to be better than a Gaussian model, our algorithm can be trivially modified to work with the former. It can be easily modified as well to include an explicit "noise neuron" (i.e., a neuron with an exponential ISI density and a maximal amplitude below the detection threshold on all recording sites). We permit reader to download our implementation of the algorithm and check that an almost perfect classification (and model parameter estimation) can be obtained for data that correspond to the model, even when strong overlap between clusters are observed on the Wilson plots.
We have considered spikes described by their peak amplitude(s) instead of their waveforms, although it would not require a tremendous algorithmic change to deal with waveforms (see for instance Pouzat et al. 2002
). Caution would nevertheless be required to deal with superpositions of spikes, but as long as spikes do not occur exactly at the same time, superpositions will show up as "interaction" terms. That is, Eq. 2 would have to be modified to include a baseline change because of neighboring spikes.
The major drawback of our algorithm is the rather long computational time it requires. It should nevertheless be clear that the user can choose between different implementations of our approach. A fast one, where the model parameters are set during a prerun with a reduced data set, leads to an already very accurate classification of spikes from "good" neurons (a good neuron generates events clearly above the detection threshold). We think that this use of our method should satisfy most practitioners. For a high accuracy of the model parameters estimates the "full" version of our algorithm should be chosen. This version is now clearly too time consuming to be systematically used. This long computational time can be easily reduced, however, if one realizes that the REM can be trivially parallelized (Hansmann 1997
; Hukushima and Nemoto 1996
; Mitsutake et al. 2001
). That is, if one has 15 PCs (which is not such a big investment anymore), one can simulate each replica of Long runs with full data set on a single CPU and when the replica exchange is accepted just "swap the inverse temperatures of the different CPUs" (rather than swapping the states of replicas simulated on different CPUs). We can therefore expect to reduce the computation time by an order of magnitude (even more if one considers that recent CPUs run at 3 GHz, whereas the one used in this paper ran at 1.6 GHz). Preliminary tests on a Beowulf cluster confirmed this expectation. Other improvements can be brought as well, in particular for the amplitude parameters generation. For the latter, a Langevin diffusion or a simple random walk can be used to generate the proposed moves (Besag 2001
; Celeux et al. 2000
; Liu 2001
; Neal 1993
). We have in fact already done it (Pouzat et al. unpublished observations) and found that the computation time required for the amplitude parameters was reduced by a factor of 10 (compared with a piecewise linear approximation of the posterior conditional based on 13 sample points).
Finally, as we mentioned in the INTRODUCTION, our method is semiautomatic because user input is still required to choose the "correct" number of active neurons in the data. To make the method fully automatic we will have to do better and that fundamentally means estimating the normalizing constant Z of Eqs. 12 and 13, which is the probability of the data. Indeed Bayesian model comparison requires the comparison (ratio) of data probability under different models (Gelman and Meng 1998
; Green 1995
) as explained, in a spike-sorting context, by Nguyen et al. (2003
). This task of estimating Z could seem daunting but luckily is not. As we said, Dynamic MC and MCMC methods have been used for a long time, which means that other people in other fields already had that problem. Among the proposed solutions (Gelman and Meng 1998
; Green 1995
; Nguyen et al. 2003
), what physicists call thermo-dynamic integration (Frenkel and Smit 2002
; Landau and Binder 2000
) seems very attractive because it is known to be robust and it requires simulations to be performed between
= 1 and
= 0. That is precisely what we are already (partially) doing with the REM. Of course the estimation of Z is only half of the problem. The prior distribution on the number of neurons in the model has to be set properly as well. Increasing the number of neurons will always lead to a decrease in energy which will give larger Z values (the derivative of the log of Z being proportional to the opposite of the energy; Frenkel and Smit 2002
), we will therefore have to compensate for this systematic Z increase with the prior distribution (Pouzat et al. unpublished observations).
|
|
APPENDIX |
|---|
|
The theory of Markov chains (Brémaud 1998
) tells us that given a probability density like
post, and a transition kernel T, an equation like Eq. 20 will hold for any function E and for any initial "state" [
(0), C(0)], if T is irreducible and aperiodic and if
post is stationary with respect to T, that is, if
![]() | (A1) |
, C) in a finite number of steps. Aperiodic means, loosely speaking, that there is no temporal order in the way the chain returns to any state it visited. These 2 properties are often lumped together in the notion of ergodicity. In practice, if a transition kernel is nonzero for any pair of arbitrary states (
a, Ca) and (
b, Cb), then it is both irreducible and aperiodic (ergodic). Our problem thus becomes to find an ergodic T for which Eq. A1 holds.
It turns out that there is a very general prescription to build a transition kernel T with the desired properties and that, in fact, we do not even have to build it explicitly. We first need a "proposal" transition kernel g[
, C|
(t-1), C(t-1)], which is itself ergodic on the same space as
post.
We then apply the following algorithm:
Given a state [
(t-1), C(t-1)]
Generate:
![]() | (A2) |
Take:
![]() |
![]() | (A3) |
post up to a normalizing constant because Eq. A3 involves a ratio of
post values in 2 states. In other words it requires only the capability to compute the energy of any state. An initial version of this algorithm was given by Metropolis et al. (1953
The reader can see that the MetropolisHastings (MH) algorithm is indeed a very large family of algorithms because there exists a priori an infinite number of transitions g from which a specific procedure can be built. We therefore have to find at least one g and if we find several we would like to have a way to compare the resulting MH transition kernels (T). A common way to proceed for complicated models, like the one we are presently dealing with, is to split the transition kernel T in a series of parameter and label specific transition kernels (Besag 2001
; Fishman 1996
). Let us define
![]() | (A4) |
![]() | (A5) |
![]() | (A6) |
![]() | (A7) |
![]() | (A8) |
![]() | (A9) |
![]() | (A10) |
post to be stationary with respect to T are as follows:
-i, TC-j) is such that
![]() | (A11) |
-i,
i = a, C) and (
-i,
i = b, C) such that
![]() | (A12) |
-i, TC-j) is such that the detailed balance condition is met
![]() | (A13) |
![]() |
![]() |
![]() | (A14) |
post(
i|
-i, C, Y) is the posterior conditional density of
i. There are, of course, equivalent equations for the label-specific transition kernels (TC).
A straightforward way to obtain T
-i and TC-i with the desired properties is to find parameter- and label-specific proposal transition kernels (g
-i and gC-i) respecting Eq. A11 in combination with an acceptance probability like Eq. A3, that is, to have an MH rule for each parameter of the model and for each label of the configuration. It is indeed easy to show that such a construct gives rise to T
-i and TC-i respecting the detailed balance condition because, assuming a
b
![]() | (49) |
The detailed balance condition imposed on the parameter specific transition kernels implies the stationarity condition on the resulting "complete" transition kernel.We present here the justification for a simple model with 2 parameters (
1 and
2) and without latent variable, the extension to the full model considered in this paper being straightforward. So we have
![]() |
![]() | (A15) |
![]() | (A16) |
![]() | (A17) |
![]() | (A18) |
![]() | (A19) |
![]() | (A20) |
![]() | (A21) |
![]() | (A22) |
![]() | (A23) |
![]() | (60) |
It should become clear in the sequel that the appeal of the detailed balance comes from the (relative) ease with which it can be implemented (and checked). This is not the case of the stationarity condition which is much harder to check for an arbitrary transition kernel.
A close examination of the MH acceptance probability shows that a reasonable choice for the parameter specific proposal transition would be the corresponding conditional posterior density. For then we would have
![]() |
,
). The following section of the APPENDIX contains a detailed account of the algorithms used for the labels transition kernels (SPIKE LABEL TRANSITION MATRIX), the scale parameter transition kernel (SCALE PARAMETER TRANSITION KERNEL), the shape parameter transition kernel (SHAPE PARAMETER TRANSITION KERNEL), and the amplitude parameters transition kernels (AMPLITUDE PARAMETERS TRANSITION KERNELS). The algorithm of the later uses as proposal transition kernels, piecewise linear approximations of the posterior conditional densities. These piecewise linear approximations are computationally costly (they take a lot of time to compute) when one uses a lot of discrete posterior conditional evaluations, although they are then better approximations. To keep both a good precision and a reasonable computational cost we used a multiruns procedure where the sample generated during one run was used to optimize the few (13) locations at which the posterior conditional was evaluated (AMPLITUDE PARAMETERS TRANSITION KERNELS). This "position refinement" procedure is one of the reasons why we repetitively use successions of runs in the results section. Parameter-specific transition kernels
SPIKE LABEL TRANSITION MATRIX. Here ci takes integer values in {1,..., K}, so we are therefore, strictly speaking, dealing with a transition matrix rather than a transition kernel. As explained in The MetropolisHastings algorithm we build TC-i as a product of a proposal transition matrix gC-i and an acceptance probability "matrix" AC-i. We always propose a label different from the present one [gC-i (ci = a|ci = a) = 0], and the "new" labels are proposed according to their posterior conditional density
![]() | (A24) |
![]() | (A25) |
![]() |
The acceptance probability requires the calculation of the ratio of the posterior conditionals:
post(ci = 1|
, C-i, Y) and
post(ci = 2|
, C-i, Y). From Eq. 15 we have
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
A remark on numerical implementation.To avoid rounding errors and keep numerical precision we work with the log-likelihoods
![]() |
post(ci = b|
, C-i, Y) with
![]() |
{1,..., K}\{a} is drawn from the probability mass gC-i (ci = b|ci = a) using the inverse cumulative distribution function (Fishman 1996
Boundary treatment. From the statisticians' viewpoint the spike trains we are dealing with are censored data. They mean by that that there are missing data without which we cannot, strictly speaking, compute our likelihood function. The missing data here are the last spikes from each neuron that occurred before our recording started and the next spikes after our recording stopped. There are proper approaches to deal with such censored data (Gelman et al. 1995
; Robert and Casella 1999
) and we plan to implement them in a near future. In the present study we used a simple and quick way to deal with the end effects in our data. We used periodic boundary conditions; that is, the last spike from each neuron was considered as the spike preceding the first spike of the same neuron.
SCALE PARAMETER TRANSITION KERNEL. We consider here cases where
i of Eq. A14 corresponds to one of the K scale parameters, sq, where q stands for the neuron index. We will first show that the corresponding posterior conditional:
post(
i = sq|
-i, C, Y) can be written in a closed form. From Eq. 15 we have
![]() |
q. If we write {iq,1,..., iq,n}, the ISI computed from the spike train of neuron q in configuration C, we get
![]() |
, a short manipulation shows that
![]() |
In words: the posterior density of the log of the scale parameter sq is Normal with mean:
, and variance, (
q)2/nq. Our prior on sq is a uniform distribution between smin and smax (PRIOR DENSITY); we can therefore, using the Gibbs algorithm, generate sq as follows:
Norm
[smin, smax], set sq = s; otherwise, go back to 2
SHAPE PARAMETER TRANSITION KERNEL. Here
i is the shape parameter of neuron q,
i =
q. We again have to show that the posterior conditional can be written in a closed form. Using the approach of the previous section we quickly get
![]() |
![]() |
= (
q)2,
= (nq/2) - 1, and
= nq/2(
- log sq)2. To generate
q with the Gibbs algorithm we need to know how to generate random number with an inverse-Gamma density. It turns out that most analysis platforms (Scilab, MATLAB) and C libraries like GSL do not have an inverse-Gamma random number generator but do have a Gamma random number generator. A Gamma density can be written
![]() |
is Gamma with parameters
and
, then 1/
is inverse-Gamma with the same parameters. To generate
q using the Gibbs algorithm, we therefore perform
, then
otherwise go back to 1.
AMPLITUDE PARAMETERS TRANSITION KERNELS. As explained in The MetropolisHastings algorithm the amplitude parameters: P,
,
cannot be generated with the Gibbs algorithm because the corresponding posterior conditionals cannot be written in a closed form. For these parameters we had recourse to a piecewise linear proposal kernel. The principle of our method is illustrated here with a simple example, which can be thought of as the generation of the maximal peak amplitude of one of the neurons (q) on, say, the first recording site. Using a discrete set of points on the domain allowed by the corresponding priors: {Pq,1,1 = Pmin, Pq,1,2,..., Pq,1,m = Pmax}, an unnormalized value of the posterior conditional is computed
![]() |
Again the structure of the likelihood (Eq. 9) allows us to write
![]() |
![]() |
[Pq,1,i, Pq,1,i+1] sums to 1. Of course when we start the algorithm we do not know what is the best location for the discrete points Pq,1,i. We therefore used during a first run 100 regularly spaced points, although it turns out that computing at each MC step 100 values of the posterior conditional (for each amplitude parameter of the model) takes time. It is therefore very interesting, to speed up the sorting procedure, to use fewer points, which basically means forgetting about the points located where the posterior conditional is extremely small. To achieve this goal, at the end of each run, the last 1000 generated values of the parameter of interest (Pq,1) were used to get an estimate of the posterior marginal cumulative distribution of the parameter. Then 10 quantiles were located on this cumulative distribution, one discrete point was taken slightly smaller than the smallest generated value for that parameter, and the two extreme values allowed by the priors were used as discrete points as well. The new piecewise proposal kernel was therefore based on a sparser approximation. Slow relaxation and REM
One of the nice features of the MCMC approach, and in fact the key theoretical result justifying its growing popularity, is that the user can be sure that empirical averages (Eq. 19) computed from the samples it generates do converge to the corresponding expected values (Eq. 18). The not-so-nice feature, forgetting about the actual implementation details, which can be tedious, is that the user can run his/her MCMC algorithm only for a finite amount of time. Then the fact that empirical averages are approximations cannot be ignored. There are 3 sources of errors in such estimates. The first that comes to mind is the one resulting from the finite sample size, that is, the one we would have even if our draws where directly generated from
post. The difference between the MCMC-based estimates and estimates based on direct samples is that the variance of the estimators of the former have to be corrected to take into account the correlation of the states generated by the Markov chain, as explained in Empirical averages and SDs. The second source of error is a bias induced by the state [
(0), C(0)] with which the chain is initialized (Fishman 1996
; Sokal 1989). The bad news concerning this source of error is that there is no general theoretical result providing guidance on the way to handle it, but the booming activity in the Markov chain field already produced encouraging results in particular cases (Liu 2001
). The common wisdom in the field is to monitor parameters (and labels) evolution, and/or their functions like the energy (Eq. 16). Based on examination of evolution plots (e.g., Fig. 4, A and B, Fig. 6A) and/or on application of time-series analysis tools, the user will decide that "equilibrium" has been reached and discard the parameters values before equilibrium. More sophisticated tests do exist (Robert and Casella 1999
) but they were not used in this paper. These first 2 sources of error are common to all MCMC approaches. The third one appears only when the energy function exhibits several local minima. In the latter case, the Markov chain realization can get trapped in a local minimum, which could be a poor representation of the whole energy function. This sensitivity to local minima arises from the local nature of the transitions generated by the MH algorithm. That is, in our case, at each MC time step, we first attempt to change the label of spike 1, then the one of spike 2, and so forth, then the one of spike N, then we try to change the first component of
(P1,1), and so on until the last component (
K). That implies that if we start in a local minimum and if we need to change, say, the labels of 10 spikes to reach another lower local minimum, we could have a situation in which the first 3 label changes are energetically unfavorable (giving, for instance, an acceptance probability, Eq. A3, of 0.1 per label change), which would make the probability to accept the succession of changes very low (0.13), meaning that our Markov chain would spend a long time in the initial local minimum before "escaping" to the neighboring one. Stated more quantitatively, the average time the chain will take to escape from a local minimum with energy Emin grows as the exponential of the energy difference between the energy E* of the highest energy state the chain has to go through to escape and Emin
![]() |
Our chains will therefore exhibit an Arrhenius behavior. To sample more efficiently such state spaces, the replica exchange method (REM) (Hukushima and Nemoto 1996
; Mitsutake et al. 2001
), also known as the parallel tempering method (Falcioni and Deem 1999
; Hansmann 1997
; Yan and de Pablo 1999
), considers R replicas of the system with an increasing sequence of temperatures (or a decreasing sequence of
) and a dynamic defined by 2 types of transitions: usual MH transitions performed independently on each replica according to the rule defined by Eq. A10 and a replica exchange transition. The key idea is that the high temperature (low
) replicas will be able to easily cross the energy barriers separating local minima (in the example above, if we had a probability of 0.13 to accept a sequence of labels switch for the replica at
= 1, the replica at
= 0.2 will have a probability 0.13x0.2
0.25 to accept the same sequence). What is needed is a way to generate replica exchange transitions such that the replica at
= 1 generates a sample from
post defined by Eq. 17. Formally the REM consists in simulating, on an "extended ensemble," a Markov chain whose unique stationary density is given by
![]() | (A26) |
ee stands for "extended ensemble" (Iba 2001
1 >... >
R for convenience, and
![]() | (A27) |
to be different from 1. To construct our "complete" transition kernel we apply our previous procedure; that is, we construct it as a sequence of parameter, label, and interreplica-specific MH transitions. We already know how to obtain the parameter and label specific transitions for each replica. What we really need is a transition to exchange replicas, say the replicas at inverse temperature
i and
i+1, such that the detailed balance is preserved (Eq. A13)
![]() |
![]() |
Again we write Ti,i+1 as a product of a proposal transition kernel and an acceptance probability. Here we have already explicitly chosen a deterministic proposal (we only propose transitions between replicas at neighboring inverse temperatures), which gives us
![]() |
![]() | (A28) |
i+1) has a lower energy [E(
i, Ci)] than the "cold" one, the proposed exchange is always accepted. The exchange can pictorially be seen as cooling down the hot replica and warming up the cold one. Fundamentally this process amounts to making the replica that is at the beginning and at the end of the replica exchange transition at the cold temperature to jump from one local minimum (
i+1, Ci+1) to another one (
i, Ci). That is precisely what we were looking for. The fact that we can as well accept unfavorable exchanges (i.e., raising the energy of the "cold" replica and decreasing the one of the "hot" replica) is the price we have to pay for our algorithm to generate samples from the proper posterior (we are not doing optimization here).
For the replica exchange transition to work well we need to be careful with our choice of inverse temperatures. The typical energy of a replica (i.e., its expected energy) increases when
decreases (Fig. 6A). We will therefore typically have a positive energy difference:
E = Ehot - Ecold > 0 between the replicas at low and high
before the exchange. That implies that the acceptance ratio (Eq. A28) for the replica exchange will be typically smaller than 1. Obviously, if it becomes too small, exchanges will practically never be accepted. To avoid this situation we need to choose our inverse temperatures such that the typical product: 

E, where 
=
cold -
hot, is close enough to zero (Hukushima and Nemoto 1996
; Iba 2001
; Neal 1994
). In practice we used preruns with an a priori too large number of
s, checked the resulting energy histograms and kept enough inverse temperatures to have some overlap between successive histograms (Fig. 6B).
The replica exchange transitions were performed between each pair
i,
i+1 with an even, respectively odd, i at the end of each even, respectively odd, MC time step. With the replica exchange scheme, each MC time step was therefore composed of a complete parameter and label transition for each replica, followed by a replica exchange transition. This scheme corresponds to the one described by Hukushima and Nemoto (1996
). A rather detailed account of the REM can be found in Mitsutake et al. (2001
). Variations on this scheme do exist (Celeux et al. 2000
; Neal 1994
).
Random number generators for the simulated data
A vector n from an isotropic t density with
(>2) degrees of freedom and a SD equal to 1 (for each component) was generated with the following algorithm (Fishman 1996
; Gelman et al. 1995
):
Norm (0, 1)
2(
)
The Normal and
2 (pseudo-)random number generators are built-in features of Scilab (and MATLAB).
An isi from a log-Normal density with scale parameter s and shape parameter
was generated with the following algorithm:
Norm (0,
2) For the neuron with a Gamma ISI density, the Gamma random number generator of Scilab was used.
|
|
ACKNOWLEDGMENTS |
|---|
|
GRANTS
This work was supported in part by a grant from the Ministère de la Recherche (ACI Neurosciences Intégratives et Computationnelles, pré-projet, 20012003), by a grant from interÉtablissements Publiques Scientifiques et Techniques (Bioinformatique), and by the région Ile-de-France (programme Sésame). M. Delescluse was supported by a fellowship from the Ministère de l'Education National et de la Recherche.
|
|
FOOTNOTES |
|---|
Address for reprint requests and other correspondence: C. Pouzat, Laboratoire de Physiologie Cérébrale, CNRS UMR 8118, Université René Descartes, 45 rue des Saints Pères, 75006, Paris, France (E-mail: christophe.pouzat{at}univ-paris5.fr).
|
|
REFERENCES |
|---|
|
Besag J. Markov Chain Monte Carlo for Statistical Inference [online]. Working Paper No. 9. Seattle, WA: Center for Statistics and Social Sciences, Univ. of Washington, 2001. http://www.csss.washington.edu/Papers/wp9.pdf [28 Nov. 2003].
Binder K and Young AP. Spin glasses: experimental facts, theoretical concepts and open questions. Rev Mod Phys 58: 801-976, 1986.[CrossRef][Web of Science]
Brémaud P. Markov Chains: Gibbs Fields, Monte Carlo Simulations and Queues. New York: Springer-Verlag, 1998.
Celeux G, Hurn M, and Robert C. Computational and inferential difficulties with mixture posterior distributions. J Am Stat Assoc 95: 957-970, 2000.[CrossRef][Web of Science]
Falcioni M and Deem MW. A biased Monte Carlo scheme for zeolite structure solution. J Chem Phys 110: 1754-1766, 1999.[CrossRef]
Fee MS, Mitra PP, and Kleinfeld D. Automatic sorting of multiple-unit neuronal signals in the presence of anisotropic and non-Gaussian variability. J Neurosci Methods 69: 175-188, 1996a.[CrossRef][Web of Science][Medline]
Fee MS, Mitra PP, and Kleinfeld D. Variability of extracellular spike waveforms of cortical neurons. J Neurophysiol 76: 3823-3833, 1996b.
Fishman GS. Monte Carlo, Concepts, Algorithms and Applications. New York: Springer-Verlag, 1996.
Frenkel D and Smit B. Understanding Molecular Simulation. From Algorithms to Applications. San Diego, CA: Academic Press, 2002.
Gelman A, Carlin JB, Stern HS, and Rubin DB. Bayesian Data Analysis. Boca Raton, FL: Chapman & Hall/CRC, 1995.
Gelman A and Meng XL. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat Sci 13: 163-185, 1998.[CrossRef][Web of Science]
Geman S and Geman D. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6: 721-741, 1984.[CrossRef]
Gray CM, Maldonado PE, Wilson M, and McNaughton B. Tetrodes markedly improve the reliability and yield of multiple single-unit isolation from multi-unit recordings in cat striate cortex. J Neurosci Methods 63: 43-54, 1995.[CrossRef][Web of Science][Medline]
Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711-732, 1995.
Hansmann UHE. Parallel tempering algorithm for conformational studies of biological molecules. Chem Phys Lett 281: 140-150, 1997.[CrossRef][Web of Science]
Harris KD, Henze DA, Csicsvari J, Hirase H, and Buzsaki G. Accuracy of tetrode spike separation as determined by simultaneous intracellular and extracellular measurements. J Neurophysiol 84: 401-414, 2000.
Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 92-109, 1970.
Hukushima K and Nemoto K. Exchange Monte Carlo and application to spin glass simulations. J Phys Soc Jpn 65: 1604-1608, 1996.[CrossRef]
Iba Y. Extended ensemble Monte Carlo. Int J Mod Phys C 12: 623-656, 2001.[CrossRef]
Janke W. Statistical Analysis of Simulations: Data Correlations and Error Estimation [online]. 2002. http://www.fz-juelich.de/nic-series/volume10 [28 Nov. 2003].
Johnson DH. Point process models of single-neuron discharges. J Comput Neurosci 3: 275-299, 1996.[CrossRef][Web of Science][Medline]
Johnson DH, Tsuchitani C, Linebarger DA, and Johnson MJ. Application of a point process model to responses of cat lateral superior olive units to ipsilateral tones. Hearing Res 21: 135-159, 1986.[CrossRef][Web of Science][Medline]
Landau D and Binder K. A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge, UK: Cambridge Univ. Press, 2000.
Lewicki MS. Bayesian modeling and classification of neural signals. Neural Comput 6: 1005-1030, 1994.[CrossRef][Web of Science]
Lewicki MS. A review of methods for spike-sorting: the detection and classification of neural action potentials. Network Comput Neural Syst 9: R53-R78, 1998.[CrossRef]
Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer-Verlag, 2001.
Matsumoto M and Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans Model Comput Simul 8: 3-30, 1998.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, and Teller E. Equations of state calculations by fast computing machines. J Chem Phys 21: 1087-1092, 1953.[CrossRef]
Mitsutake A, Sugita Y, and Okamoto Y. Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers 60: 96-123, 2001.[CrossRef][Web of Science][Medline]
National Institute of Standards and Technology (NIST). NIST/SEMATECH e-Handbook of Statistical Methods [online]. http://www.itl.nist.gov/div898/handbook/index.htm [28 Nov. 2003].
Neal RM. Probabilistic Inference Using Markov Chain Monte Carlo Methods [online]. Technical Report CRG-TR-93-1. Toronto, Canada: Department of Computer Science, Univ. of Toronto, 1993. http://www.cs.toronto.edu/~radford/papers-online.html [28 Nov. 2003].
Neal RM. Sampling from Multimodal Distributions Using Tempered Transitions [online]. Technical Report No. 9421. Toronto, Canada: Department of Statistics, Univ. of Toronto, 1994. http://www.cs.toronto.edu/~radford/papers-online.html [28 Nov. 2003].
Newman MEJ and Barkema GT. Monte Carlo Methods in Statistical Physics. Oxford, UK: Oxford Univ. Press, 1999.
Nguyen DP, Frank LM, and Brown EN. An application of reversible-jump Markov chain Monte Carlo to spike classification of multi-unit extracellular recordings. Network Comput Neural Syst 14: 61-82, 2003.[CrossRef]
Pouzat C, Mazor O, and Laurent G. Using noise signature to optimize spike-sorting and to assess neuronal classification quality. J Neurosci Methods 122: 43-57, 2002.[CrossRef][Web of Science][Medline]
Quirk MC and Wilson MA. Interaction between spike waveform classification and temporal sequence detection. J Neurosci Methods 94: 41-52, 1999.[CrossRef][Web of Science][Medline]
Robert CP and Casella G. Monte Carlo Statistical Methods. New York: Springer-Verlag, 1999.
Rosales R, Stark A, Fitzgerald WJ, and Hladky SB. Bayesian restoration of ion channel records using hidden Markov models. Biophys J 80: 1088-1103, 2001.[Web of Science][Medline]
Sahani M. Latent Variable Models for Neural Data Analysis (PhD thesis). Pasadena, CA: California Institute of Technology, 2002.
Sokal AD. Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms. Cours de Troisième Cycle de la Physique en Suisse Romande [online]. http://citeseer.nj.nec.com/soka196monte.html [28 Nov. 2003].
Wu FY. The Potts model. Rev Mod Phys 54: 235-268, 1982.[CrossRef][Web of Science]
Yan Q and de Pablo JJ. Hyper-parallel tempering Monte Carlo: application to the Lennard-Jones fluid and the restricted primitive model. J Chem Phys 111: 9509-9516, 1999.[CrossRef]
This article has been cited by other articles:
![]() |
V. Ventura Traditional waveform based spike sorting yields biased rate code estimates PNAS, April 28, 2009; 106(17): 6921 - 6926. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| Visit Other APS Journals Online |