## Abstract

The phase-response curve (PRC), which relates the phase shift of an oscillator to an external perturbation, is an important tool for studying neurons and their population behavior. It can be experimentally estimated by measuring the phase changes caused by probe stimuli. These stimuli, usually short pulses or continuous noise, have a much wider frequency spectrum than that of neuronal dynamics. This makes the experimental data high dimensional, while the number of data samples tends to be small. Current PRC estimation methods have not been optimized for efficiently discovering the relevant degrees of freedom from such data. We propose a systematic and efficient approach based on a recently developed signal processing theory called compressive sensing (CS). CS is a framework for recovering sparsely constructed signals from undersampled data and is suitable for extracting information about the PRC from finite but high-dimensional experimental measurements. We illustrate how the CS algorithm can be translated into an estimation scheme and demonstrate that our CS method produces good estimates of the PRCs with simulated and experimental data, especially when the data size is so small that simple approaches such as naive averaging fail. We also systematically analyzed the tradeoff between degrees of freedom and goodness of fit, which helps clarify which part of the data has the most predictive power. Our results illustrate that the finite size of neuroscientific data, compounded by their large dimensionality, can hamper studies of the neural code, and they suggest that CS is a good tool for overcoming this challenge.

- phase-response curve
- compressive sensing
- neural coding
- small data size
- cerebellar Golgi cell

The phase-response curve (PRC) characterizes how an oscillation is influenced by external input delivered at a particular phase and is one of the most important tools for analyzing biological oscillators, such as neurons and neural ensembles (Achuthan et al. 2011; Netoff et al. 2005; Oprisan et al. 2004; Perkel et al. 1964; Pinsker 1977; Preyer and Butera 2005; Tateno and Robinson 2007; Winfree 2001). The key part of this characterization is the infinitesimal phase-response curve (IPRC or simply PRC), which is the linear kernel relating an infinitesimal and instantaneous stimulus to the phase (Achuthan et al. 2011; Preyer and Butera 2005). At the level of a single neuron, the PRC characterizes the intrinsic properties of the neuron and determines how its spiking pattern will change with input. For example, whether the PRC is monophasic or biphasic indicates whether the neuron is an integrator or resonator based on its excitability class (Ermentrout 1996; Gutkin et al. 2005; Hansel et al. 1995; Izhikevich 2010; Moehlis et al. 2006; Phoka et al. 2010; Stiefel et al. 2008). Beyond the single cell level, the PRC has been useful for predicting population behavior such as synchrony. In a weakly coupled neural network, the PRCs of the individual neurons determine the interactions, and many aspects of the network dynamics can thereby be inferred (Ermentrout et al. 2001; Hansel et al. 1995; Oprisan et al. 2004; Smeal et al. 2010; Teramae and Fukai 2008). The PRC has also been used to show how synchrony is generated within a population of uncoupled oscillators receiving common noise input (Abouzeid and Ermentrout 2009; Goldobin and Pikovsky 2005; Marella and Ermentrout 2008; Nakao et al. 2005; Teramae and Tanaka 2004).

Therefore, it is important to determine the PRCs of neurons, particularly using experimental data, to predict their population/network behavior. This can be done by measuring the phase shifts with respect to probe stimuli, and several estimation methods have been proposed (reviewed in Torben-Nielsen et al. 2010). Because the PRC is a function of phase and therefore also of time, its determination with sufficient temporal resolution requires the stimuli to contain high frequency components as explained by the standard time/frequency uncertainty principle (Pinsky 2009). Indeed, in all of the existing methods, the probe stimuli are given as short pulses or continuous noise, which have strong power at high frequencies. However, the time scales of the PRC are caused by neuronal dynamics that reside in a lower frequency range. Therefore, a priori, the degrees of freedom, or dimensionality, of the data (probe stimuli) are much larger than those actually constituting the PRC, and this becomes one of the main challenges in PRC estimation. Additional problems arise in practice when the firing is not perfectly regular and the period has to be estimated, rather than directly measured, together with the phase shifts (Phoka et al. 2010).

In principle, if one has infinitely large data, the extra irrelevant degrees of freedom can simply be averaged out. However, the required number of samples rapidly grows with the data dimensionality, and in practice this scheme is not feasible in most experimental settings. A possible solution is to describe the candidate PRC in terms of only a few arbitrarily selected low frequency components, which reduces the data dimensionality by hand, and to find the PRC by minimizing an error function (Achuthan et al. 2011; Galán et al. 2005; Izhikevich 2010; Preyer and Butera 2005; Torben-Nielsen et al. 2010). Also, as this problem is a typical case of the curse of dimensionality (Bishop 2007), a Bayesian estimation method assuming a smooth PRC distribution as prior has been proposed (Nakae et al. 2010).

In this study, we propose a more systematic and efficient approach based on recent progress in signal processing theory called compressive sensing (CS; Baraniuk 2007; Candés and Tao 2005, 2006; Donoho 2006) (see also Ganguly and Sompolinsky 2012). CS provides near-optimal recovery algorithms for signals from a small number of samplings when the signal is sparsely constructed, i.e., composed of far fewer degrees of freedom than the measured samples. Because our main goal is to efficiently discover a small number of frequency components relevant to the PRC out of the much wider stimulus spectrum, we demonstrate that our problem can be translated into the sparse signal recovery theory covered by the CS framework. We also show that our CS-based estimation method works successfully with simulated data from two different conductance-based models and experimental data recorded from cerebellar Golgi neurons (GoCs).

## MATERIALS AND METHODS

### Neuron Models and Simulation Procedures

The first neuron model is based on Morris-Lecar (ML) with the following kinetics (Morris and Lecar 1981):

*C* d*V*/d*t* = [*I*_{0} + *I*_{ext}(*t*)]/*A* − *g*_{Leak}(*V* − *E*_{Leak}) − *g*_{Na}*m*_{∞}(*V*)(*V* − *E*_{Na}) − *g*_{K}*w*(*V* − *E*_{K}),

d*w*/d*t* = ϕ[*w*_{∞}(*V*) − *w*]cosh[(*V* − β_{w})/(2γ_{w})],

where *m*_{∞}(*V*) = {1 + tanh[(*V* − β_{m})/γ_{m}]}/2 and *w*_{∞}(*V*) = {1 + tanh[(*V* − β_{w})/γ_{w}]}/2, with *g*_{Leak} = 2 mS/cm^{2}, *g*_{Na} = 20 mS/cm^{2}, *g*_{K} = 20 mS/cm^{2}, *C* = 2 μF/cm^{2}, *E*_{K} = −100 mV, *E*_{Leak} = −70 mV, *E*_{Na} = 50 mV, ϕ = 0.15, β_{m} = −1.2 mV, γ_{m} = 18 mV, β_{w} = 0 mV, and γ_{w} = 10 mV. The cell surface area is *A* = 100 μm^{2}, and we set *I*_{0} = 369 pA so that the mean interspike interval (ISI) size is *T*_{0} = 50.2 ms.
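As a concrete reference for these kinetics, the following is a minimal forward-Euler sketch of the ML equations with the parameter values above (Python standing in for our NEURON/MATLAB code; the integration step, duration, and spike threshold are illustrative choices, not the values used in the study):

```python
import numpy as np

# Minimal Euler sketch of the Morris-Lecar kinetics with the parameters above.
# Units: mV, ms, mS/cm^2, uF/cm^2; currents are densities in uA/cm^2.
g_L, g_Na, g_K = 2.0, 20.0, 20.0       # mS/cm^2
C = 2.0                                 # uF/cm^2
E_K, E_L, E_Na = -100.0, -70.0, 50.0    # mV
phi, beta_m, gamma_m = 0.15, -1.2, 18.0
beta_w, gamma_w = 0.0, 10.0
A_cm2 = 100e-8                          # 100 um^2 expressed in cm^2
I0 = 369e-6 / A_cm2                     # 369 pA converted to uA/cm^2

def m_inf(V): return 0.5 * (1.0 + np.tanh((V - beta_m) / gamma_m))
def w_inf(V): return 0.5 * (1.0 + np.tanh((V - beta_w) / gamma_w))

def simulate(T=300.0, dt=0.005, V0=-70.0, w0=0.0):
    n = int(T / dt)
    V, w = V0, w0
    Vs = np.empty(n)
    for k in range(n):
        dV = (I0 - g_L*(V - E_L) - g_Na*m_inf(V)*(V - E_Na)
              - g_K*w*(V - E_K)) / C
        dw = phi * (w_inf(V) - w) * np.cosh((V - beta_w) / (2.0*gamma_w))
        V, w = V + dt*dV, w + dt*dw
        Vs[k] = V
    return Vs

Vs = simulate()
# Upward crossings of -20 mV mark spikes.
spikes = int(np.sum((Vs[:-1] < -20.0) & (Vs[1:] >= -20.0)))
```

With this constant drive the model fires repetitively; in the study, the NEURON platform was used for the actual simulations.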

The second one is the standard Hodgkin-Huxley (HH) model (Hodgkin and Huxley 1952):

*C* d*V*/d*t* = [*I*_{0} + *I*_{ext}(*t*)]/*A* − *g*_{Leak}(*V* − *E*_{Leak}) − *g*_{Na}*m*^{3}*h*(*V* − *E*_{Na}) − *g*_{K}*n*^{4}(*V* − *E*_{K}),

with the standard first-order kinetics for the gating variables *m*, *h*, and *n*, and *g*_{Leak} = 0.3 mS/cm^{2}, *g*_{Na} = 120 mS/cm^{2}, *g*_{K} = 36 mS/cm^{2}, *C* = 1 μF/cm^{2}, *E*_{K} = −100 mV, *E*_{Leak} = −70 mV, and *E*_{Na} = 50 mV. The surface area is again *A* = 100 μm^{2}, and *I*_{0} = 73 pA, which made the mean ISI size *T*_{0} = 16.8 ms.

The external input was given by *I*_{ext}(*t*) = σξ(*t*), where ξ is Gaussian white noise with zero mean and unit variance. We used σ = 1.6 and 7 pA, with ξ updated at 10 and 20 kHz, for the ML and HH model, respectively.

All the simulations were done using the NEURON simulation platform (Hines and Carnevale 1997).

### Experimental Procedures

All experiments were performed according to animal procedures approved by the Ethical Committee of the University of Antwerp, in accordance with the Federal Laws of Belgium and the European Committee. Parasagittal cerebellar slices were prepared from Wistar rats (aged 15–22 days) according to the procedures described in Robberechts et al. (2010). During recording, the slices were held between nylon nets in a submerged chamber on an upright microscope (Leica DM LFS) and continuously perfused at near physiological temperature (32–34°C) with artificial cerebrospinal fluid at 2–3 ml per minute using a Minipulse peristaltic pump (Gilson). Synaptic transmission was blocked by addition of 10 μM SR95531 [6-imino-3-(4-methoxyphenyl)-1(6H)-pyridazinebutanoic acid hydrobromide] (gabazine), a GABA_{A} receptor antagonist; 1 μM strychnine, a glycine receptor antagonist; 10 μM 2,3-dihydroxy-6-nitro-7-sulfamoyl-benzo(*f*)quinoxaline, an AMPA receptor antagonist; and 50 μM APV, an *N*-methyl-d-aspartate receptor antagonist, in the artificial cerebral spinal fluid.

Whole cell patch-clamp recordings were obtained using glass pipettes of 2–3 MΩ resistance. The recordings were performed with a potassium gluconate-based intracellular solution containing the following (in mM): 135 potassium gluconate, 10 KCl, 10 HEPES, 0.1 EGTA, 4 Mg-ATP, 0.4 Na_{3}GTP, and 14 Na_{2}-phosphocreatine, with pH 7.25–7.3 titrated with KOH and supplemented with 0.2% biocytin. The liquid junction potential was 10 mV (Neher 1992) but was not corrected for. The pipette capacitance was compensated for once the gigaseal was obtained, just before entry to the cell, in the cell-attached configuration. Seal resistance was ≥2 GΩ.

Then, whole cell recordings were performed with a Heka (Lambrecht, Germany) EPC 10 amplifier. Initially the GoCs were voltage clamped at −70 mV to let the cell stabilize for 5–10 min before switching to current clamp. In current clamp, no holding current was injected; passive cell parameters such as membrane potential (*V*_{m}) and input resistance (*R*_{input}) were monitored at regular time intervals, and recordings were discontinued if significant changes in these parameters occurred. Active cell parameters, such as the FI-curve, the amount of adaptation, and the reliability of the cell response to frozen noise injection, were also routinely tested. The continuous noise stimulus was created via an Ornstein-Uhlenbeck process, τ d*I*/d*t* = −(*I* − μ) + σ√(2τ) ξ(*t*), with mean μ, time constant τ, and stationary variance σ^{2}. The stimulus was injected somatically into the GoCs, typically during periods of 40 up to 200 s depending on the spiking frequency. When injecting small currents, the noise injection was continued longer to obtain ≥500 spikes for the PRC analysis. In between each noise stimulation protocol, the cell was left quiet for ≥2 min. In general, the amplitude of current injection was varied to obtain 5–10 Hz firing responses, and on top of this the variance σ^{2} was varied. The time constant τ = 4 ms is in the range of the time constants of synaptic currents. Signals were captured at 20 kHz and low-pass filtered at 2.5 kHz.
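An OU stimulus with a given τ and stationary variance σ^{2} can be generated with the exact discrete-time update below (a Python sketch with illustrative amplitude values, not our stimulus-generation code):

```python
import numpy as np

# Exact discretization of an Ornstein-Uhlenbeck process: each step decays
# toward the mean mu with time constant tau and injects just enough noise
# that the stationary variance is sigma^2.
def ou_noise(n, dt, tau, mu, sigma, seed=0):
    rng = np.random.default_rng(seed)
    a = np.exp(-dt / tau)
    s = sigma * np.sqrt(1.0 - a * a)   # per-step noise amplitude
    x = np.empty(n)
    x[0] = mu
    for k in range(1, n):
        x[k] = mu + a * (x[k - 1] - mu) + s * rng.standard_normal()
    return x

# Example: 20 kHz sampling (dt = 0.05 ms) and tau = 4 ms, as in the recordings;
# mu and sigma (in pA) are made-up illustration values.
x = ou_noise(n=200_000, dt=0.05, tau=4.0, mu=100.0, sigma=20.0)
```

The sample standard deviation of a long trace converges to σ, which is a quick sanity check on the update rule.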

### PRC Estimation Procedures

#### Estimation procedure.

The first step of the estimation procedure was to collect our relevant variables: all the ISI sizes, {*T*_{i}}, and the stimulus segments within the corresponding ISIs, {**x**_{i}} (Fig. 1, *A* and *B*). Then, these variables were transformed into the instantaneous firing rate change (IFRC) {*r*_{i}} and the first *K* Fourier series coefficients of the stimuli, arranged in a measurement matrix **Φ** (Fig. 1*B*), where the rows of the matrix **u** are the Fourier mode vectors, (**u**_{k})_{j} = cos(2π*kj*/*L*) or sin(2π*kj*/*L*) (*Eq. 2*), and *L* is the length of the **x**_{i}'s.

First of all, we computed the IFRC for each *i* as *r*_{i} = (*T*_{0} − *T*_{i})/*T*_{i}, with *T*_{0} given by the mean ISI width. Ideally, *T*_{0} should be the ISI size without any perturbation. However, in the continuous stimulus paradigm that we used, many slow dynamical components can be affected and significantly change membrane dynamics across multiple ISIs. Therefore, using *T*_{0} from the unperturbed data led to consistently and significantly worse performance except for the simulations. In one test case, we also set *T*_{0} as a free parameter and tried to find the optimal *T*_{0}, but the result was not significantly different from using the mean ISI size.
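In code, this step is just a difference of spike times (a minimal Python sketch; the spike times here are made up for illustration):

```python
import numpy as np

# Compute the IFRC r_i = (T0 - T_i)/T_i from a spike-time sequence,
# with T0 taken as the mean ISI, as described above.
def ifrc(spike_times):
    T = np.diff(spike_times)   # ISI sizes T_i
    T0 = T.mean()              # reference period
    return (T0 - T) / T

r = ifrc(np.array([0.0, 50.0, 101.0, 149.0]))  # toy spike times (ms)
```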

To compute **Φ**, we resampled **x**_{i} for each *i* by using the discrete Fourier transformation (DFT) so that all of them have the same length *L* = 401: if the length of **x**_{i} is *L*_{i} > *L*, only the first *L* DFT coefficients were taken, and, in the other case, *L* − *L*_{i} DFT coefficients were set to zero. Each resampled **x**_{i} was then rescaled by *L*/*L*_{i} so that its Fourier series coefficients do not change after resampling despite the change in length. Then, these resampled **x**_{i} were arranged into a matrix *X*_{ij} = (**x**_{i})_{j}, and finally we obtained the measurement matrix by **Φ** = **Xu**^{T}. In principle, we need *K* = *L* for **u** to form a basis that can span all the possible stimuli, but this is not really necessary since we are only interested in the low frequency part, and therefore we used *K* = 201 (401) for the simulated (experimental) data.
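The resampling step can be sketched as follows (a numpy illustration of Fourier-domain resampling to a common length, not our exact MATLAB code; the rescaling convention shown is one way to preserve the Fourier series coefficients):

```python
import numpy as np

# Fourier-domain resampling of a stimulus segment to a common length L.
# Keeping only the low-frequency DFT coefficients (zero-padding if the
# segment is short) and rescaling by L/L_i leaves the Fourier series
# coefficients of the underlying signal unchanged.
def fourier_resample(x, L):
    Xf = np.fft.rfft(x)
    Yf = np.zeros(L // 2 + 1, dtype=complex)
    m = min(len(Xf), len(Yf))
    Yf[:m] = Xf[:m]
    return np.fft.irfft(Yf, L) * (L / len(x))

L = 401
t_old = np.arange(500) / 500.0
x = np.cos(2 * np.pi * 2 * t_old)   # two cycles sampled at 500 points
y = fourier_resample(x, L)          # the same two cycles at 401 points

# Projecting onto the matching Fourier mode recovers the coefficient:
u2 = np.cos(2 * np.pi * 2 * np.arange(L) / L)
c2 = 2.0 / L * (y @ u2)             # close to 1.0 for a unit cosine
```

Rows of the resampled stimuli then form the matrix **X**, and **Φ** follows from a matrix product with the Fourier modes.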

Finally, we estimated the Fourier series coefficients of the PRC, **Δ̃**, from **r** and **Φ** by using optimization packages that provide routines for all the CS schemes that we used. For the basis pursuit (BP) algorithm (*Eq. 14*) that we applied to the noiseless ML model, we used *ℓ*_{1}-MAGIC (http://users.ece.gatech.edu/∼justin/l1magic/). For the ML model with background noise, the HH model, and the Golgi cell, we used the Dantzig selector algorithm (*Eq. 15*) via the *ℓ*_{1}-Homotopy package (Salman and Romberg 2010). The PRC in the time-domain representation could be obtained by **Δ** = **u**^{T}**Δ̃**. Our estimation code is available for public download at http://groups.oist.jp/cnu/software.

#### Cross-validation.

If our CS scheme is the Dantzig selector (*Eq. 15*), the error bound η should be specified. The η controls the sparseness of the estimated PRC and consequently also determines how much the estimation overfits to the data noise. Therefore, η should be small enough to fit the given data but also large enough to avoid poor fitness when the estimation is tested with a new data set. To find the optimal η, we employed the *k*-fold cross-validation method (Hastie et al. 2009): the whole data set of *N* ISIs was randomly segmented into *k* blocks, each with [*N*/*k*] ISIs. We obtained the estimated PRC **Δ̃**(*i*,η) with a certain η and a training data set formed by leaving only the *i*-th block out from the whole set. Then, we obtained the cross-validation error χ_{i}(η), which is the sum of squared errors (SSE; *Eq. 9*) between the actual IFRC **r** and the predictions by **Δ̃**(*i*,η) for the *i*-th block data. After repeating this for a sufficiently large number of blocks *k*′, we obtained the average cross-validation error and the optimal η = η* as χ(η) = (1/*k*′)∑_{i=1}^{*k*′} χ_{i}(η) and η* = argmin_{η} χ(η). We used *k* = *k*′ = 100, which yielded a consistent η* value in each case. In finding the minimum of χ(η), we employed a simple brute-force approach: we evaluated χ(η) within an interval [η_{min}, η_{max}] at 40 sample points and selected the η with the smallest value of χ(η). The η_{min,max} were determined by a few test runs for each data set, and the typical values were η_{min} ≈ 5 and η_{max} ≈ 40.
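The brute-force grid search can be sketched as below; here a ridge-regularized least-squares fit stands in for the Dantzig selector (which needs an ℓ1 solver), since the cross-validation bookkeeping is identical: fit on the held-in blocks, score the held-out block, and take the grid point with the smallest average error. All dimensions and noise levels are made-up illustration values.

```python
import numpy as np

# k-fold cross-validation over a regularization grid (ridge stands in
# for the Dantzig selector; the CV logic is the same).
rng = np.random.default_rng(1)
N, L = 200, 10
Phi = rng.standard_normal((N, L))
delta_true = rng.standard_normal(L)
r = Phi @ delta_true + 0.1 * rng.standard_normal(N)

def fit(Phi, r, lam):  # ridge solution as a placeholder solver
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ r)

k = 10
blocks = np.array_split(rng.permutation(N), k)
grid = np.logspace(-4, 2, 10)
chi = np.zeros(len(grid))
for g, lam in enumerate(grid):
    for held in blocks:                     # leave one block out
        train = np.setdiff1d(np.arange(N), held)
        d = fit(Phi[train], r[train], lam)
        chi[g] += np.sum((r[held] - Phi[held] @ d) ** 2)
chi /= k                                    # average CV error per grid point
lam_star = grid[np.argmin(chi)]             # analog of eta*
```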

#### Inclusion of uncontrolled noise and CS-total least-squares estimation.

To investigate the effect of uncontrolled noisy fluctuations of the ISI, we simulated the ML model in the same way as described above, except that we injected an additional Gaussian white noise current with zero mean. Its standard deviation was varied from 0 to 1.6 pA in steps of 0.32 pA. When the noise level was very high, the CS estimation showed a systematic deviation from the no-noise case. See *Effects of the uncontrolled noise* for our discussion of this point.

To resolve this problem, we augmented our CS-based estimation procedure by a pruning step followed by total least-squares (TLS) estimation: first, we computed **Δ̃** by using the Dantzig selector with a particular η and found a set of significant modes *S*(η) = {*j* | Δ̃_{j} is significant}, whose coefficients have significantly larger amplitudes than the rest, by maximizing min_{j∈S(η)} |Δ̃_{j}| − max_{j∉S(η)} |Δ̃_{j}|. Then, we constructed the "reduced stimuli" **Φ**′ from Φ_{ij} only for *j* ∈ *S*(η) and carried out the TLS estimation by using singular value decomposition (Markovsky and Huffel 2007). The cross-validation was performed in the same way except that χ_{i}(η) is given by the Rayleigh quotient, χ_{i}(η) = ‖**Φ**′**Δ̃**′ − **r**‖^{2}/(1 + ‖**Δ̃**′‖^{2}).


Phoka et al. (2010) recently discussed an interesting phenomenon in which noise in the spike times leads to a systematic bias in the estimated PRC, particularly in the later phase. We did not observe a similar bias in our simulated or experimental data and did not try to prevent it. However, in principle, we can address this problem in the continuous stimulus paradigm in a similar way as Phoka et al. (2010) by using corrections from multiple ISIs: we first make a stimulus-IFRC pair from every two consecutive ISIs, such as {*T*_{i} + *T*_{i+1}} and {[**x**_{i}, **x**_{i+1}]}, and estimate the PRC for these two ISIs as [**Δ**;**Δ**]. In this way, we can eliminate the effect of the spike time uncertainty of the spike between the *i*-th and (*i* + 1)-th ISI. We can continue this step with three and more consecutive ISIs, and the average of all estimated PRCs is our final estimate.

#### Computation of the theoretical PRC.

The theoretical PRC can be computed by solving the adjoints of the original dynamical equations (Izhikevich 2010). We computed the theoretical PRCs, in Figs. 2*B* and 3*B*, using the XPPAUT software (Ermentrout 2002), which implements a numerical algorithm to calculate the adjoint.

### Statistical Analysis

We computed the predictive power of each estimation method by evaluating the coefficient of determination, *R*^{2} = 1 − ∑_{i}(*r*_{i} − *s*_{i})^{2}/∑_{i}(*r*_{i} − *r̄*)^{2}, where *r*_{i} and *s*_{i} are the measured and predicted IFRC for each ISI. If *s*_{i} is obtained by proper regression ensuring *s̄* ≈ *r̄*, *R*^{2} is simply the squared correlation coefficient between {*r*_{i}} and {*s*_{i}}. However, if the prediction gets as bad as *s̄* ≠ *r̄*, *R*^{2} can be negative despite its notation.
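A direct transcription (Python, with toy numbers) showing how *R*^{2} can go negative when the predicted mean is off:

```python
import numpy as np

# Coefficient of determination R^2 = 1 - SS_res / SS_tot. Despite the
# notation it is not a squared quantity and can be negative for a
# predictor worse than the constant mean.
def r_squared(r, s):
    r, s = np.asarray(r, float), np.asarray(s, float)
    ss_res = np.sum((r - s) ** 2)
    ss_tot = np.sum((r - r.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

good = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect prediction: 1.0
bad = r_squared([1.0, 2.0, 3.0], [5.0, 5.0, 5.0])   # wrong mean: negative
```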

If the estimated PRC captures most of the linear relationship between the stimuli and the measured IFRC **r**_{meas}, the predictions from the PRC, **r**_{PRC}, should not be significantly correlated with the residuals δ**r** = **r**_{meas} − **r**_{PRC}. We evaluated the significance of the correlation ρ between **r**_{PRC} and δ**r** by a bootstrap test: we first constructed resampled samples {**r**_{PRC}^{(resampled)}, δ**r**^{(resampled)}}_{i} (*i* = 1, …, 1,000) and computed their correlations ρ_{i}. Then, a bootstrap *t*-score, given by the mean of the ρ_{i} divided by their standard deviation, was evaluated to obtain the *P* value.
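This residual-correlation test can be sketched as below (Python; the two-sided normal-approximation *P* value is one reasonable choice, not necessarily our exact convention, and the data are synthetic):

```python
import numpy as np
from math import erfc, sqrt

# Bootstrap test of the correlation between predictions and residuals:
# resample (prediction, residual) pairs with replacement, recompute the
# correlation each time, and form a t-score from the bootstrap sample.
def bootstrap_corr_test(pred, resid, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(pred)
    rhos = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        rhos[b] = np.corrcoef(pred[idx], resid[idx])[0, 1]
    t = rhos.mean() / rhos.std()
    p = erfc(abs(t) / sqrt(2.0))   # two-sided normal approximation
    return t, p

rng = np.random.default_rng(2)
pred = rng.standard_normal(200)
resid = 0.8 * pred + 0.2 * rng.standard_normal(200)  # strongly correlated
t, p = bootstrap_corr_test(pred, resid)               # small p expected
```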

All the preprocessing of the data, PRC estimation, cross-validation, and statistical analysis were done in MATLAB (The MathWorks, Natick, MA).

## RESULTS

### Estimation of the PRC as a Linear Regression Problem

We consider a neuron receiving a time-dependent stimulus *x*(*t*), which makes the ISI *T* deviate from the original ISI *T*_{0}. The oscillation period begins at *t* = 0 and finishes at *t* = *T* (Fig. 1*A*), and the phase of this oscillation, θ(*t*), increases monotonically from θ(0) = 0 to θ(*T*) = 1. Then, in the limit that *x*(*t*) is weak, θ(*t*) and *x*(*t*) are related by the infinitesimal PRC Δ(θ) as

dθ/d*t* = 1/*T*_{0} + Δ(θ)*x*(*t*).   (*Eq. 5*)

Our goal is to estimate Δ(θ) from the {*T*_{i}} and {*x*_{i}(*t*)} that we collected from each *i*-th ISI.

A key step to achieve this is that *Eq. 5* can be approximately recast into a linear regression problem: we can assume that *x*(*t*) and the induced change in θ(*t*) are sufficiently small that we can take a linear approximation of *Eq. 5*, in which the higher order fluctuations in Δ(θ) due to the first order perturbations in θ are ignored. In practice, we use discretized representations such as *x*_{i}(*t*) → **x**_{i} = [*x*_{i}(0), *x*_{i}(δτ), *x*_{i}(2δτ), … ] and similarly Δ(θ) → **Δ**. Then, when the IFRC for each ISI is *r*_{i} = (*T*_{0} − *T*_{i})/*T*_{i}, we can show that the linear approximation leads to (see appendix for derivation)

*r*_{i} ≈ δτ ∑_{j} (**x**_{i})_{j}Δ_{j} = δτ **x**_{i}·**Δ**.   (*Eq. 6*)

Note that the IFRC is approximately proportional to the ISI fluctuation as *r*_{i} = (*T*_{0} − *T*_{i})/*T*_{i} ≈ (*T*_{0} − *T*_{i})/*T*_{0} + *O*[(*T*_{0} − *T*_{i})^{2}] (Achuthan et al. 2011).

The second key observation is that we can take alternative representations for **Δ** and **x**_{i} in *Eq. 6* by a simple basis change. In this study, we will focus on representing the PRC in terms of the finite Fourier series, **Δ** = **u**^{T}**Δ̃** (*Eq. 7*). With *Eq. 6*, the equation for **Δ̃** is simply **r** = **ΦΔ̃** (*Eq. 8*), where **u** is the matrix composed of the Fourier modes given by *Eq. 2*. Here **Δ** and **Δ̃** essentially refer to the same object, but, to prevent confusion, we will call **Δ̃** a PRC coefficient vector or PRC coefficients, while PRC will refer only to its time-domain representation **Δ**.

#### Relation to some existing methods.

A direct approach for solving a problem like *Eq. 8* is to minimize the sum of squared errors (SSE), *E*_{ℓ2} = ‖**r** − **ΦΔ̃**‖_{2}^{2} (*Eq. 9*), whose minimizer is the least-squares (LS) solution **Δ̃**_{LS} = (**Φ**^{T}**Φ**)^{−1}**Φ**^{T}**r** (*Eq. 10*). Here **Φ**^{T}**r** is the sum of the stimuli, each weighted by *r*_{i}, and **Φ**^{T}**Φ** corresponds to the stimulus-stimulus correlation across the ISIs. This means that the best-fitting estimate of the PRC coefficients is always expressed in the form of a weighted average of the stimuli normalized by the stimulus correlation factor.

Therefore, if there is no prior inter-ISI stimulus correlation and the number of ISIs *N* is so large that chance correlations (due to random fluctuations) are suppressed, **Φ**^{T}**Φ** provides a trivial normalization and the PRC estimate can be expressed in a simple form. For example, let us consider PRC estimation in the time domain (i.e., **Φ** = **X** and **Δ̃** = **Δ**) when the stimuli **x**_{i} are Gaussian white noise with zero mean and variance σ^{2}. When the number of ISIs *N* is very large, we find **Φ**^{T}**Φ** ≈ *N*σ^{2}**1**. In this limit, the LS estimate of **Δ** via *Eq. 10* is

**Δ** ≈ (1/*N*σ^{2}) ∑_{i} *r*_{i}**x**_{i},   (*Eq. 11*)

which is the weighted spike-triggered average (WSTA) of the stimuli, obtained here without explicitly solving *Eq. 10*.
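This large-*N* limit is easy to check numerically; below is a sketch (synthetic Gaussian stimuli, noiseless responses, made-up dimensions) comparing the WSTA of *Eq. 11* with the full LS solution of *Eq. 10*:

```python
import numpy as np

# With Gaussian white-noise stimuli and large N, Phi^T Phi ~ N sigma^2 I,
# so the LS solution reduces to the weighted spike-triggered average.
rng = np.random.default_rng(3)
N, L, sigma = 5000, 50, 1.0
X = sigma * rng.standard_normal((N, L))      # stimuli (Phi = X here)
delta = np.zeros(L)
delta[[3, 7, 12]] = [1.0, -0.5, 0.8]         # toy "PRC" with 3 active bins
r = X @ delta                                 # noiseless responses

delta_wsta = X.T @ r / (N * sigma**2)         # Eq. 11 (WSTA)
delta_ls, *_ = np.linalg.lstsq(X, r, rcond=None)  # Eq. 10 (full LS)
```

The LS solution is exact here, while the WSTA differs only by chance-correlation terms of order √(1/*N*).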

In another example, Galán et al. (2005) proposed a method based on pulse stimuli and the Fourier series representation as in *Eq. 7*. Here the pulse stimuli are delivered randomly, say at τ_{i} for each ISI, and the ISI size *T*_{i} is predicted from a candidate PRC, which is presumably a linear combination of only a few Fourier basis vectors with low frequencies. The PRC coefficients are then estimated by minimizing the sum of the squared errors between the predicted and measured *T*_{i}. This method is effectively equivalent to minimizing *E*_{ℓ2} when **Φ** = **Xu**^{T} is constructed with the pulse stimuli *x*_{i}(τ) = σδ(τ − τ_{i}) and the selected Fourier mode vectors {**u**_{k}} (*k* = 1, …, *K*) (from *Eq. 2*). Then, when *N* is large and τ_{i} is uniformly distributed within each ISI, we find **Φ**^{T}**Φ** ≈ *N*ρ^{2}σ^{2}**1**, where ρ is the pulse density within a time bin [τ, τ + δτ], and therefore

Δ̃_{k} ≈ (1/*N*ρ^{2}σ) ∑_{i} *r*_{i}(**u**_{k})_{j_{i}},   (*Eq. 12*)

where *j*_{i} is the index corresponding to τ_{i}, *j*_{i} = [τ_{i}/δτ]. Again, this is nothing but a weighted average of the Fourier modes evaluated at the pulse times.

Izhikevich (2010) proposed using a similar scheme as the Galán et al. method for continuously fluctuating noise stimuli. This method is equivalent to minimizing *E*_{ℓ2}, where **Φ** is constructed with noisy stimuli and *K* randomly selected Fourier basis vectors. A least-squares optimization in this case seems to converge with much more difficulty than the Galán et al. method (Torben-Nielsen et al. 2010), but nevertheless a direct solution can be obtained via *Eq. 10*. When the stimuli are Gaussian white noise with zero mean and variance σ^{2} and *N* is large, we again find **Φ**^{T}**Φ** ≈ *N*σ^{2}**1**. This leads us to

**Δ̃** ≈ (1/*N*σ^{2}) **u**∑_{i} *r*_{i}**x**_{i}.   (*Eq. 13*)

Comparing this to *Eq. 11*, we can see that the result of the Izhikevich method is nothing but the WSTA projected onto the space spanned by the Fourier mode vectors {**u**_{k}}. Therefore, the Izhikevich method is simply equivalent to the WSTA estimate in this limit.

In the comparison study by Torben-Nielsen et al. (2010), these methods were classified according to whether the PRC is directly evaluated, such as WSTA, or inferred via minimizing the SSE, such as the Galán et al. method. However, we showed that this distinction is not necessary since SSE minimization can be solved directly by the “weighted average” solution, *Eq. 10*. Therefore, the direct estimation and optimization methods naturally share most of their pros and cons.

Instead, a more important difference between the direct estimation and optimization methods is whether the PRC is densely or sparsely constructed: for example, in the WSTA method, Δ_{k} for every time step *k* = 1, 2, …, *L* is calculated from the data by averaging. Conversely, the Galán et al. method assumes that the PRC can be represented by a Fourier series with *K* ⪡ *L*. Therefore, this PRC contains only a small number of nonzero PRC coefficients and is sparsely constructed. As we discussed in the Introduction, this is a reasonable assumption considering that the time scale/frequency information of Δ̃ is concentrated in a much lower range compared with the wide spectrum of **Φ**.

Searching specifically for a sparsely constructed PRC is important because it resolves many issues in estimation due to the small number of samples (ISIs) *N*. The first problem is feasibility: if the data size *N* is small, the data cannot give us the full information necessary to estimate the PRC correctly. For example, if *N* < *L*, the number of unknowns exceeds the number of equations and the estimation is simply infeasible. However, even if *N* ≥ *L*, the large number of parameters can introduce artifacts in the estimated PRC. The LS solution, *Eq. 10*, has an additional normalization by the stimulus autocorrelation **Φ**^{T}**Φ** and, as we have seen above, this is usually assumed to be trivial, particularly in the limit when *N* is very large and the estimated PRC becomes a weighted average of stimuli. However, if the stimulus has an intrinsic autocorrelation, the normalization is already nontrivial even at large *N*, and furthermore, when *N* is small, **Φ**^{T}**Φ** contains many (growing as ∼*L*^{2}) nonzero stimulus correlations due to random fluctuations (chance correlations). In both cases, the simple normalization scheme fails, and this causes artifacts in the PRC, such as a wrong amplitude.

This problem has been partially addressed by using the total stimulus autocorrelation, not stimulus variance, for normalizing the WSTA estimate (Ota et al. 2009). However, this ignores the off-diagonal correlations and can distort the estimated PRC as we will see in the examples. A more careful approach is to explicitly deconvolve the full stimulus correlation out from the candidate PRC (Achuthan et al. 2011; Preyer and Butera 2005) as our LS solution does by the factor of (**Φ**^{T}**Φ**)^{−1}. However, particularly when the data are small, there can be significant off-diagonal chance correlations, and trying to deconvolve these out can mostly contribute to fitting the PRC to the data noise rather than to the data signal, which is a general problem in any regression with a large number of parameters and small data size (Bishop 2007; Hastie et al. 2009).

Sparse estimation relies on a much smaller number of variables (*K* ⪡ *L*) and may not suffer from the same problems. However, in this case, we have a new challenge because we must determine how many and which basis vectors are to be used. In the Galán et al. and Izhikevich methods, this decision is somewhat arbitrarily based on random truncation or selection of the modes. In this study, we describe a much more systematic approach to solve this problem.

### CS Method

Once we assume the sparseness of the PRC, we note that its estimation can be translated into a problem in signal processing, called sparse signal recovery. Sparse signal recovery aims to reconstruct an original signal from data made by a smaller number of measurements, possibly contaminated with noise, by taking advantage of the prior knowledge that the signal itself can be constructed from a smaller number of basis vectors than the apparent dimension of the signal. This problem has exactly the same form as our PRC estimation equation *Eq. 8*, where **Δ̃** is the signal to be recovered, **Φ** is the measurement matrix, and **r** is the result of the measurement. Therefore, algorithms that can solve this problem can also be used for PRC estimation.

One strategy for sparse signal recovery is to test how well each basis vector can predict **r** from **Φ**, one by one, and exclude those whose contributions are insignificant (below a threshold) from our set of basis vectors. However, this matching pursuit (MP) strategy (Mallat and Zhang 1993) is known to be computationally demanding and inefficient (Natarajan 1995). A much more efficient alternative is BP (Chen et al. 1998),

**Δ̃** = argmin ‖**Δ̃**‖_{1} subject to **ΦΔ̃** = **r**,   (*Eq. 14*)

where ‖**x**‖_{1} = ∑_{i} |*x*_{i}|. One interpretation is that the BP provides "regularization," such that the data can be explained with a minimal number of parameters, since minimizing ‖**Δ̃**‖_{1} will make many Δ̃_{i} zero (Bishop 2007; Hastie et al. 2009). Recently, it has been proven that, if some conditions such as sparseness of the signal are met, the BP algorithm can achieve exact signal recovery even when the number of samples is as small as *N* ∼ *K* log *L*, and this strategy has been named CS (Candés and Tao 2005, 2006; Donoho 2006).
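As a self-contained illustration of this *N* < *L* regime, the sketch below recovers a 3-sparse vector from 40 Gaussian measurements in 64 dimensions using iterative soft thresholding (ISTA, an ℓ1 proximal method standing in for the BP/Dantzig solvers named in MATERIALS AND METHODS); plain least squares is underdetermined here. All sizes are illustrative.

```python
import numpy as np

# Sparse recovery by iterative soft thresholding (ISTA) for the lasso
# objective 0.5*||r - Phi d||^2 + lam*||d||_1, an l1 stand-in for BP.
rng = np.random.default_rng(4)
N, L, k = 40, 64, 3                         # N << L, k-sparse signal
Phi = rng.standard_normal((N, L)) / np.sqrt(N)
d_true = np.zeros(L)
d_true[[5, 20, 41]] = [1.0, -1.0, 0.7]
r = Phi @ d_true                            # noiseless measurements

def ista(Phi, r, lam=0.01, n_iter=2000):
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1/Lipschitz constant
    d = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        g = d + step * Phi.T @ (r - Phi @ d)   # gradient step
        d = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # shrink
    return d

d_hat = ista(Phi, r)
support = set(np.argsort(np.abs(d_hat))[-3:])  # indices of 3 largest modes
```

The recovered support matches the true one even though the system has fewer equations than unknowns, which is the essence of the CS guarantee quoted above.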

CS is currently an actively developing field, and many interesting variants of the original breakthrough have been proposed (Baraniuk 2007). In this study, we focus on two basic CS algorithms: in addition to the exact BP strategy (*Eq. 14*), we also used the Dantzig selector (Candés and Tao 2007),

**Δ̃** = argmin ‖**Δ̃**‖_{1} subject to ‖**Φ**^{T}(**r** − **ΦΔ̃**)‖_{∞} ≤ η,   (*Eq. 15*)

where ‖**x**‖_{∞} = max_{i} |*x*_{i}|. In the case of PRC estimation, this algorithm tries to minimize the distance between the WSTA and the estimated PRC convolved with the stimulus correlation, while also minimizing the number of nonzero PRC coefficients. Compared with the exact BP algorithm, the Dantzig selector performs better when the data are contaminated with Gaussian noise. In such cases, the exact BP algorithm, which a priori assumes that the recovered signal predicts the measurement results exactly, may fail to converge. Furthermore, the Dantzig selector also applies to cases where the signal is only approximately sparse, such that many coefficients actually remain nonzero but their magnitudes decay at least as fast as a power law, |Δ̃_{k}| ∼ 1/*k*^{n} for a positive integer *n*. This is important since a PRC can have this property: for example, it has been reported that cerebellar Purkinje neurons can have a square wave-like PRC, indicating near-perfect neuronal integration (Phoka et al. 2010), and in this case many Fourier coefficients will be nonzero but decay as |Δ̃_{k}| ∼ 1/*k*.

In our case, the exact BP successfully converged in only a few examples of the simulated neurons, and we used the Dantzig selector for the rest. We also tried another algorithm for noisy data called BP denoising (Candés et al. 2006b) for those examples, but we could not find any significant improvements over the Dantzig selector (data not shown).

Not every measurement matrix **Φ** can be used for CS recovery, and the precise conditions for good signal recovery seem rather complex (Donoho and Tanner 2005a,b). However, surprisingly, it has been proven that **Φ** satisfies the sufficiency condition if it is given randomly, such as by Gaussian noise (Baraniuk et al. 2008; Candés and Tao 2005, 2006). This can be achieved by using continuous Gaussian noise stimuli, and therefore we restrict our study to this case.

### Applications to Neuronal PRCs

#### Simulated data: ML and HH neurons.

We first show two examples of estimating the PRCs from the simulated data of two different model neurons, each representing a distinct type of neural excitability (Hodgkin 1948; Izhikevich 2010). The first one, the ML model (Morris and Lecar 1981) has type I excitability, and the other model, the HH model (Hodgkin and Huxley 1952) is type II. Type I and II neurons undergo saddle-node or Hopf bifurcations, respectively, and this difference is reflected in the shape of their PRCs. A type I neuron has a positive and monophasic PRC such that a depolarizing stimulus will always increase the instantaneous firing rate, while the PRC of the type II neuron is biphasic, characterized by a significant negative part (Ermentrout 1996). We simulated these models with continuously fluctuating current stimuli, collected the spiking data, and estimated the PRC. For basis vectors, we used the Fourier mode vectors from *Eq. 2*.

Figure 2 shows the results from the simulated data of the ML model with 300 ISIs. In Fig. 2*A*, we can see that the WSTA estimate has a poor signal-to-noise ratio, even worse than that of the LS estimation via *Eq. 10*. As we already discussed, these two estimations are essentially the same except that, in the WSTA, the normalization in *Eq. 10* is replaced by the large-*N* approximation **Φ**^{T}**Φ** ≈ *N*σ^{2}**1**, which causes the problem. On the other hand, the CS estimate via the BP algorithm (*Eq. 14*) shows a very good signal-to-noise ratio. For comparison, the three estimates indeed converge toward each other when the data size becomes much larger (Fig. 4*A*).

If the stimuli are assumed to be infinitesimally small, the PRC can be computed exactly from the dynamical equations of the system (Ermentrout 2002; Izhikevich 2010). This theoretical PRC matches well not only with the full CS estimate but also with the PRC reconstructed from only the three largest PRC coefficients (Fig. 2*B*). Indeed, in both the LS and CS estimates, most power is in those three modes (Fig. 2, *C* and *D*), which seem to be sufficient to predict the actual IFRC with good accuracy (Fig. 2, *E* and *F*). This confirms that our CS-based estimation method, built on the sparseness assumption of the PRC, is an effective strategy.

In the HH case, where the data consisted of only 200 ISIs, we obtained even more striking results. Here the signal-to-noise ratio of the LS estimate is only slightly better than that of the WSTA estimate, while the CS method (via the Dantzig selector) yields a well-denoised PRC (Fig. 3*A*). Again, not only the full estimate but also the PRC with only the nine largest coefficients matches well with the theoretical PRC (Fig. 3*B*), and the small differences among the estimates and the theoretical PRC vanish at very large data sizes (Fig. 4*B*).

As these results suggest, the LS estimate is not sparse: while most of the power of the CS estimate is concentrated in the first ∼8–9 modes, the LS estimate has many other coefficients just as large (Fig. 3, *C* and *D*). Figure 3, *E* and *F*, also shows that the CS estimates, even with only a few PRC coefficients, can be good predictors of the actual IFRCs in the data. Also note that *R*^{2} keeps improving significantly even beyond *K* = 4 (Fig. 2*E*), while *K* = 3–5 is often used when *K* is chosen by hand (for example, Torben-Nielsen et al. 2010).

In all cases, the LS estimate is necessarily the best predictor (*R*^{2} = 1) for the particular data set that the estimation is based on. However, this is simply due to fitting the data noise, as we discussed: on a different data set with 2,000 ISIs, as in Figs. 2*F* and 3*F*, the LS estimates actually perform worse (*R*^{2} = −1.40 in the HH case and *R*^{2} = 0.97 in the ML case; see materials and methods for when *R*^{2} becomes negative and how we interpret it). On the other hand, the CS estimate avoids this overfitting due to the imposed sparseness constraint.
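This overfitting behavior is easy to reproduce in a toy setting. The sketch below (hypothetical sizes and coefficients, not our actual data) fits an unregularized LS model with as many free coefficients as data samples: it scores *R*^{2} = 1 on its own data but generalizes poorly to a fresh, larger data set.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2; negative when predictions are worse than the data mean."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

rng = np.random.default_rng(1)
L = 40                                     # free coefficients (Fourier modes)
delta_true = np.zeros(L)
delta_true[:3] = [1.0, -0.5, 0.3]          # sparse "true PRC"
sigma_noise = 0.5

X_train = rng.standard_normal((L, L))      # as many samples as parameters
r_train = X_train @ delta_true + sigma_noise * rng.standard_normal(L)
X_test = rng.standard_normal((2000, L))    # fresh, larger data set
r_test = X_test @ delta_true + sigma_noise * rng.standard_normal(2000)

delta_ls = np.linalg.solve(X_train, r_train)    # LS interpolates exactly
print(r_squared(r_train, X_train @ delta_ls))   # 1.0: perfect on its own data
print(r_squared(r_test, X_test @ delta_ls))     # far lower, often negative
```

The perfect training score comes entirely from fitting the noise, which is exactly the failure mode the sparseness constraint prevents.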

#### Effects of the uncontrolled noise.

Real neurons are almost always under the influence of many noise sources, such as channel stochasticity and background synaptic activity, and it is therefore important to characterize how our PRC estimation performs at different noise levels.

For this purpose, we simulated the ML model with additional uncontrolled noise in the input and estimated the PRC. Figure 5*A* shows how the estimated PRC deviates from the theoretical PRC depending on the noise level and the number of ISIs. The CS method already performs much better than the WSTA estimation method (dotted line), but we often found that the estimated PRC has a significantly and systematically reduced amplitude compared with the theoretical one (Fig. 5*B*), particularly when the noise level is high. This seems to be because the algorithm minimizes ‖**Δ̃**‖_{1} and therefore, in the high-noise regime, it also shrinks nonzero PRC coefficients that contribute significantly.

To solve this problem, we added a “pruning” step after the CS estimation, where we identified which Fourier modes make significant contributions and reestimated the PRC by using only those modes, without minimizing ‖**Δ̃**‖_{1}. For the reestimation we could simply use the LS estimation, but instead we used the TLS method (Markovsky and Huffel 2007), which minimizes not the SSE but the Rayleigh quotient (RQ), RQ(**Δ̃**) = ‖**r** − **ΦΔ̃**‖^{2}/(1 + ‖**Δ̃**‖^{2}). The TLS method is designed for the situation where the measurement matrix itself is corrupted, **Φ** → **Φ** + **n**_{input}, where **n**_{input} represents the uncontrolled input noise, and this is exactly the typical problem where the TLS method applies. Figure 5*C* shows that our CS-TLS hybrid method gives much better estimations, without the amplitude problem, and the deviations diminish nicely with a larger number of ISIs even at higher noise levels. In general, we observed that this method can also correct the systematic deviation in the CS estimation for cases where the noise enters in different ways, such as by simply adding noise to the ISIs (data not shown).
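For readers who want to experiment, the TLS solution can be computed from the SVD of the augmented matrix [**Φ**, **r**]. The sketch below (toy data, hypothetical noise levels, not our recorded stimuli) illustrates the amplitude shrinkage of ordinary LS under input noise and how TLS reduces it:

```python
import numpy as np

def tls(Phi, r):
    """Total least squares via the SVD of the augmented matrix [Phi, r].

    The solution minimizes the Rayleigh quotient
    ||r - Phi @ d||^2 / (1 + ||d||^2), appropriate when Phi itself is
    noisy (errors-in-variables), unlike ordinary LS."""
    C = np.column_stack([Phi, r])
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    v = Vt[-1]                     # right singular vector of smallest s.v.
    return -v[:-1] / v[-1]

# Toy errors-in-variables problem: clean regressors are observed through
# additive input noise, which makes LS systematically shrink the amplitude.
rng = np.random.default_rng(2)
N, K = 4000, 3
d_true = np.array([1.0, -0.6, 0.3])            # "significant mode" weights
Phi_clean = rng.standard_normal((N, K))
noise = 0.3
Phi_obs = Phi_clean + noise * rng.standard_normal((N, K))
r = Phi_clean @ d_true + noise * rng.standard_normal(N)

d_ls = np.linalg.lstsq(Phi_obs, r, rcond=None)[0]
d_tls = tls(Phi_obs, r)
# LS underestimates the amplitude; TLS comes closer to the true weights.
print(np.linalg.norm(d_ls - d_true), np.linalg.norm(d_tls - d_true))
```

This mirrors the role of the TLS step in the CS-TLS hybrid: it is applied only to the few modes that survive pruning, where the regression is well determined.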

Note that, for example in Fig. 5*B*, it would be very difficult to tell which estimation is better based on measures such as the SSE, without knowing the theoretical PRC or the actual noise level: *R*^{2} ≈ 0.5 in both cases, and it would not even make sense to look for an estimation with higher *R*^{2} since the amplitudes of the input noise probe stimuli are the same in this case. Both in the CS and CS-TLS methods, what prevented those overfitting estimations was cross-validation, which emphasizes the importance of this step. Interestingly, we empirically observed that the cross-validation in the CS case can behave relatively worse at higher noise levels and pick a nonsparse PRC estimation, whereas the CS-TLS case consistently preferred sparse estimations. However, the CS-TLS cross-validation was sometimes more unstable, and we had to repeat the step several times with differently shuffled data. One possible reason is that, during the CS-TLS estimation, if a certain Δ̃_{i} becomes small after the CS step but still passes our significance criterion, Δ̃_{i} can grow back after the pruning step and contribute to an overfitting estimation. Therefore, future improvements could be made by using a more careful criterion for whether a particular Δ̃_{i} is significant or not.
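The role of cross-validation in selecting the sparseness parameter can be sketched as follows. This toy example uses iterative soft thresholding as a simple stand-in for the BP/Dantzig solvers used in the study; the grid of candidate parameters and the data sizes are hypothetical:

```python
import numpy as np

def ista(Phi, y, lam, n_iter=2000):
    """Iterative soft-thresholding for l1-penalized least squares
    (a minimal stand-in for the BP / Dantzig-selector solvers)."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2    # 1 / Lipschitz constant
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        g = x - step * (Phi.T @ (Phi @ x - y))  # gradient step on the SSE
        x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
    return x

def cv_sparseness(Phi, y, lambdas, n_folds=5):
    """Pick the sparseness parameter by cross-validation: fit on the
    training folds, score the summed squared error on the held-out fold."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    sse = []
    for lam in lambdas:
        err = 0.0
        for f in folds:
            train = np.setdiff1d(idx, f)
            x = ista(Phi[train], y[train], lam)
            err += np.sum((y[f] - Phi[f] @ x) ** 2)
        sse.append(err)
    return lambdas[int(np.argmin(sse))]

rng = np.random.default_rng(3)
L, N = 50, 30
x_true = np.zeros(L)
x_true[[1, 4, 7]] = [1.0, -0.8, 0.5]            # sparse "PRC" coefficients
Phi = rng.standard_normal((N, L))
y = Phi @ x_true + 0.1 * rng.standard_normal(N)

lam = cv_sparseness(Phi, y, [0.01, 0.1, 1.0, 10.0])
x_hat = ista(Phi, y, lam)
print("lambda:", lam, "largest mode:", int(np.argmax(np.abs(x_hat))))
```

The held-out error, not the training error, is what penalizes both overly dense (overfitting) and overly sparse (underfitting) solutions.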

#### Experimental data: cerebellar Golgi neuron.

We also tested our CS method with experimental data obtained from patch-clamp recordings of GoCs stimulated with a Gaussian noise current filtered with time constant τ = 4 ms.

Figure 6*A* shows the estimated PRCs by our CS method (via the Dantzig selector) from the data sets recorded from one cell with different standard deviations σ of the stimulus. The estimated PRC can vary with σ due to many factors influencing neuronal dynamics and the limitation that the PRC is a linear approximation (Ermentrout et al. 2007; Torben-Nielsen et al. 2010). However, such differences are relatively small in Fig. 6*A*: the three estimated PRCs largely overlap with each other, showing consistent triangular shapes. Interestingly, PRCs of GoCs with a similar shape have been observed in a different context, where stimuli were delivered via gap junctions and firing of another coupled GoC (Vervaeke et al. 2010).

On the other hand, each WSTA estimate shows a very different shape from the others (Fig. 6*B*). Here the amplitudes of the PRCs are normalized by the total stimulus autocorrelation (Ota et al. 2009), because normalization by the stimulus variance, as in *Eq. 11*, makes the PRCs larger by several orders of magnitude (up to 0.04 pA^{−1}·ms^{−1}). Using the total autocorrelation for normalization is equivalent to the approximation **Φ**^{T}**Φ** ≈ *N*σ̂^{2}**1**, where σ̂^{2} is the average of the total autocorrelations in the **Φ**^{T}**Φ** matrix, i.e., σ̂^{2} = *L*^{−1}∑_{i,j} (**Φ**^{T}**Φ**)_{ij}. Therefore, this scheme fixes the issue with the amplitude of the PRC but still ignores the effects of the off-diagonal terms in **Φ**^{T}**Φ**, which leads to the artifacts in the shape.
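The effect of replacing **Φ**^{T}**Φ** by a scalar multiple of the identity can be seen in a toy calculation (the averaging constant below is a simplified stand-in for σ̂^{2}; all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, sigma = 100, 20, 1.0
Phi = sigma * rng.standard_normal((N, L))     # stimulus measurement matrix
G = Phi.T @ Phi                               # autocorrelation matrix

delta_true = rng.standard_normal(L)           # a hypothetical dense "PRC"
r = Phi @ delta_true                          # noiseless ISI deviations

# Exact LS: invert G fully, off-diagonal terms included.
d_ls = np.linalg.solve(G, Phi.T @ r)
# WSTA-style estimate: replace G by N * sigma_hat^2 * identity, where
# sigma_hat^2 here averages the diagonal autocorrelations (a simplified
# stand-in for the normalization constant in the text).
sigma_hat2 = np.trace(G) / (N * L)
d_wsta = Phi.T @ r / (N * sigma_hat2)

print(np.linalg.norm(d_ls - delta_true))      # ~0: exact recovery
print(np.linalg.norm(d_wsta - delta_true))    # clearly nonzero: shape artifact
```

The scalar normalization gets the overall scale roughly right, but the neglected off-diagonal terms of **G** distort the shape, exactly the artifact described above.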

On the other hand, the LS estimates suffer from a completely different problem: here the measured IFRCs can be perfectly matched with the predictions (*R*^{2} = 1 in all cases), but the estimated PRCs are completely noisy and do not seem to reflect any intrinsic dynamics of the neuron at all (Fig. 6*C*). Again, this indicates that the LS estimate is fitting to the data noise.

The goodness-of-fits for the CS estimates are in a good range of *R*^{2} = 0.58–0.91 (Fig. 6*D*), and the residuals (= [measured IFRC] − [predicted IFRC]) are stable across different stimulus σ (Fig. 6*D*, *inset*). In comparison, the goodness-of-fits for the WSTA estimates seem only slightly worse (*R*^{2} = 0.58–0.78). However, a closer look reveals that the residuals from the WSTA estimates are significantly correlated with the predicted IFRC (*P* < 0.001, bootstrap test), except for σ = 12.5 pA, while those from the CS estimates are not. This indicates that the WSTA estimate can miss a significant part of the linear relationship between the IFRCs and stimuli.
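A surrogate-based test of this kind can be sketched as follows; we use a permutation scheme as a simple stand-in for the bootstrap test mentioned above, with toy data in place of the recorded IFRCs:

```python
import numpy as np

def resid_corr_pvalue(residuals, predicted, n_surr=2000, seed=0):
    """Surrogate test for correlation between residuals and predictions:
    shuffle the residuals to sample the null distribution of the
    correlation coefficient (a permutation-based stand-in for the
    bootstrap test in the text)."""
    rng = np.random.default_rng(seed)
    obs = np.corrcoef(residuals, predicted)[0, 1]
    null = np.array([np.corrcoef(rng.permutation(residuals), predicted)[0, 1]
                     for _ in range(n_surr)])
    return float(np.mean(np.abs(null) >= np.abs(obs)))

rng = np.random.default_rng(5)
pred = rng.standard_normal(300)                    # predicted IFRC (toy)
res_indep = rng.standard_normal(300)               # residuals: no structure
res_corr = 0.5 * pred + rng.standard_normal(300)   # residuals: missed signal
print(resid_corr_pvalue(res_indep, pred))          # large: nothing missed
print(resid_corr_pvalue(res_corr, pred))           # tiny: linear part missed
```

A small *P* value flags exactly the WSTA failure mode: structure left in the residuals that a better linear kernel would have captured.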

The residuals for the CS estimates, which are almost invariant to the stimulus variance (Fig. 6*D*, *inset*), suggest that the uncaptured variance of the actual IFRC (= 1 − *R*^{2}) mostly reflects intrinsic irregularity. Because the GoC fires somewhat irregularly due to noise in its excitability (Forti et al. 2006), the PRC cannot predict this part. As the stimulus amplitude increases, the cell is increasingly forced to fire phase-locked to the stimulus and the prediction becomes more accurate (*R*^{2} increases), until eventually the assumption of a linear perturbation stops holding.

In summary, these results demonstrate that our CS-based estimation strategy can indeed extract the biophysically relevant information from experimental data while avoiding dangerous overfitting.

## DISCUSSION

In this study, we have introduced a novel and efficient method to estimate the PRC based on CS. CS is a recent development in signal processing theory that proposes strategies to recover a signal from highly undersampled data. We showed the similarity between the signal recovery problem and the estimation of the PRC from a limited amount of experimental data and illustrated how we can make use of the CS framework for PRC estimation. We finally demonstrated that the CS method can successfully extract the PRC from small sized data samples using data from simulated neurons and experimental data recorded in vitro.

One of the reasons that our CS method worked well is that it deals with the problems arising for a regression analysis with finite data in a correct way. To illustrate this, we discussed in detail how those issues emerge in PRC estimation, which is formulated as a linear regression problem. In particular, we focused on two problems: applying an approximation that is only good for very large data to a small data set and overfitting noise in the data. As we observed in the examples, those factors caused artifacts by entirely ignoring (i.e., only true at infinitely large data) or fully considering (i.e., overfitting to all the data noise) stimulus correlations, particularly chance correlations of data noise. The CS algorithms, designed to recover a signal from undersampled data, avoid these pitfalls and generate PRC estimations that are little affected by chance correlations. In fact, these problems are well known in statistical learning theory (Bishop 2007; Hastie et al. 2009) but have rarely been considered in PRC estimation except for a recent Bayesian estimation approach (Nakae et al. 2010).

The CS method is also efficient: it generated the estimate for each of our simulated and experimental data sets, including cross-validation, within at most a few minutes in the Matlab environment (Mathworks, MA) running on a desktop computer with a single 2.9 GHz Core 2 Duo processor (Apple, CA). Previously, the Izhikevich method (Izhikevich 2010) with continuous stimulus data has been observed to converge very slowly (Torben-Nielsen et al. 2010) because the relevant Fourier modes are determined by randomly testing whether each mode improves the fit. This type of algorithm is a variant of matching pursuit (Mallat and Zhang 1993), which is computationally expensive (Natarajan 1995). Therefore, it is not surprising that the CS method, based on the much more efficient BP strategy, performs faster. Similarly, the Bayesian estimation algorithm takes several hours to run on a cluster computer (Nakae et al. 2010). The most time-consuming part of our computation was the cross-validation, which used a simple grid-based brute-force approach, repeated 4,000 times, to find the optimal sparseness parameter. Better optimization algorithms could improve efficiency at this stage by reducing the number of repeated evaluations.

On the other hand, the CS method puts constraints on the probe stimulus. The nature of a boundary sharply separating good and bad signal recovery is complicated (Donoho and Tanner 2005a,b), but there exist proven sufficiency criteria (Baraniuk et al. 2008; Candés and Tao 2005; 2006), which are satisfied by the Gaussian random measurements. This corresponds to the continuous Gaussian noise current stimulus, which has been suggested to mimic integrated synaptic current in vivo (Mainen and Sejnowski 1995), and this is the scheme that we used in this study. Another popular choice for the probe stimuli is a set of pulses (for example, see Ermentrout et al. 2001; Galán et al. 2005; Gutkin et al. 2005; Phoka et al. 2010; Stiefel et al. 2008). Since the pulses are sharply localized in time while we estimate the PRC in the Fourier domain, this corresponds to one of the earliest examples of CS where the so-called robust uncertainty principle guarantees good signal recovery (Candés et al. 2006a). Therefore, two popular choices for the probe stimulus are in principle well-suited for applying the CS method.

However, in practice there are caveats. First of all, the stimuli often have temporal correlations to mimic those of synaptic inputs. In our case, the Gaussian noise stimuli for the GoC experiments had a relaxation time of τ = 4 ms. A similar scheme has been used for the pulse inputs, too (Achuthan et al. 2011; Preyer and Butera 2005). In general, such correlations reduce the effective amount of independent information from the whole data set. As we have discussed, the CS method is able to deal with this to some degree since it looks for a sparse solution and therefore circumvents the problem of undersampling, but still coherence can be one of the main factors that limit good estimation (Candés and Romberg 2006). If significant effects of correlation are present, methods specialized for such cases should be used (Zhang and Rao 2011).

Secondly, we used the linear approximation of the phase-response equation, *Eq. 5*, which already linearly approximates the full phase response of the neuron. Therefore, in practice, all the nonlinearities that we have ignored can affect the estimations in many different ways, particularly depending on what kind of stimulus protocol is used. In the pulse stimulus protocol, the unperturbed period can be relatively well defined while there still can be effects of the intrinsic background noise (Phoka et al. 2010), and the effects of the stimulus on top of that can be better isolated. However, in the continuous stimulus protocol, the neuron is continuously driven without any time to relax down to the unperturbed state. Therefore, the nonlinearity can significantly affect the baseline of subthreshold dynamics (either in *Eq. 5* or the full dynamics) and we should be careful about interpreting the estimated PRC.

Our method relies on the central assumption that the PRC is (at least approximately) sparse and can be constructed from a much smaller number of (Fourier) basis modes than the stimulus dimensionality given by the number of sampling points in an average ISI. This also implies that neurons are sensitive specifically to a low dimensional subspace (which we might call the feature space) of the space of all possible stimuli with much larger dimensionality. Recent investigations of how such input specificity, or feature selectivity, can arise from single neuron dynamics suggest that the sensitivity mechanism can be understood via linearization of the neuronal dynamics in the subthreshold regime (Famulare and Fairhall 2010; Hong et al. 2007; Ostojic and Brunel 2011; Paninski 2006). Therefore, the dimensionality of the feature space, which impacts not only computation at the single neuron level (Hong et al. 2008) but also at the population level (Hong et al. 2012), is naturally limited by the number of the active mechanisms in the neuron, each of which contributes a characteristic time scale/frequency (Hong et al. 2007). The PRC characterizes how inputs modulate spontaneous firing rather than how the stimuli evoke spikes, as in the studies cited, and therefore it has rarely been interpreted as an input/output kernel [with notable exceptions (Ermentrout et al. 2007; Moehlis et al. 2006; Phoka et al. 2010)]. However, the neuron spends most of the interspike period in the subthreshold regime where the dynamics evolve slowly, and linearization with limited dimensionality should be able to provide a good approximation of the full dynamics. Therefore, we suggest that our assumption of the sparseness of the PRC is biophysically plausible. Our results indeed confirm the effectiveness of the sparseness assumptions particularly for PRC estimation.

In many neural systems, high input feature specificity and sparseness have been suggested as an underlying principle that enables efficient coding (Olshausen and Field 2004). Based on our results, we further suggest that the CS algorithms can be great tools for discovering and characterizing the input/output function of a single neuron and also of neural ensembles, particularly when the experimentally obtainable data are small but of large dimensionality (Ganguly and Sompolinsky 2012).

## GRANTS

S. Hong and E. De Schutter are supported by Okinawa Institute of Science and Technology.

## DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

## AUTHOR CONTRIBUTIONS

Author contributions: S.H. designed the method and analyzed the data; Q.R. designed and did the experiments; S.H., Q.R., and E.D.S. wrote the paper.

## ACKNOWLEDGMENTS

We thank Ben Torben-Nielsen, Timm Lochmann, Mario Negrello, and Joao Couto for helpful discussion.

## APPENDIX

### Derivation of Eq. 8

Here we describe the detailed derivation of the linear approximation, *Eq. 8*. We first begin with the definition of the PRC, where the time is rescaled as τ = *t*·(*T/T*_{0}). We define the ISI deviation *r* and rescale *x*(τ) as λ*x*(τ) → *x*(τ), and we obtain a simple linear expression, which is correct up to the first order in the magnitude of *x*(*t*) (Ota et al. 2009), *Eq. 17*.

In actual data, the time is discretized with some time step δτ = *T*_{0}/*L*, and a function of time is represented by an *L*-dimensional vector as *f*(*t*) → **f** = {*f*(0), *f*(δτ), …, *f*[(*L* − 1)δτ]}. For each *i*, *Eq. 17* then becomes *r*_{i} = ∑_{j}Δ_{j}*x*_{i}(*j*δτ)δτ, where we redefine **Δ**δτ → **Δ** for convenience. When we collect each *r*_{i} into a vector as **r** = [*r*_{1}, *r*_{2}, …, *r*_{N}], we get a discrete form of *Eq. 17*, **r** = **XΔ**, where *X*_{ij} = *x*_{i}(*j*δτ). Therefore, the PRC estimation has become a linear regression problem: find a vector **Δ** that predicts **r** given the matrix **X**.

So far, we have discussed the PRC represented in the time domain, but this may not be the best representation for efficient estimation when **Δ** can be constructed much more simply in a different representation, such as the frequency domain via Fourier transformation. In general, we can consider a suitable set of basis vectors *u*_{kj} = (**u**_{k})_{j} to transform *Eq. 18*. If the PRC can be constructed with *K* such basis vectors as **Δ** = ∑_{k}Δ̃_{k}**u**_{k}, we can rewrite *Eq. 18* as **r** = **XUΔ̃** = **ΦΔ̃**, where **Φ** = **XU**, which is *Eq. 8*.
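The whole construction can be summarized in a few lines of code, a sketch with hypothetical sizes and a generic Fourier basis for **U**:

```python
import numpy as np

# Sketch of the appendix construction: stimulus segments X (one row per
# ISI, L samples each) and basis vectors U combine into the measurement
# matrix Phi = X @ U, so that r = X @ Delta = (X @ U) @ Delta_tilde.
rng = np.random.default_rng(6)
N, L, K = 200, 64, 7                       # ISIs, samples per ISI, modes

tau = np.arange(L) * 2 * np.pi / L
U = np.column_stack([np.ones(L)] +
                    [f(k * tau) for k in range(1, (K - 1) // 2 + 1)
                     for f in (np.cos, np.sin)])   # K Fourier modes

Delta_tilde = rng.standard_normal(K)       # PRC coefficients in the basis
Delta = U @ Delta_tilde                    # PRC in the time domain
X = rng.standard_normal((N, L))            # discretized stimuli x_i(j*dtau)
Phi = X @ U                                # measurement matrix of Eq. 8

r_time = X @ Delta                         # ISI deviations, time domain
r_modes = Phi @ Delta_tilde                # same deviations via Eq. 8
print(np.allclose(r_time, r_modes))        # prints True: the forms agree
```

Estimating the few coefficients Δ̃ from (**Φ**, **r**) rather than the full **Δ** from (**X**, **r**) is what makes the problem amenable to the CS machinery of the main text.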

- Copyright © 2012 the American Physiological Society