Scalable optimal Bayesian classification of single-cell trajectories under regulatory model uncertainty

Background Single-cell gene expression measurements offer opportunities in deriving mechanistic understanding of complex diseases, including cancer. However, due to the complex regulatory machinery of the cell, gene regulatory network (GRN) model inference based on such data still manifests significant uncertainty. Results The goal of this paper is to develop optimal classification of single-cell trajectories accounting for potential model uncertainty. Partially-observed Boolean dynamical systems (POBDS) are used for modeling gene regulatory networks observed through noisy gene-expression data. We derive the exact optimal Bayesian classifier (OBC) for binary classification of single-cell trajectories. The application of the OBC becomes impractical for large GRNs, due to computational and memory requirements. To address this, we introduce a particle-based single-cell classification method that is highly scalable for large GRNs with much lower complexity than the optimal solution. Conclusion The performance of the proposed particle-based method is demonstrated through numerical experiments using a POBDS model of the well-known T-cell large granular lymphocyte (T-LGL) leukemia network with noisy time-series gene-expression data. Electronic supplementary material The online version of this article (10.1186/s12864-019-5720-3) contains supplementary material, which is available to authorized users.

In [9] and [10], the maximum-likelihood (ML) based classification of single-cell trajectories has been developed. The method uses the ML-adaptive filter proposed in [6] for estimation of the unknown parameters, followed by the Bayes classifier tuned to the ML parameter estimates. The drawback of this method is its inability to use prior knowledge in deriving the classifier. In [11], the intrinsically Bayesian robust (IBR) classifier for the trajectories is developed. This IBR classifier is optimal relative to the prior distribution of unknown parameters.
In this paper, assuming that there are two classes, healthy (c = 0) and cancerous (c = 1), we derive the optimal Bayesian classifier (OBC) [12,13] for classification of single-cell trajectories. The difference between the OBC and IBR classifiers is that in the OBC the expectation of the class-conditional densities is taken over the posterior distribution of the unknown parameters [14,15], whereas in the IBR classifier the expectation is taken over the priors.
Despite the optimality of the developed OBC for singlecell trajectories, its exact computation for large GRNs becomes intractable, due to the large size of the matrices involved. In this paper, we develop a particle-based OBC to scale up the classification of single-cell trajectories. The proposed method contains a bank of Auxiliary Particle-Filter implementations of the Boolean Kalman Filter (APF-BKF) proposed in [16], for both training and test processes.
Our contributions are twofold: 1) Optimal Bayesian Classification: we derive the optimal Bayesian classifiers (OBC) for both single-cell gene expression trajectories and multiple-cell averaged gene expression with uncertain regulatory network prior; and 2) Scalability: the parallel particle filters together with the Monte-Carlo inference have been efficiently used to estimate the likelihood and stationary distributions, making the derived OBC scalable to larger gene regulatory networks.
We apply the APF-BKF-based OBC to classify trajectories of the blood cancer T-cell large granular lymphocyte (T-LGL) leukemia. T-LGL leukemia is a chronic disease characterized by a clonal proliferation of cytotoxic T cells [17]. A Boolean network model of T cell survival signaling in the context of T-LGL leukemia has been constructed by [18] through performing extensive literature search. Then the T-LGL network has been simplified by [17], which constructs the minimum network that preserves the attractor structure of that system. The reduced network contains 18 genes, which has an optimal solution with a transition matrix with 2 2×18 = 68, 719, 476, 736 elements. By contrast, as we will show in our numerical experiments, the proposed APF-BKFbased method captures T-cell dynamics with only 1000 particles.

Gene regulatory network model
Gene regulatory networks are modeled as partiallyobserved Boolean dynamical systems (POBDS). The two components of the POBDS model are a state space model that describes the evolution of the dynamics of the GRN, and an observation model for the measurements. These two components are described below.

GRN state space model
The state process {X k ; k = 0, 1, . . .}, where X k ∈ {0, 1} n , represents the activation (ON)/inactivation (OFF) state for the corresponding gene across time. The state at each discrete time is assumed to be updated through the nonlinear signal model Boolean function called the network function, "⊕" indicates component-wise modulo-2 addition, and n k ∈ {0, 1} n is Boolean transition noise at time k. The modulo-2 addition means that if a bit in the noise n k is 1, the value of the corresponding gene in the Boolean state X k is flipped. In this paper, the noise process n k is assumed to have independent components distributed as Bernoulli(p), where the parameter p > 0 models the noise "intensity" -the closer p is to 0.5, the more chaotic the system will be, while a value of p close to zero means that the state trajectories are nearly deterministic, being governed tightly by the network function and perturbation process. The network possesses a steady-state distribution π ∞ describing its long-run behavior as: where x 1 , . . . , x 2 n denotes the set of the corresponding network states in the Boolean vector representation.

Observation model
The data available to the experimenter is described by the observation process is a vector containing the transcript abundance measurements at time k, for k = 1, 2, . . .. We consider a Gaussian linear model where v k ∼ N (0, σ 2 I n ) is an uncorrelated zero-mean Gaussian noise vector, λ = [ λ 1 , . . . , λ n ] T is a vector of baseline gene expressions corresponding to the "zero" state for each gene, and D = Diag(δ 1 , . . . , δ n ) is a diagonal matrix containing differential expression values for each gene along the diagonal (these indicate by how much the "one" state of each gene is overexpressed over the "zero" state). Such a Gaussian linear model is an appropriate model for single-cell gene-expression data [19,20].

Optimal Bayesian classifier (OBC) for single-cell trajectories
Assume there are two POBDSs corresponding to the healthy and cancerous (mutated) classes, each having n genes. The difference between the healthy and mutated classes could be the over-expression or disruption of a value of single or multiple genes in the mutated case. Let This uncertainty could arise due to some unknown regulations (i.e. interactions) between some genes in the pathway diagram (more information in "Results and discussion" section). We wish to derive the optimal Bayesian classifier (OBC) under uncertainty using all available data and prior knowledge.
If the feature-label distribution is unknown but belongs to an uncertainty class of feature-label distributions, then we desire a classifier to minimize the expected error over the uncertainty class. This expected error is equivalent to the Bayesian minimum mean-square-error estimate [21] given byˆ (ψ) = E θ|Y c ,c [ (ψ, θ)], where (ψ, θ) is the error of ψ on the feature-label distribution parameterized by θ and the expectation is taken over the posterior distribution of parameters π(θ | Y c , c), c = 0, 1.
For a given test trajectory Y, the OBC, minimizing the Bayesian minimum mean-square error estimate, can be obtained as where p θ (.) denotes a probability density function corresponding to parameter θ, p 0 is the prior probability of class 0 and for c = 0, 1. Derivation of (5) requires computing the posterior distribution where π(θ | c) denotes the prior probability of the corresponding network model θ for class c. For an arbitrary trajectoryỸ, we define the log-likelihood function associated to model θ and class c by Now, using the above definition in (4)-(6) leads to the following exact OBC solution: where The expectation in (9) is taken with respect to the posterior distribution of θ, i.e. π(θ|Y c , c), as opposed to IBR classifier [11] that considers the prior distribution π(θ|c). (9) due to the independency of the training trajectories.
As shown in Eq. (8), the OBC requires computing the log-likelihood functions of all training trajectories and test trajectory for both classes and θ ∈ . Let x 1 , . . . , x 2 n denote the corresponding network states in the Boolean vector representation and Y 1:T = (Y 1 , . . . , Y T ) be a single trajectory of length T. The log-likelihood function can be computed as where based on the POBDS model, Define the conditional probability of any network state at time step k for the model θ and class c as Using (11) and (12), the log-likelihood function in (10) can be written in a compact form as [22] where ||.|| 1 denotes L 1 norm, indicating the summation in (10). M θ ,c k is the transition matrix of the Markov state process corresponding to model θ ∈ , with entries given by: with p denoting the Bernoulli noise parameter. T θ,c k (Y k ) is a diagonal matrix, called the update matrix, with the ith diagonal element given by for i = 1, . . . , 2 n , where δ j and λ j are gene-expression parameters associated to the jth gene in class c as defined in (3). Notice that the initial distribution θ,c 0|0 = π ∞ θ,c is the steady-state distribution associated to model θ and the class c defined as This vector can be either computed exactly as introduced in [9] or approximated by creating multiple Monte-Carlo trajectories with relatively long horizons. The posterior distribution can also be recursively computed according to the transition and update matrices as described in [6]: The complexity of computing the log-likelihood function for a single trajectory of length T is of order O(2 2n × T) due to the transition matrix involved in its computation. The whole process of the proposed OBC is presented in Algorithm 1.

Scalable classification of single-cell trajectories
In the previous section, the exact solution for the optimal Bayesian classifier is introduced. However, for large systems with a large number of state variables, the exact computation of Algorithm 1 becomes impractical. This is due to the large transition matrix with 2 2n elements required to compute the log-likelihoods, leading to exponential computational and memory complexity. Thus, the key here is to scale up the OBC for single-cell trajectories by reducing both computational and memory complexity when computing (10). We adopt the Sequential Monte-Carlo (SMC) techniques [23][24][25][26][27] for estimating nonlinear state-space models, such as our POBDS here. These techniques approximate the target distribution using sample points ("particles") drawn from a proposal distribution, taking advantage of the fact that sampling from the proposal distribution is easier than from the target. This helps alleviate the high computation of the exact filter by using a finite set of Monte-Carlo samples. In this paper, we use the Auxiliary Particle Filter implementation of the Boolean Kalman Filter (APF-BKF) proposed in [16] to deal with large GRNs.
Let Y 1:T be a given trajectory and the goal is to approximate the likelihood function in Eq. (10) The particle filter allows us to produce an approximation for the elements in θ,c k|k of (12), simply by using the discrete support of the particles Usually only a few particles have significant weights after a few iterations of the algorithm and most particles have negligible weights. APF-BKF is a look-ahead method that predicts the location of particles with a high probability at time k based on the observations at time step k − 1, with the purpose of making the subsequent resampling step more efficient. Without the look-ahead, the basic algorithm blindly propagates all particles, even those in low probability regimes. The APF-BKF algorithm defines: where ζ k = 1, . . . , N, is an index of the mixture in (18). If we draw from the joint density and then discard the index, then we will have a sample from (18) as required. This procedure carries out the prediction and update steps of the optimal filter in (17). We can approximate (19) by
By simulating the index with the probability v k,i = p θ (Y k | μ k,i , c) w k−1,i , we can sample from P θ (X k , ζ k | Y 1:k , c) and then sample from the transition density given the mixture, P θ (X k | x k−1,i , c).
Actually, we simulate only from particles associated with large predictive likelihoods. We first sample N times from the joint density of P θ (X k , ζ k | Y 1:k , c), and then obtain the new particles {x k,i } N i=1 and their associated weights The auxiliary variables {ζ k,i } N i=1 are obtained by sampling from a discrete distribution: where Cat(a 1 , . . . , a N ) represents a categorical distribution with the probability mass function f (ζ = i) = a i / N j=1 a j . It is shown in [16,28] that the log-likelihood function in (10) can be approximated by: whereL θ c (Y 1:T ) denotes the approximation of L θ c (Y 1:T ). Note that the computational complexity of this algorithm is of order O(NT) which can be much smaller than O(2 2n T) that is the complexity of computing the exact log-likelihood function in (10).
The whole process and the schematic diagram of the proposed classifier are presented in Algorithm 2 and Fig. 1, respectively. During the training process, 2MD 0 D 1 particle filters need to be run for computing the loglikelihood functions of trajectories from all network models in the uncertainty class for two classes. The output values of the particle filters can be used for efficient approximation of the posterior distribution in (9). Then, during the test process, for a given test trajectory, 2M particle filters need to be performed for log-likelihood approximation of all network models (θ ∈ ) and classes (c = 0, 1). These log-likelihood values and posterior probability approximated during the training process can be used to derive the approximate OBC in (8).
The training process of the proposed method has the computational complexity O(2NMTD 0 D 1 ), whereas the exact solution has the complexity O(2 2n+1 MTD 0 D 1 ). The exponential growth of the complexity with the size of network (i.e., number of genes) for the exact solution precludes its application to large GRNs. However, the number of particles, N, by the proposed method can be chosen relatively small according to the attractor structure of

Optimal Bayesian classifier for multiple-cell scenarios
In the previous sections, the classification of singlecell trajectories is discussed. Here, we consider common scenarios in molecular biology research where geneexpression data are often based on the average expression from multiple-cells at different time with different states. Since the trajectories are assumed to be independent and drawn based on the dynamics of the true network, its steady-state distribution π ∞ θ * characterizes the probability of the system being at different states. It can be shown that the multiple-cell data are independent samples from the following measurement model [10]: where {x 1 , . . . , x 2 n } denotes all the network states in the Boolean vector representation. However, we assume that the true network model θ * , is unknown, and is represented by a finite set of M possible network models {θ 1 , . . . , θ M } with prior probability π(θ | c). Let Y be the multiple-cell training measurements available for class c. The optimal Bayesian classifier for a given test sample Y can be represented by The posterior probability of the parameter θ can be computed as Computation of (26) requires computing the conditional probability of training and test samples given all θ ∈ and c ∈ {0, 1}. Let A be an n×2 n matrix containing all Boolean states of the system (i.e., A = x 1 , . . . , x 2 n ). According to (25), the conditional distribution of an arbitrary samplẽ Y given class c and network model θ can be written as where "•" is the Hadamard product. Thus, the Boolean nature of the state vector suggests that each element of the multiple-cell measurement is distributed as a mixture of two Gaussian distributions. Replacing (27) and (28) into (26) leads to the OBC for multiple-cell scenarios.
Comprehensive comparison results between the OBC in single-cell trajectories and multiple-cells are provided in the next section.

Results and discussion
We evaluate the proposed single-cell trajectory classifier and compare its performance with the OBC based on multiple-cell average expression on the T-LGL leukemia Boolean network, whose GRN [17] is shown in Fig. 2. This GRN has 18 genes. The regulating functions are defined in Table 1 [17,18]. The main node is "Apoptosis", which denotes programmed cell death. According to [17],   12: end for 13: APF-BKF (θ, c, Y 1:T , π 0|0 ) [16] 1:L θ c = 0, x 0,i ∼ π 0|0 , w 0,i = 1/N, for i = 1, . . . , N. 2: for k = 1, 2, . . . , do 3: 6: in the mutated case, the node Apoptosis is stuck at OFF state and cannot be activated. As a result, we derive the healthy Boolean network from Table 1 and for the mutated Boolean network we put the value of Apoptosis in Table 1 to zero. This means that in the cancerous scenario, the value of Apoptosis does not obey the regulating functions and is always zero. As we may not know the true network function, we consider four candidate network functions for each of the healthy and mutated networks as the uncertainty class of possible GRN models. In addition to the true network, we remove the operation ¬ of Apoptosis for the genes sFas and GPCR, which are intermediate nodes. Therefore, this network is very close to the true network. For the third network, we remove ¬ of Apoptosis from two other nodes IAP and P2. In the fourth network of this uncertainty class, we change the operation AND to OR for the gene BID. In this study, we use the observation models described in Eqs. (3) and (25) for the single-cell trajectory and multiple-cell averaging, respectively. Figure 3a and b show the classification errors for the simulated data based on the T-LGL leukemia Boolean network versus the number of time points T for two values of state perturbation probability p = 0.05 and 0.1, respectively. For the sake of simplicity, we assume the gene-expression parameters in Eqs. (3) and (25) to be the same for all genes, that are λ =[ λ, . . . , λ] T and D =  Diag(δ, . . . , δ), and set λ = 10, and δ = 30. Two different values are considered for the observation noise level: σ = 20 (low noise), and σ = 25 (high noise). While σ corresponds to both within subject and between subject variations in the single-cell, it shows between subject variation in the multiple-cell because multiple-cells would allow to average out the within subject variance. We set the number of training trajectories D = D c=0 + D c=1 = 4. In both figures, the error curves are monotonically decreasing in terms of the trajectory length T for the single-cell classifier. There is a special case in which the error gets fixed after some T. This may be explained by the effect from the steady-state distributions depending on the lengths of attractor cycles of the networks under study. When the perturbation noise p and the observation noise σ are small, the sufficient T to achieve the least possible error is L + 1, where L is the minimum attractor length in the two networks. More precisely, the BNps tend to the deterministic BNs when the perturbation noise is small, meaning that the observations occur only in the attractor states and circulate inside the attractor cycles. In such a case, L + 1 is the maximum length of a trajectory that can help distinguish the two networks. But there is a nonnegligible probability of jumping states in the considerable perturbation scenarios, so that longer trajectories can be helpful. In all figures for every value of T and p, the error increases with increasing observation noise. While the proposed classifier works in both low-and high-noise scenarios, the classifier based on the multiple-cell expression data only works well in the low-noise scenarios and is very sensitive to σ . In low observation noise scenarios and when p = 0.05, multiple-cell classifier can easily classify, while the trajectory-based with two time-points will have a little bit higher error due to the common short segments of the trajectories between two classes. When there are at least four time-points, which is the attractor cycle size in the uncertainty class of the networks, the performance of the classifiers based on single-cell trajectories is better compared to the ones using multiple-cell even in the low-noise scenarios. Increasing the number of timepoints help better decipher the difference in single-cell trajectories between two classes (healthy vs. cancerous) with improved classification accuracy. Compared to the multiple-cell classifier based on averaged gene expression over cells at different states, the trajectory based classifier can clearly improve the classification performance. Even using four time points, the classification accuracy can be improved up to 8%. Using longer trajectories, the improvement can be up to 18% as indicated in Fig. 3.
In addition to the effect of trajectory length, we would like to investigate how the number of training trajectories affect the performance of the proposed method, especially with a low number of training samples. Figure 4 shows the effect of the number of training trajectories D on the Table 1 Definitions of Boolean functions for the T-LGL leukemia Boolean network with 18 nodes [17,18] Node Regulating function ¬ (Ceramide ∨ Apoptosis) Apoptosis Caspase ∨ Apoptosis classification performance. In the particle filter point of view, increasing D by 1 means increasing the available data as T. Therefore, we set T = 2, that is smallest T, to better see the trend of classification error. Both the average error and its standard deviation decrease with more training trajectories and the classification error converges to a fixed value when D becomes large enough. The value of D required for a converged error rate depends on the parameters T, p, and M. In real-world scenarios that may have significant uncertainty and the perturbation probability is high, we need more training data to improve the performance. Figure 5 illustrates how the model uncertainty may affect the classification performance. In our setup, the uncertainty is manifested as the size of the network uncertainty class -the number of uncertain networks for  Figure 6 shows that increasing the number of particles monotonically decreases the classification error and the average error converges with only 1000 particles. This can be compared with the traditional method [9], which needs computation based on the 2 18  To more comprehensively evaluate the proposed method, we have compared it with two other classification methods, i.e. IRB [11] and Plug-In [10]. The performance comparisons under four different scenarios are provided in Table 2. The OBC stands out as the best performing method in different scenarios. Using OBC improves the accuracy by 8.5% and 1.5% with T = 3 for p = 0.1 and σ = 25 compared to Plug-In and IRB, respectively.
The proposed method does not have any restriction on the noise distribution assumptions due to the generalizability of particle filters. To show this, we also test our method with different noise distributions. While the noise of GRNs is usually Gaussian, sometimes due to up regulation, noise can be Poisson or Negative Binomial (NB). We have simulated Gaussian and NB noise distributions with the same mean and variance while for Poisson noise, the variance is equal to its mean (Details can be found in Additional file 1, the additional experimental results). Additional file 1: Figure S1 shows that our method performs consistently well for all observation noises. The Poisson results are superior to the other models as its variance is smaller.
We also compare the performance of APF with the plain Sequential Importance Resampling (SIR) in high process noise. As Additional file 1: Figure S2 shows, the performance is similar especially when there is enough time points. When the number of time points is low, the SIRbased particle filter performs worse. The superior performance of APF in real-world GRNs is due to the fact that the size of attractors is usually small and the predicted mode values by APF can be a good approximation for the next prediction. Moreover, the initial distribution  The achieved best accuracy is highlighted in boldface is assumed to be from the stationary distribution. This makes APF a more desirable approximation solutions due to the lower diversity in the particles.

Conclusions
In this paper, we have developed the optimal Bayesian classifier for binary classification of single-cell trajectories under regulatory model uncertainty. The partiallyobserved Boolean dynamical system is used for modeling the dynamical behavior of gene regulatory networks. Due to the intractability of the OBC for large GRNs, we have proposed a particle filtering technique for approximating the OBC. This particle-based solution reduces the computational and memory complexity of the optimal solution significantly. The performance of the proposed particle-based method is demonstrated through numerical experiments using a POBDS model of the well-known T-cell large granular lymphocyte (T-LGL) leukemia network based on noisy time-series gene-expression data.