 Research
 Open Access
 Published:
Bayesian gammanegative binomial modeling of singlecell RNA sequencing data
BMC Genomics volume 21, Article number: 585 (2020)
Abstract
Background
Singlecell RNA sequencing (scRNAseq) is a powerful profiling technique at the singlecell resolution. Appropriate analysis of scRNAseq data can characterize molecular heterogeneity and shed light into the underlying cellular process to better understand development and disease mechanisms. The unique analytic challenge is to appropriately model highly overdispersed scRNAseq count data with prevalent dropouts (zero counts), making zeroinflated dimensionality reduction techniques popular for scRNAseq data analyses. Employing zeroinflated distributions, however, may place extra emphasis on zero counts, leading to potential bias when identifying the latent structure of the data.
Results
In this paper, we propose a fully generative hierarchical gammanegative binomial (hGNB) model of scRNAseq data, obviating the need for explicitly modeling zero inflation. At the same time, hGNB can naturally account for covariate effects at both the gene and cell levels to identify complex latent representations of scRNAseq data, without the need for commonly adopted preprocessing steps such as normalization. Efficient Bayesian model inference is derived by exploiting conditional conjugacy via novel data augmentation techniques.
Conclusion
Experimental results on both simulated data and several realworld scRNAseq datasets suggest that hGNB is a powerful tool for cell cluster discovery as well as cell lineage inference.
Background
Singlecell RNA sequencing (scRNAseq) has emerged as a powerful tool for unbiased identification of previously uncharacterized molecular heterogeneity at the cellular level [1]. This is in contrast to standard bulk RNAseq techniques [2], which measures average gene expression levels within a cell population, and thus ignore tissue heterogeneity. Consideration of celllevel variability of gene expressions is essential for extracting signals from complex heterogeneous tissues [3], and also for understanding dynamic biological processes, such as embryo development [4] and cancer [5].
A large body of statistical tools developed for scRNAseq data analysis include a dimensionality reduction step. This leads to more tractable data, from both statistical and computational point of views. Moreover, the noise in the data can be decreased, while retaining the often intrinsically lowdimensional signal of interest. Dimensionality reduction of scRNAseq data is challenging. In addition to high gene expression variability due to cell heterogeneity, the excessive amount of zeros in scRNAseq hinders the application of classical dimensionality reduction techniques such as principal component analysis (PCA). For instance, in realworld datasets, it has been reported that the first or second principal components often depend more on the proportion of detected genes per cell (i.e., genes with at least one read) than on the actual biological signal [6].
Several existing computational tools adopt explicit zeroinflation modeling to infer the latent representation of scRNAseq data. Zeroinflated factor analysis (ZIFA) [7] extends the framework of probabilistic PCA [8] to the zeroinflated setting, by modeling the excessive zeros using Bernoulli distributed random variables which indicate the dropout event. Zeroinflated negative binomialbased wanted variation extraction (ZINBWaVE) [9] directly models the scRNAseq counts using a zeroinflated negative binomial distribution, while accounting for both gene and celllevel covariates. It infers the model parameters using a penalized maximum likelihood procedure.
Despite its popularity, using an explicit zeroinflation term may place unnecessary emphasis on the zero counts, leading to complication in discovering the latent representation of scRNAseq data. In this paper, we propose a hierarchical gammanegative binomial (hGNB) model to both perform dimensionality reduction and adjust for the effects of the gene and celllevel confounding factors simultaneously. Exploiting the hierarchical structure, the proposed hGNB model is capable of capturing the high overdispersion present in the scRNAseq data. More precisely, we factorize the logit of the negativebinomial (NB) distribution probability parameter to identify latent representation of the data. In addition to factorization, linear regression terms are also included in that logit function to adjust for the impact of covariates.
In hGNB, a gamma distribution with varying rate parameter is used to model the cell dependent dispersion parameter of the NB distribution. The celllevel dispersion serves as a means of representing the prevalence of the dropout events. For instance, cells that are sequenced deeply will naturally include less droppedout genes with zero counts, and thus this will be reflected in the cell specific dispersion parameter of NB distribution.
We follow a Bayesian framework, similar to bulk RNAseq setting [10–14], and derive closedform Gibbs sampling update equations for the model parameters of hGNB, by exploiting sophisticated data augmentation techniques. More specifically, we apply the data augmentation technique of [15] (2015) for the NB distribution, and the PolyaGamma distributed auxiliary variable technique of [16] (2013) for the closedform inference of regression coefficients and also latent factor parameters, removing the need for nontrivial MetropolisHastings correction steps [17]. Experimental results on several realworld scRNAseq datasets demonstrate the superior performance of hGNB to identify cell clusters, especially in complex settings, and also its potential application in cell lineages inference.
Methods
hGNB model
In this section we present the hierarchical gammanegative binomial (hGNB) model for factor analysis of scRNAseq data. The graphical representation of hGNB is shown in Fig. 1. The parameters of the hGNB model with their interpretations in the context of scRNAseq experiments are presented in Table 1. Let n_{vj} denote the number of sequencing reads mapped to gene v∈{1,...,V} in the cell j∈{1,...,J}. Under the hGNB model, gene counts are distributed according to a negative binomial (NB) distribution:
where r_{j} and p_{vj} are dispersion and probability parameters of NB distribution, respectively. The probability mass function (PMF) of this distribution can be expressed as \(f_{N}\left (n_{vj}\right) = \frac {\Gamma \left (n_{vj}+r_{j}\right)}{n_{vj}!\Gamma \left (r_{j}\right)} p_{vj}^{n_{vj}} \left (1p_{vj}\right)^{r_{j}}\), where Γ(·) is the gamma function.
Data from scRNAseq experiments exhibit high variability between different cells, even for genes with medium or high levels of expression. To capture this variability, we impose a gamma prior on the celllevel dispersion parameters as
where for simplification, the hyperparameter e_{0} is set to 0.01 in our experiments, and the rate h is learned during the Gibbs sampling inference, presented in the following section. This hierarchical prior on the dispersion parameter, enhances the flexibility of NB distribution to capture the high overdispersion of scRNAseq counts, without the need for explicit zeroinflation modeling.
To account for various technical and biological effects common in scRNAseq technologies, we impose a regression model on the logit of NB probability parameter as
With this particular formulation, the expected count for gene v in cell j can be written as
Thus the regression and factorization terms in (3) can directly adjust the value of expected gene counts in different samples.
The three terms in (3) are integrated to capture different expression variability sources. In the first term, x_{j} is a known vector of P covariates for cell j and β_{v} is the regressioncoefficient vector adjusting the effect of covariates on gene v. The covariate vector x_{j} can represent variations of interest, such as cell types, or unwanted variations, such as batch effects or quality control measures. An intercept term can also be included in these celllevel covariates to account for gene dependent baseline expressions.
In the second term, z_{v} is a vector of Q covariates for gene v, representing gene length or GCcontent for example [18], and δ_{j} is its associated regressioncoefficient vector. We also include a fixed intercept element in z_{v} to account for cellspecific expressions, such as the size factors representing differences in sequencing depth.
In the third term, \(\boldsymbol {\phi }_{v}^{T}\boldsymbol {\theta }_{j}\) corresponds to the latent factor representation of the count n_{vj}, after accounting for the effects of gene and celllevel covariates. More precisely, the unknown K×1 vector ϕ_{v} contains the factor loading parameters which determine the association between genes and latent factors. Moreover, the unknown K×1 vector θ_{j} encodes the popularity of the K factors in the expression of cell j.
We place independent zeromean normal distributions on the components of the regression coefficient parameters β_{v} and δ_{j} as
where α_{p} and η_{q} are precision parameters of the normal distributions and gamma priors are imposed on them. These priors are known as automatic relevance determination (ARD), which are effective tools for pruning large numbers of irrelevant covariates [19, 20]. In addition, by assuming identical precision for components of the regression coefficients across all genes or samples, hGNB borrows statistical strengths to infer these precision parameters.
We impose independent normal priors on latent factor loading and score parameters ϕ_{v} and θ_{j}:
Note that the posterior for these terms is not generally independent or normal, but accounts for the statistical dependence as reflected in the data.
We complete the model by imposing a gamma prior on the precision parameters of normal distributions, and also the rate parameter of gamma distributions. Specifically, throughout the experiments, we set both the shape and rate of these gamma priors to 0.01.
Inference via Gibbs sampling
In this section, we provide an efficient inference algorithm that adopts data augmentation techniques tailored to our hGNB model. Algorithm 1 summarizes all the steps in the Gibbs sampling algorithm.
Sample dispersion parameter We start with the data augmentation technique developed for inferring the NB dispersion parameter [15]. More precisely, the negative binomial random variable n∼NB(r,p) can be generated from a compound Poisson distribution as
where u∼Log(p) corresponds to the logarithmic random variable [21], with the PMF \(f_{U}(u) = \frac {p^{u}}{u\ln (1p)}, u \in \{1,2,...\}\). As shown in [15], given n and r, the distribution of ℓ is a Chinese Restaurant Table (CRT) distribution, (ℓn,r)∼CRT(n,r), which can be generated as \(\ell = \sum _{t=1}^{n} b_{t}, b_{t}\sim \text {Bernoulli}\left (\frac {r}{r+t1}\right)\).
Utilizing this augmentation technique, for each observed count n_{vj}, an auxiliary count is sampled as
Using gammaPoisson conjugacy, the celldependent dispersion parameters are updated as
Sample regression coefficients For the regression coefficients modeling potential covariate effects, the lack of conditional conjugacy precludes immediate closedform inference. Therefore we adopt another data augmentation technique, specifically designed for hGNB, to infer the regression coefficients β_{v} and δ_{j}, relying on the PolyaGamma (PG) data augmentation [16, 22].
Denote ω_{vj} as a random variable drawn from the PG distribution as ω_{vj}∼PG(n_{vj}+r_{j},0). Since \(\mathbb {E}_{\omega _{vj}}\left [\exp \left ( \omega _{vj} \psi _{vj}^{2}/2\right)\right ]=\cosh ^{\left (n_{vj}+r_{j}\right)}\left (\psi _{vj}^{2}/2\right)\), the likelihood of ψ_{vj} in (3) can be expressed as
Exploiting the exponential tilting of the PG distribution in [16], we draw ω_{vj} as
Given the values of the auxiliary variables ω_{vj} for j=1,...,J and the prior in (5), the conditional posterior of β_{v} can be updated as
where \(\Sigma _{v}^{(\beta)} = \left (\text {diag}\left (\alpha _{1},...,\alpha _{P}\right)+\sum _{j} \omega _{vj} \boldsymbol {x}_{j} \boldsymbol {x}_{j}^{T} \right)^{1}\) and \(\mu _{v}^{(\beta)} = \Sigma _{v}^{(\beta)} \left [ \sum _{j} \left (\frac {n_{vj}r_{j}}{2}  \omega _{vj} \left (\boldsymbol {\delta }_{j}^{T}\boldsymbol {z}_{v} + \boldsymbol {\phi }_{v}^{T}\boldsymbol {\theta }_{j}\right) \right) \boldsymbol {x}_{j} \right ]\).
A similar procedure can be followed to derive the conditional updates for celllevel regression coefficients as
where \(\Sigma _{j}^{(\delta)} = \left (\text {diag}\left (\eta _{1},...,\eta _{Q}\right)+\sum _{v} \omega _{vj} \boldsymbol {z}_{v} \boldsymbol {z}_{v}^{T} \right)^{1}\) and \(\mu _{j}^{(\delta)} = \Sigma _{j}^{(\delta)} \left [ \sum _{v} \left (\frac {n_{vj}r_{j}}{2}  \omega _{vj} \left (\boldsymbol {\beta }_{v}^{T}\boldsymbol {x}_{j} + \boldsymbol {\phi }_{v}^{T}\boldsymbol {\theta }_{j}\right) \right) \boldsymbol {z}_{v} \right ]\).
Sample latent factor parameters Using the likelihood function in (9) and the priors in (6), we can derive closedform update steps for factor loading and score parameters. More specifically, the full conditional for factor loading ϕ_{v} is a normal distribution:
where \(\Sigma _{v}^{(\phi)} = \left (I_{K} + \sum _{j} \omega _{vj} \boldsymbol {\theta }_{j} \boldsymbol {\theta }_{j}^{T} \right)^{1}\) and \(\mu _{v}^{(\phi)} = \Sigma _{v}^{(\phi)} \left [ \sum _{j} \left (\frac {n_{vj}r_{j}}{2}  \omega _{vj} \left (\boldsymbol {\beta }_{v}^{T}\boldsymbol {x}_{j} + \boldsymbol {\delta }_{j}^{T}\boldsymbol {z}_{v}\right) \right) \boldsymbol {\theta }_{j} \right ]\).
The full conditional for factor score θ_{j} is also a normal distribution:
where \(\Sigma _{j}^{(\theta)} = \left (\text {diag}\left (\gamma _{1},...,\gamma _{K}\right) + \sum _{v} \omega _{vj} \boldsymbol {\phi }_{v} \boldsymbol {\phi }_{v}^{T} \right)^{1}\) and \(\mu _{j}^{(\theta)} = \Sigma _{j}^{(\theta)} \left [ \sum _{v} \left (\frac {n_{vj}r_{j}}{2}  \omega _{vj} \left (\boldsymbol {\beta }_{v}^{T}\boldsymbol {x}_{j} + \boldsymbol {\delta }_{j}^{T}\boldsymbol {z}_{v}\right) \right) \boldsymbol {\phi }_{v} \right ]\).
Sample precision and rate The precision parameters of normal distributions in (5) and (6) can be updated using the normalgamma conjugacy:
Finally, the rate of gamma distribution in (2) can be updated using the gammagamma conjugacy with respect to the rate parameter:
Results
We evaluate our hGNB model on four different sets of realworld scRNAseq data from different platforms, and compare its performance to those of principal component analysis (PCA), ZIFA [7], and ZINBWaVE [9]. In the following, We briefly describe these scRNAseq datasets. To preprocess these datasets when needed, we followed the same procedures as in [9].
V1 dataset. This dataset characterizes more than 1600 cells from the primary visual cortex (V1) in adult male mice, using a set of established Cre lines [23]. A subset of three Cre lines, including Ntsr1Cre, Rbp4Cre, and Scnn1aTg3Cre, that respectively label layer 4, layer 5, and layer 6 excitatory neurons were selected. We only retained 285 cells that passed the authors’ quality control (QC) filters. The dimensionality reduction methods were only applied to the 1000 most variable genes.
S1/CA1 dataset. This dataset characterizes 3005 cells from the primary somatosensory cortex (S1) and the hippocampal CA1 region, using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing [24]. Gene expression is quantified by UMI counts.
mESC dataset. This dataset includes the transcriptome measurement of 704 mouse embryonic stem cells (mESCs), across three culture conditions (serum, 2i, and a2i), using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing [25]. We excluded the samples that did not pass the authors’s QC filters, resulting in a total of 169 serum cells, 141 2i cells, and 159 a2i cells. The dimensionality reduction methods were only applied to the 1000 most variable genes.
OE dataset. This data characterizes 849 FACSpurified cells from the mouse OE, using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing [26]. We followed the filtering procedure of [27], and filtered the cells that exhibited poor sample quality, retaining a total of 747 cells.
For all datasets, hGNB was run using 2000 MCMC iterations, where after the first 1000 burnin iterations, the posterior samples with the highest likelihood were collected as the point estimates of model parameters corresponding to latent factors. In the dimensionality reduction analysis below, following [9], for S1/CA1 dataset we set the number of latent factors K=3, and for V1 and mESC we set K=2. The average run time for hGNB with 2000 MCMC iterations on a cluster compute node with Intel Xeon 2.5GHz processor was approximately 4 hours.
Goodnessoffit of hGNB model
We have examined the goodnessoffit of hGNB model on V1, S1/CA1 and mESC datasets, using the meandifference (MD) plots. Figure 2 shows the MD plot for the S1/CA1 dataset, where the yaxis is the difference between observed counts and the expected counts under hGNB, and xaxis is the average of these two sets of counts. The solid red line in this figure, which represents the local regression fit [28] to the data, resides near zero for various average levels. This supports the good fit of hGNB model to the highly overdispersed scRNAseq data. Similar trends are observed for V1 and mESC datasets (Supplementary materials).
Capturing zeroinflation
Next we evaluate the performance of hGNB on simulated data based on the zeroinflated NB distribution of [9] (ZINBWaVE) to show that hGNB faithfully captures zero inflation without the need of explicit zeroinflation modeling. Specifically, the capability of hGNB to recover true clustering structure of cells under three zerocount prevalence levels with two different total numbers of cells.
ZINBWaVE models the gene count n a
where π is the zeroinflation probability and μ and σ^{2} are mean and variance of the NB distribution. For each gene and cell, the zeroinflation probability π and the NB mean μ are linked to regression and factorization as in (3). In our simulations, the parameters were learned based on the S1/CA1 dataset. Genes that did not have at least five reads in at least five cells were filtered out and 1000 genes were then sampled at random for each dataset. The number of latent factors was set to K=2. To simulate cell clustering, a Kvariate Gaussian mixture distribution with three components was fitted to the inferred factor score parameters, and then for each simulated dataset, factor scores were generated from Kvariate Gaussian distributions. By adjusting the value of regression coefficients in the zeroinflation term of ZINBWaVE model, we generated synthetic datasets with three levels of zerocount percentages as 40%, 60% and 80% (for details refer to [9]). The number of cells were set to J=100 and J=1000. For each scenario, including cell numbers and zerocount prevalence (sparsity) levels, we simulated 10 datasets.
We evaluate the performance of our method for the clustering task based on the average silhouette width measure. The silhouette width s_{j} of sample j is defined as
where a_{j} is the average distance between sample j and all samples in the cluster that it belongs to, and b_{j} is the minimum average distance between sample j and samples in other clusters.
Tables 2 and 3 show the mean and standard deviation of clustering average silhouette width based on multiple runs of the above simulation setup, for different zerocount prevalence levels and cell numbers. In the setting with small sample size, for 40% and 60% zero fractions, hGNB has the best clustering silhouette width, and for the 80% zero fraction its performance is identical to that of ZINBWaVE. In the setting with moderate sample size, hGNB has the best clustering silhouette width for 40% zero fraction, and for 60% and 80% zero fractions it closely follows the performance of ZINBWaVE. This suggests that the hierarchical structure of hGNB equips it with the capacity to capture highly overdispersed count data, even though an explicit zeroinflation term is not included in its model. Also, ZINBWaVE requires large enough samples to have robust inference results due to the introduction of zeroinflation terms in its model. ZIFA and PCA have a less competitive performance, as they normalize the data before learning its latent representation. Furthermore, in these tables, we have included the performance of Monocle [29] and scVI [30]. Despite exploiting independent component analysis (Monocle) or a deep generative framework (scVI), these two methods also fail to compete with hGNB in terms of cell clustering quality.
Dimensionality reduction
We applied hGNB to the three scRNAseq datasets, V1, S1/CA1 and mESC, to assess its power to separate cell clusters in the low dimensional space, and compared it to PCA, ZIFA, and ZINBWaVE methods. Figure 3 illustrates the projected scRNAseq expression of profiled cells in the twodimensional space for S1/CA1 dataset. The proposed hGNB model provides more biologically meaningful latent representations of scRNAseq gene expressions for S1/CA1 cells, especially compared to PCA and ZIFA that do not model the counts directly. Furthermore, hGNB leads to more separated clusters of cells in the twodimensional space, compared to ZINBWaVE. Specifically, hGNB distinguishes microglia from endothelialmural cells, while ZINBWaVE fails to accomplish this task.
To examine the dimensionality reduction results more carefully, we used the average silhouette width as a measure of goodness for clustering.
Figure 4 shows the average silhouette width of different methods on V1, S1/CA1, and mESC datasets. For PCA and ZIFA, the results on both raw counts and normalized counts are included in this figure. For S1/CA1 dataset, which has the highest number of clusters, the proposed hGNB method outperforms all other methods in terms of clustering average silhouette. For mESC dataset, performance of hGNB is comparable to ZINBWaVE, and it is significantly better than PCA and ZIFA. For V1 dataset, however, we observe that hGNB, besides PCA applied to raw counts, possess the lowest average silhouette. By further examination of the latent representations of cells for this dataset (Supplementary materials), we observe that all methods split the Rbp4Cre_KL100 cells into two clusters, one of them located near Scnn1aTg3Cre cells, suggesting the presence of batch effects, which have led to confounding of latent representations [9].
Identification of developmental lineages
In addition to characterization of cell types, we further demonstrate the capability of hGNB to derive novel biological insights, by analyzing a set of cells from the mouse olfactory epithelium (OE). The samples were collected to identify the developmental trajectories that generate olfactory neurons (mOSN), sustentacular cells (mSUS), and microvillous cells (MV) [26].
We first performed dimensionality reduction on the OE dataset by applying hGNB with K=50. Next, we clustered the cells using the lowdimensional factor score parameters θ_{kj}. More specifically, the resamplingbased sequential ensemble clustering (RSEC) framework implemented in the RSEC function from the Bioconductor R package clusterExperiment [31] was applied to factor scores, leading to identification of 14 cell clusters. The correspondence between the detected clusters and the underlying biological cell types is presented in Table 4. In addition to these already known cell clusters in OE, hGNB is able to detect new clusters, potentially offering novel biological insights.
We further investigated the potential benefit of using the learned latent representation by our proposed hGNB model to infer branching cell lineages and order cells by developmental progression along each lineage. To infer the global lineage structure (i.e., the number of lineages and where they branch), a minimum spanning tree (MST) was constructed on the clusters identified above by RSEC. We used the R package slingshot [32]. Figure 5 illustrates the inferred lineages for the OE dataset, in a twodimensional space obtained by applying multidimensional scaling (MDS) algorithm to the factor scores learned by hGNB. There are three branches in the inferred lineages, with endpoints located in microvillous (MV), mature olfactory sensory neurons (mOSN), and mature sustentacular (mSUS) cells.
Conclusions
We propose a hierarchical Bayesian gammanegative binomial (hGNB) model for extracting low dimensional representations from singlecell RNA sequencing (scRNAseq) data. hGNB obviates the need for explicit modeling of the zeroinflation prevalent in scRNAseq count data. Our hGNB can naturally account for covariate effects at both the gene and cell levels, and does not require the commonly adopted preprocessing steps such as normalization. By taking advantage of sophisticated data augmentation techniques, hGNB possesses efficient closedform Gibbs sampling update equations. Our experimental results on realworld scRNAseq data demonstrates that hGNB is capable of identifying insightful cell clusters, especially in complex settings.
Availability of data and materials
Not applicable.
Abbreviations
 hGNB:

Hierarchical gammanegative binomial
 NB:

Negative binomial
 CRT:

Chinese restaurant table
 ZIFA:

Zeroinflated factor analysis
 ZINBWaVE:

Zeroinflated negative binomialbased wanted variation extraction
 PCA:

Principal component analysis
 ARD:

Automatic relevance determination
 PG:

PolyaGamma
 RSEC:

Resamplingbased sequential ensemble clustering
 OE:

Olfactory epithelium
 mSUS:

Sustentacular cells
 MV:

Microvillous cells
 mOSN:

Olfactory neurons
References
 1
Shapiro E, Biezuner T, Linnarsson S. Singlecell sequencingbased technologies will revolutionize wholeorganism science. Nat Rev Genet. 2013; 14(9):618–30.
 2
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008; 320(5881):1344–9.
 3
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161(5):1202–14.
 4
Deng Q, Ramsköld D, Reinius B, Sandberg R. Singlecell RNAseq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014; 343(6167):193–6.
 5
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, Louis DN. Singlecell RNAseq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014; 344(6190):1396–401.
 6
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, Linsley PS. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 2015; 16(1):1–13.
 7
Pierson E, Yau C. ZIFA: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 2015; 16(1):243.
 8
Tipping ME, Bishop CM. Probabilistic principal component analysis. J R Stat Soc Ser B Stat Methodol. 1999; 61(3):611–22.
 9
Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible method for signal extraction from singlecell RNAseq data. Nat Commun. 2018; 9(1):1–17.
 10
Dadaneh SZ, Qian X, Zhou M. BNPSeq: Bayesian nonparametric differential expression analysis of sequencing count data. J Am Stat Assoc. 2018; 113(521):81–94. https://doi.org/10.1080/01621459.2017.1328358.
 11
Dadaneh SZ, Zhou M, Qian X. Bayesian negative binomial regression for differential expression with confounding factors. Bioinformatics. 2018; 34(19):3349–56.
 12
Zamani Dadaneh S, Zhou M, Qian X. Covariatedependent negative binomial factor analysis of RNA sequencing data. Bioinformatics. 2018; 34(13):61–9.
 13
Boluki S, Dadaneh SZ, Qian X, Dougherty ER. Optimal clustering with missing values. BMC Bioinformatics. 2019; 20(12):321.
 14
Boluki S, Esfahani MS, Qian X, Dougherty ER. Incorporating biological prior knowledge for Bayesian learning via maximal knowledgedriven information priors. BMC Bioinformatics. 2017; 18(14):552.
 15
Zhou M, Carin L. Negative binomial process count and mixture modeling. IEEE Trans Pattern Anal Mach Intell. 2013; 37(2):307–20.
 16
Polson NG, Scott JG, Windle J. Bayesian inference for logistic models using Pólya–Gamma latent variables. J Am Stat Assoc. 2013; 108(504):1339–49.
 17
Chib S, Greenberg E. Understanding the metropolishastings algorithm. Am Stat. 1995; 49(4):327–35.
 18
Risso D, Schwartz K, Sherlock G, Dudoit S. GCcontent normalization for RNASeq data. BMC Bioinformatics. 2011; 12(1):480.
 19
Wipf DP, Nagarajan SS. A new view of automatic relevance determination. In: Advances in Neural Information Processing Systems: 2008. p. 1625–32.
 20
Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001; 1(Jun):211–44.
 21
Johnson NL, Kemp AW, Kotz S. Univariate Discrete Distributions, vol. 444. Hoboken: Wiley; 2005.
 22
Zhou M, Li L, Dunson D, Carin L. Lognormal and gamma mixed negative binomial regression. In: Proceedings of the International Conference on Machine Learning: 2012. p. 1343–50.
 23
Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, Levi B, Gray LT, Sorensen SA, Dolbeare T, Bertagnolli D. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci. 2016; 19(2):335–46.
 24
Zeisel A, MuñozManchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C. Cell types in the mouse cortex and hippocampus revealed by singlecell RNAseq. Science. 2015; 347(6226):1138–42.
 25
Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Bühler M, Liu P, Marioni JC. Single cell RNAsequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell. 2015; 17(4):471–85.
 26
Fletcher RB, Das D, Gadye L, Street KN, Baudhuin A, Wagner A, Cole MB, Flores Q, Choi YG, Yosef N, Purdom E. Deconstructing olfactory stem cell trajectories at singlecell resolution. Cell Stem Cell. 2017; 20(6):817–30.
 27
Perraudeau F, Risso D, Street K, Purdom E, Dudoit S. Bioconductor workflow for singlecell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference. F1000Research. 2017; 6(1158):1158.
 28
Shyu WM, Grosse E, Cleveland WS. Local regression models. In: Statistical Models in S. Routledge: 2017. p. 309–76.
 29
Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, Trapnell C. Reversed graph embedding resolves complex singlecell trajectories. Nat Methods. 2017; 14(10):979.
 30
Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for singlecell transcriptomics. Nat Methods. 2018; 15(12):1053–8.
 31
Purdom E, Risso D. clusterexperiment: Compare clusterings for singlecell sequencing. R package version. 2017; 1(0).
 32
Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, Purdom E, Dudoit S. Slingshot: Cell lineage and pseudotime inference for singlecell transcriptomics. BMC Genomics. 2018; 19(1):477.
Acknowledgements
The authors would like to thank Texas A&M High Performance Research Computing and Texas Advanced Computing Center for providing computational resources to perform experiments in this paper. The authors also would like to thank Ehsan Hajiramezanali and Seyed Nami Niyakan for their help in running scVI and Monocle on the synthetic data.
About this supplement
This article has been published as part of BMC Genomics Volume 21 Supplement 9, 2020: Selected original articles from the Sixth International Workshop on Computational Network Biology: Modeling, Analysis, and Control (CNBMAC 2019): genomics. The full contents of the supplement are available online https://bmcgenomics.biomedcentral.com/articles/supplements/volume21supplement9.
Funding
This work was supported by the National Science Foundation (NSF) Grants 1553281 and 1812641, as well as the United States Department of Agriculture National Institute of Food and Agriculture competitive grant USDANIFASCRI20175118126834 through the National Center of Excellence for Melon at the Vegetable and Fruit Improvement Center of Texas A&M University to XQ; the NSF Grant 1812699 to MZ; the NSF Grant 1532188 and the funding support from QNRF (NPRP90012001, NPRP716342604), CSTR (201701), and CONACYT to PdF. Publication costs are funded by NSF Grant 1553281. The funding agencies had no roles in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Affiliations
Contributions
Conceived the method: SZD, MZ, XQ. Developed the algorithm: SZD, MZ, XQ. Performed the simulations: SZD. Analyzed the results and wrote the paper: SZD, PdF, SHS, MZ, XQ. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file
Additional figures.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Dadaneh, S.Z., de Figueiredo, P., Sze, SH. et al. Bayesian gammanegative binomial modeling of singlecell RNA sequencing data. BMC Genomics 21, 585 (2020). https://doi.org/10.1186/s12864020069388
Published:
Keywords
 Singlecell RNA sequencing
 Bayesian
 Hierarchical modeling