### Over-processing of data by imputation may cause information loss due to data processing inequality and Fano’s lemma

Let three random variables form the Markov chain *X* → *X*^{′} → *Y*, implying that the conditional distribution of *Y* depends only on *X*^{′} and is conditionally independent of *X*. By data processing inequality [13], the mutual information between *X* and *Y* is greater than or equal to that between *X*^{′} and *Y*, i.e.

$$ I\left(X;Y\right)\ge I\left({X}^{\prime };Y\right). $$

(1)

*X* is observed single-cell data, *X*^{′} is imputed data, *Y* is decision variables, such as cell-types or DEG. This equation indicates the information of data cannot be increased from data imputation. Note that if we have a priori information *S* about genes or cell-types, we may have \( I\left(X;Y\right)\le I\left({X}^{\prime };Y|S\right) \), which indicates data imputation with a priori information may improve mutual information. But even in this case, we still have *I*(*X*; *Y*| *S*) ≥ *I*(*X*′; *Y*| *S*).

From Fano’s inequality, we have a lower bound on the detection-error probability (cell-type mis-classification or DEG mis-detection)

$$ {p}_e=\Pr \left(\hat{Y}\ne Y\right)\ge \frac{H(Y)-I\left(X;Y\right)-1}{\log \left(|Y|\right)}. $$

(2)

From data processing inequality, if processed data *X*^{′} instead of un-processed data *X* is used, the right-hand side of eq. (2) becomes bigger. Even though (2) is only a lower bound, data imputation increases the lower bound of error-detection. Therefore, performing data imputation on observed data and performing subsequent analysis leads to information loss and an increase of a lower bound on the detection-error probability. This indicates that there is an opportunity to perform cell-type discovery and DEG analysis simultaneously to prevent such an information loss.

### JOINT algorithm

In the JOINT algorithm we consider a general mixture model

$$ p(x)=\sum \limits_{k=0}^{K-1}{\pi}_k{f}_{k\left(\left.x\right|{\theta}_k\right),} $$

where *x* is observed count number, *k* is the number of cell-types, *π*_{k} is the probability of choosing cell-type *k* and *f*_{k}(*x*|*θ*_{k}) is the probability of observing *x* given parameters *θ*_{k} in cell-type *k*. Given *x* and *θ*_{k}, we compute the posterior probability of observed counts *x* from cell-type *k* as

$$ p\left(\left.k\right|x\right)=\frac{\pi_k{f}_k\left(\left.x\right|{\theta}_k\right)}{\sum_{k=0}^{K-1}{\pi}_k{f}_k\left(\left.x\right|{\theta}_k\right)}. $$

Rather than using hard-clustering methods where a given cell is clustered into a particular cell-type, we obtain the probability of individual cell belonging to each cell-type (Fig. 1a). If a cell has non-zero probability *p* belonging to cell-type *k*, then it contributes accordingly (proportional to *p*) to clustering and DEG analysis for cell-type *k* (Fig. 1a). Here, we assume that *f*_{k}(*x*|*θ*_{k}) takes a generalized zero-inflated negative binomial model with multiple negative binomial components

$$ {q}_{g,k,0}{1}_{x_g==0}+\sum \limits_{l=1}^{L-1}{q}_{g,k,l}\int Gamma\left({\lambda}_{g,k,l}|{\alpha}_{g,k,l},{\beta}_{g,k,l}\right) Poisson\left({x}_g|{s}_c{\lambda}_{g,k,l}\right)d{\lambda}_{g,k,l}, $$

where there are *L* components, *q*_{g, k, 0} is the dropout probability for gene *g* in cell-type *k*, \( {1}_{x_g==0} \) is 1 when *x*_{g} = 0, and otherwise 0. *q*_{g,k,l} is the probability that the observed count *x*_{g} is from the *l*-th negative binomial component for gene *g* in cell-type *k*, and *s*_{c} is a cell level scaler. We choose the same cell scaler as Seurat process which normalizes the library size to 10,000. The dropout probability *q*_{g,k,0} is the probability of observing zero-counts, regardless of the real expression level of gene *g*. When the first dropout term is omitted and *L* = 1, we obtain a *negative binomial model*. When *L* = 2, the model reduces to the *zero-inflated negative binomial model*. When *L* = 3, we obtain a *zero-inflated negative binomial model with two components*. Note that *f*_{k}(*x*|*θ*_{k}) can be also adapted and used for other models in DEG analysis.

To generate observed count *x*, we first draw a cell-type *k* from *π*, which determines a set of parameters used for each gene in cell-type *k*. Then, we choose a negative binomial component type *l* with probability *q*_{g,k,l}. When *l* = 0, we set *x*_{g} = 0, which corresponds to dropout and the process stops. When *l* > 0, we choose *α*_{g,k.l} and *β*_{g,k.l} for each gene in cell-type *k* and generate a Poisson intensity λ_{g,k,l}. Finally, we generate the observed count *x*_{g} from a Poisson distribution with intensity λ_{g,k,l}. Given observed counts in a given cell *x* = [*x*_{0}, …, *x*_{G − 1}], we estimate *θ* = {*α*_{g,k,l}, *β*_{g,k,l}, *q*_{g,k,l}, *π*_{k}} by maximizing the Probability Mass Function

$$ p\left(x|{\pi}_k,{q}_{g,k,l},{\alpha}_{g,k,l},{\beta}_{g,k,l}\right)=\sum \limits_{k=0}^{K-1}{\pi}_k\prod \limits_{g=0}^{G-1}\left({q}_{g,k,0}{1}_{x_g==0}+\sum \limits_{l=1}^{L-1}{q}_{g,k,l}\int Gamma\left({\lambda}_{g,k,l}|{\alpha}_{g,k,l},{\beta}_{g,k,l}\right) Poisson\left({x}_g|{s}_c{\lambda}_{g,k,l}\right)d{\lambda}_{g,k,l}\right), $$

where we assume individual genes obtain independent parameters *α*_{g,k,l}, *β*_{g,k,l}, *q*_{g,k,l}.

We do not assume a constant dispersion across all genes but rather each gene has its own *α*_{g,k,l} and *β*_{g,k,l}. The dropout probability *q*_{g,k,0} is optimized for each gene without assuming specific dependence on the mean expression. Each cell-type has its own negative binomial distribution rather than a single distribution shared across all cell-types. The mixture model is an unsupervised learning problem which is solved using the EM algorithm.

The probability of *x* from cell-type *k* and negative binomial distribution parameters *α*_{g,k,l} and *β*_{g,k,l} (also used for DEG analysis) are calibrated jointly, rather than fixing the cell-type first and estimating parameters for DEG analysis thereafter. Although usually challenging when run on CPU especially with big dataset, model calibration is successfully achieved when it is trained on GPU. All model training and testing was performed on a computer with Intel Xeon CPU E5–2686 v4 @ 2.30GHz with 62GB RAM and NVIDIA Tesla K80 GPU with 17GB memory.

### Model validation using the Zeisel dataset

We chose the Zeisel dataset [21] and analyzed the gene expression with the “Oligodendrocyte” label provided in the dataset for model validation. Top and bottom 10% cells were removed based on their library size. Genes that have non-zero expression between 30 and 90% were chosen. This resulted in a dataset with 742 cells and 3069 genes for model testing and validation. For each gene, we tested the performance of three variations of the JOINT algorithm: 1) *negative binomial* (Poisson-Gamma mixture), 2) *zero-inflated negative binomial* (initial points were: dropout probability *q*_{0} = 0.1, *α* = mean, and *β* = 1), 3) *zero-inflated negative binomial with two components* where one component started from *α* = 0.1 and *β* = 1 (mimic a Poisson component with rate 0.1 from reference [23]) and the other one started from *α* = mean and *β* = 1 in training. The initial probability *q*_{0} was set to 0.5 for the first and 0.4 for the second components. For the proposed generalized zero-inflated negative binomial model with multiple negative binomial components, the probability of getting zero-count is

$$ {q}_{g,k,0}+\sum \limits_{l=1}^{L-1}{q}_{g,k,l}{\left(\frac{\beta_{g,k,l}}{\beta_{g,k,l}+{s}_c}\right)}^{\alpha_{g,k,l}}. $$

In order to test whether the three variations of JOINT algorithm can explain the zero-counts in the Zeisel dataset, we trained all three variations of the algorithm on GPU using TensorFlow, obtained predicted zero-count probability \( {\hat{p}}_{c,g}^0 \) for each gene *g* and cell *c*, then calculated the mean across all cells for each gene \( {\hat{p}}_g^0=\frac{1}{C}\sum {\hat{p}}_{c,g}^0 \). We compared \( {\hat{p}}_g^0 \) to the empirical zero-count probability for each gene \( {\overline{p}}_g^0 \) by counting the number of cells with zero-count (for this gene), divided by the total number of cells. Then, we performed two-sided student t-tests with the null hypothesis that \( {\hat{p}}_g^0-{\overline{p}}_g^0 \) has mean 0, to examine whether each variation of the model can recover the zero-count probability. We found that *p*-values were: *p* = 1.58e^{− 19} for *negative binomial*, *p* = 0.057 for *zero-inflated negative binomial*, and *p* = 1.12e^{− 10} for *zero-inflated negative binomial with two components*. Since we could not reject the null hypothesis (i.e. predicted zero-count probability is the same as the empirical estimate at 95% confidence level), we concluded that the *zero-inflated negative binomial model* can recover the zero-count probability. Although model 3 subsumes model 2, the EM algorithm may converge to a suboptimal local optimum when model 3 is initialized as in Methods.

### Generation of a simulated dataset with two genes and two cell-types

#### Simulation set up

In order to validate and test the clustering performance of the model (Fig. 1b-d, Fig. 2, Fig. S1, S2, S3 and Table S1), we generated a simulated dataset with two genes and two cell-types (clusters) as the “ground truth.” To set up the simulation, we chose *π* = {0.4,0.6}, *q*_{g,k,0} = 0.2, *q*_{g,k,1} = 0.8, and *β*_{g,k,1} = 1.0; first cluster *α*_{0,0,1} = 10 and *α*_{1,0,1} = 5; second cluster *α*_{0,1,1} = 30 and *α*_{1,1,1} = 20.

#### Convergence of the model with iterations

We generated 10,000 samples from the mixture model using parameters described above. In the EM algorithm, we chose initial values *π* = {0.5,0.5}, *q*_{g,k,0} = 0.1, *q*_{g,k,l} = 0.9, and *β*_{g,k,l} = 1.0; first cluster *α*_{0,0,1} = 8 and *α*_{1,0,1} = 8; second cluster *α*_{0,1,1} = 25 and *α*_{1,1,1} = 25. The JOINT algorithm converged after 30 iterations (Fig. 1b and Fig. S1).

#### Convergence of the model with number of samples

For a given number of samples, we randomly generated 50 datasets and applied JOINT on each dataset for statistics. As the number of samples increased, we found that the EM estimate converged to the actual values with smaller variances (Fig. 1c and Fig. S2). This agrees with the fact that Maximum Likelihood (ML) estimates converge almost surely to true values asymptotically when the number of samples goes to infinity [26].

#### Convergence of the model with dropout probability

We fixed the number of samples as 1000 and varied the dropout probability *q*_{g,k,0} from 0.1 to 0.5 with step size of 0.1. At each dropout probability, we generated 50 datasets and ran JOINT on each dataset to test the convergence (Fig. 1d and Fig. S3).

### Generation of a simulated dataset with three cell-types using Zeisel data

We simulated a scRNA-Seq dataset with 3 cell-types (Fig. 3 and Fig. S5, S6, S7). We trained JOINT on cells with the “CA1 Pyramidal” label in the Zeisel dataset [21] for each gene using the EM algorithm. First, we chose cells with > 10,000 library size and genes with non-zero-counts in at least 40% of cells. Then, we trained the JOINT algorithm on the 3529 genes and 834 cells that were selected. Next, we randomly chose 1000 genes without replacement from the selected 3529 genes and generated three cell-types (1200 cells in total). We randomly generated gene counts for 400 cells in each cell-type. In order to generate cells with different DEG numbers, we randomly selected *n* genes (*n* = 50, 100 and 150) from the chosen 1000 genes without replacement and set the mean expression of these genes 1.5 times higher in one cluster than in the other two (1.5 is the median of the gene expression ratio between cells with “CA1 Pyramidal” and “Oligodendrocytes” labels in the dataset (Fig. S4)).

### Evaluation of clustering performance

#### Evaluation of clustering performance using simulated data sets with three genes and three clusters

We assumed the number of cell-types *K* = 3 was known in all algorithms. We performed K-means clustering and spectral clustering on imputed counts from published algorithms with the following transformations: log(1 + count), PCA on log(1 + count) with 2 components, PCA on log(1 + count) with components explaining 25% or 40% of variance. Since we do not know the transformation required to achieve best performance for published imputation algorithms, we tested all 8 transformations for each, and chose the one with the best score for comparison. We also ran the JOINT algorithm (initialized with the same 8 conditions) using original unimputed counts, and chose the one with the highest likelihood as the final solution. In order to obtain clustering scores for JOINT, we assigned each individual cell to the cell-type with the highest posterior probability, converting soft-clustering into hard-clustering results. Although Seurat process [15] can also be used for clustering, different parameters must be chosen for each individual dataset in order to achieve cluster number *K* = 3. Given that the performance of multiple algorithms at different dropout probabilities and DEG numbers needed to be tested extensively, K-means clustering method was used to simplify the process. It is also worth emphasizing that for data mapping and visualization in lower dimensional space, we applied the PCA from the original data without dropout, to the imputed data from published algorithms and data from JOINT, so that all data were transformed with the same projection from higher dimensional space to 2-dimensional space (Fig. 3, Fig. S6, and S7). Mapping to 2-dimensional space allows us to compare these different algorithms by inferring aspects of their relative positions in the original higher dimensional space. This is different than published work where PCA is performed for each individual dataset [11], which makes data incomparable following transformation. Although the simulated dataset may not have the same distribution as the original data, the performance of different algorithms in various conditions can be investigated.

#### Evaluation of clustering performance using real, large-scale scRNA-Seq datasets

We first applied Saver and scImpute algorithms to Baron and Zeisel datasets with default parameters for imputation. Then, we applied standard Seurat process with default parameters to the imputed data using 2000 highly expressed genes and cluster number *K* = 9 and 9 for each dataset. The number of PCA components in Seurat [15] was set to 25 and 45 (from the elbow method [15, 27]) for Baron and Zeisel datasets respectively. Finally, we applied the JOINT algorithm to both datasets.

#### Correlation analysis (cell and gene correlation)

We consider cell to cell correlation and gene to gene correlation. For cell to cell correlation, let *x*_{c} = [*x*_{c,1},..., *x*_{c,G}]^{T} be a vector of counts without dropout for cell *c* and *y*_{c} = [*y*_{c,1},..., *y*_{c,G}]^{T} be the corresponding vector of imputed counts. We compute the Pearson correlation between *x*_{c} and *y*_{c} as

$$ {\rho}_c= pearsonr\left({x}_c,{y}_c\right). $$

The cell to cell correlation is defined as the average of *ρ*_{c} across all cells, i.e.,

$$ {\rho}_{cell}=\frac{1}{C}\sum \limits_{c=1}^C{\rho}_c. $$

Similarly, *x*_{g} = [*x*_{1,g},..., *x*_{C,g}]^{T} be a vector of counts without dropout for gene *g* and *y*_{c} = [*y*_{1,g},..., *y*_{C,g}]^{T} be the corresponding vector of imputed counts. We compute the Pearson correlation between *x*_{g} and *y*_{g} as

$$ {\rho}_g= pearsonr\left({x}_g,{y}_g\right). $$

The gene to gene correlation is defined as the average of *ρ*_{g} across all gene

$$ {\rho}_{gene}=\frac{1}{G}\sum \limits_{g=1}^G{\rho}_g. $$

### Imputation algorithm for data visualization

We impute the observed counts directly. If the observed count is non-zero, we treat it as it is and do not perform imputation. If the observed count is zero, we impute it based on the posterior mean calculated from the JOINT algorithm. Consider a simple case in which we only have one cluster *K* = 1, one negative binomial component *L* = 2, and the observed count is 0. If the observed count is purely from the negative binomial component, the observed count 0 is the true count (the true expression is 0). If the observed count 0 is purely from the zero component, the best estimate in this case is the mean from negative binomial component which we assume is 5. If the probability that the 0 count is from the zero component *q*_{0} = 0.2, the probability from the negative binomial component 1-*q*_{0} = 0.8, and the mean of negative binomial component is 5, then the mean of the count imputed for given observed 0 is 0.2∗5 + 0.8∗0 = 1. We apply the idea formally, given observed count *x*_{c} in cell *c*, we first compute the posterior probability that *c* is from type *k* as

$$ p\left(k|{x}_c\right)=\frac{\pi_k{\Pi}_g{\sum}_l{q}_{g,k,l}h\left({x}_{c,g}|{\theta}_{g,k,l}\right)}{\sum \limits_{\kappa =0}^{K-1}{\pi}_{\kappa }{\Pi}_g{\sum}_{l^{\prime }}{q}_{g,\kappa, {l}^{\prime }}h\left({x}_{c,g}|{\theta}_{g,\kappa, {l}^{\prime }}\right)}, $$

where

$$ h\left({x}_{c,g}|{\alpha}_{g,k,l},{\beta}_{g,k,l}\right)=\left\{\begin{array}{ll}\int Gamma\left({\lambda}_{g,k,l}|{\alpha}_{g,k,l},{\beta}_{g,k,l}\right) Poisson\left({x}_{c,g}|{s}_c{\lambda}_{g,k,l}\right)d{\lambda}_{g,k,l}& l>0\\ {}1,& l=0\end{array}\right.. $$

Given *x*_{g,c} for gene *g* and cell-type *k*, the probability of *x*_{g,c} from the *l*-th negative binomial component is

$$ p\left(l|k,{x}_{g,c}\right)=\frac{q_{g,k,l}h\left({x}_{c,g}|{\theta}_{g,k,l}\right)}{\sum_{l^{\prime }}{q}_{g,k,{l}^{\prime }}h\left({x}_{c,g}|{\theta}_{g,k,{l}^{\prime }}\right)}. $$

The mean of each component *l* is *s*_{c}*m*_{g,k,l} where

$$ {m}_{g,k,l}=\left\{\begin{array}{ll}\frac{\alpha_{g,k,l}}{\beta_{g,k,l}}& l>0\\ {}0,& l=0\end{array}\right. $$

With probability 1 − *p*(0|*k*, *x*_{g,c}) the observed 0 is from a negative binomial component and we do not need imputation in this case. With probability *p*(0|*k*, *x*_{g,c}) the observed count is from dropout events and we use the mean expression (conditional on this count is truly expressed) as the best estimate for imputation. The probability of *l* > 0 conditional on this count is truly expressed is

$$ p\left(l|k,{x}_{g,c}, expressed\right)=\frac{p\left(l|k,{x}_{g,c}\right)p\left( expressed|k,{x}_{g,c},l\right)}{p\left( expressed|k,{x}_{g,c}\right)}=\frac{p\left(l|k,{x}_{g,c}\right)}{p\left( expressed|k,{x}_{g,c}\right)}=\frac{p\left(l|k,{x}_{g,c}\right)}{1-p\left(0|k,{x}_{g,c}\right)}. $$

We thus have the imputation value as

$$ \sum \limits_kp\left(k|{x}_c\right)\left(1-p\left(0|k,{x}_{g,c}\right)\right)\ast 0+p\left(0|k,{x}_{g,c}\right)\sum \limits_{l>0}\frac{p\left(l|k,{x}_{g,c}\right)}{1-p\left(0|k,{x}_{g,c}\right)}{s}_c{m}_{g,k,l}={s}_c\sum \limits_kp\left(k|{x}_c\right)\frac{p\left(0|k,{x}_{g,c}\right)}{1-p\left(0|k,{x}_{g,c}\right)}\sum \limits_{l>0}p\left(l|k,{x}_{g,c}\right){m}_{g,k,l}. $$

### DEG analysis

We apply the Wald test [28] for DEG analysis by directly estimating the mean and the variance of expression conditional on that gene is expressed (or no dropout) for cell-type *k*. Given *p*(*k*|*x*_{c}) and *p*(*l* = 0|*k, x*_{c,g}), let *w*_{c,k} = *p*(*k*|*x*_{c}) and *v*_{c,g,k} = 1 − *p*(*l* = 0|*k*, *x*_{c,g}), where *v*_{c,g,k} is the probability that the observed zero-count is from a negative binomial component. We find the mean by minimizing

$$ \sum \limits_{c,{x}_{c,g}>0}{w}_{c,k}{\left|{x}_{c,g}-{m}_{g,k}\right|}^2+\sum \limits_{c,{x}_{c,g}==0}{w}_{c,k}{v}_{c,g,k}{\left|{x}_{c,g}-{m}_{g,k}\right|}^2. $$

We obtain

$$ {m}_{g,k}=E\left({x}_{c,\mathrm{g}}|k, expressed\right)=\frac{\sum \limits_{c,{x}_{c,g}>0}{w}_{c,k}{x}_{c,g}}{\sum \limits_{c,{x}_{c,g}>0}{w}_{c,k}+{\sum}_{c,{x}_{c,g}==0}{w}_{c,k}{v}_{c,g,k}}, $$

which is a weighted average with weight the probability of the observed count that is expressed in cell-type *k*. Similarly, we compute *E*(*x*^{2}_{c,g}|*k*) and obtain the variance as

$$ {\sigma}^2\left({x}_{c,g}|k, expressed\right)=E\left({x}_{c,g}^2|k, expressed\right)-{E}^2\left({x}_{c,g}|k, expressed\right). $$

Wald test [28] is used with the estimated mean and variance. After model training, it requires simple arithmetic operations to compute the mean and variance for Wald test. The Wald test *p*-values are adjusted using the Benjamini and Hochberg method [29]. As hard-clustering is a special case of soft-clustering with *p*(*k*|*x*_{c})∈{0, 1}, all the proposed DEG algorithms can be readily applied to hard-clustering as well. We are aware that we can use Fisher information matrix to estimate the variance of MLE estimate. However, although a closed-form of Fisher information matrix can be derived, we find the matrix is not always positive semidefinite for real scRNA-Seq data. Therefore, the MLE estimate method cannot be used directly to identify the variance of the EM algorithm. We can also use the likelihood-ratio test. However, it requires training the JOINT multiple times, which is computational expensive.