C-Impute
I-Impute utilizes a subroutine called C-Impute, which performs imputation with an objective function based on continuous similarity and Lasso penalty (see Fig. 1a). The following describes this subroutine.
Data prepossessing
The input of C-Impute is a count matrix \({{{\dot {\boldsymbol {X}}}^{C}} \in M \times N_{total}}\) which contains rows as genes and columns as cells, where M and Ntotal represent the total number of genes and cells correspondingly. The dropout values are replaced by zero counts.
First, C-Impute performs normalization, dimension reduction, and outlier removal as in scImpute [18]. This results in a matrix X∈M×N and Z∈K×N, where K is the reduced dimensionality of metagenes, N is the number of remained cells.
Affinity matrix constructing
From Z, a cell affinity matrix A∈N×N is computed with Euclidean distance and Gaussian Kernel:
$$\begin{array}{*{20}l} \text{Dist}(i, j) &= \left|\left|{\boldsymbol{Z}}^{\top}_{i} - {\boldsymbol{Z}}^{\top}_{j}\right|\right|^{2}_{F} \end{array} $$
(1)
$$\begin{array}{*{20}l} \sigma_{i} &= \text{Dist}(i,k) \end{array} $$
(2)
$$\begin{array}{*{20}l} {\boldsymbol{A}}_{ij} &= \left\{\begin{array}{ll} \exp^{-\frac{\text{Dist}(i, j)}{2\sigma_{i}^{2}}}, \text{ Dist}(i, j) \leq \sigma_{i}, \\ 0, \text{ Dist}(i, j) > \sigma_{i}. \end{array}\right. \end{array} $$
(3)
where i,j represent two different cell indices, \({\boldsymbol {Z}}^{\top }_{i}\) and \({\boldsymbol {Z}}^{\top }_{j}\) indicate the principle components of i-th and j-th cell respectively, ||·||F is the Frobenius norm. For the i-th cell, the kernel width will be set to the distance between it and its n-nearest neighbor, cell k, which stands for the cell whose distance to cell i is n-th smallest in all other cells, where n is a hyper-parameter.
Identification of dropout values and calculating dropout rate
With preprocessed gene expression matrix X, we utilize a statistical model to infer which entries are influenced by the dropout effects. Instead of treating all zero values as missing entries, we use the Gamma-Normal mixture model to learn whether a zero observation originates from dropout or not. We use the Normal distribution to present the actual gene expression level and Gamma distribution to take the dropout events into account. Since the preprocessed matrix X is no longer of integral values, we cannot adopt zero-inflated negative binomial (ZINB) distribution.
For the i-th gene and its observed value x in prepossessed gene profiling Xi, the Gamma-Normal mixture model will be:
$$\begin{array}{*{20}l} {}\begin{aligned} &f_{\text{Gamma-Normal}}\left(x; {\boldsymbol{\pi}}_{i}, {\boldsymbol{\alpha}}_{i}, {\boldsymbol{\beta}}_{i}, {\boldsymbol{\mu}}_{i}, {\boldsymbol{\sigma}}_{i}\right) \\ =&{\boldsymbol{\pi}}_{i} \text{Gamma}\left(x;{\boldsymbol{\alpha}}_{i},{\boldsymbol{\beta}}_{i}\right)+\left(1-{\boldsymbol{\pi}}_{i}\right)\text{Normal}\left(x;{\boldsymbol{\mu}}_{i}, {\boldsymbol{\sigma}}_{i}\right) \end{aligned} \end{array} $$
(4)
where πi is the dropout rate of gene i, αi and βi is the shape and rate parameter of Gamma distribution respectively, μi and σi are the mean and standard deviation of Normal distribution. The estimated model parameters \(\hat {{\boldsymbol {\pi }}}, \hat {{\boldsymbol {\alpha }}}, \hat {{\boldsymbol {\beta }}}, \hat {{\boldsymbol {\mu }}}\), and \(\hat {{\boldsymbol {\sigma }}}\) are obtained by Expectation-Maximization (EM) algorithm. Then, we can calculate the dropout probability matrix D∈M×N.
$$\begin{array}{*{20}l} {\boldsymbol{D}}_{ij}= \frac{{\boldsymbol{\pi}}_{i} \text{Gamma}\left({\boldsymbol{X}}_{ij}; {\boldsymbol{\alpha}}_{i}, {\boldsymbol{\beta}}_{i}\right)}{f_{\text{Gamma-Normal}}\left({\boldsymbol{X}}_{ij}; {\boldsymbol{\pi}}_{i}, {\boldsymbol{\alpha}}_{i}, {\boldsymbol{\beta}}_{i}, {\boldsymbol{\mu}_{i}}, {\boldsymbol{\sigma}}_{i}\right) } \end{array} $$
(5)
This mixture model enables the identification of whether an observed value is a dropout value or not, since a zero value can be either caused by a technical error or may reflect the actual expression value. If a gene has high expression and low variation in most of its similar cells, a zero count will have high dropout probability and more likely to be a dropout value; otherwise, the zero value may exhibit real biological variability [18].
Imputation of dropout values
To impute the gene expression levels, we first define a hyper-parameter t which is used as the threshold to determine if Xij is a dropout event. An entry of dropout probability less than t is considered a real observation, in which case its value is retained. Otherwise, while values with dropout probability higher than t will be replaced by imputation result. We perform imputation by linear regression weighted by dropout probability and cell affinity.
$$\begin{array}{*{20}l} \hat{{\boldsymbol{X}}}_{ij} &= \left\{\begin{array}{lc} {\boldsymbol{X}}_{ij}, {\boldsymbol{D}}_{ij} < t, \\ \left(\left(1-{\boldsymbol{D}}_{\bar{j}}^{\top}\right)\circ\left({\boldsymbol{A}}_{j\bar{j}} \odot {\boldsymbol{X}}^{\top}_{\overline{j}}\right)\right){\boldsymbol{B}}^{\top}_{j}, {\boldsymbol{D}}_{ij} \geq t. \end{array}\right. \end{array} $$
(6)
where \({\boldsymbol {D}}^{\top }_{j}\) and \({\boldsymbol {X}}^{\top }_{j}\) are the j-th column of D and X respectively. The ∘ operator is the Hadamard product which follows (P∘Q)ij=PijQij. \(\bar {j}\) denotes all indices except index j, thus \({\boldsymbol {D}}^{\top }_{\bar {j}}\) and \({\boldsymbol {X}}^{\top }_{\bar {j}}\) denotes the sub-matrix of D and X which contains all cells except the j-th cell, respectively. \({\boldsymbol {A}}_{j\bar {j}}\) stores the pairwise affinity between j-th cell and all other cells; \({\boldsymbol {X}}^{\top }_{\bar {j}}\) is a sub-matrix of X which contains all cells except the j-th cell. ⊙ operator represents the vector and matrix multiplication, e.g. (p⊙Q)ij=piQij. Leveraging \(\left (1-{\boldsymbol {D}}^{\top }_{j}\right)\circ {\boldsymbol {X}}^{\top }_{j}\) as target indicates that genes with high dropout probability in j-th cell will not contribute to optimization. Furthermore, the multiplication of \(\left (1-{\boldsymbol {D}}_{\bar {j}}^{\top }\right)\) and \({\boldsymbol {A}}_{j\bar {j}}\) ensures that the information is only borrowed from the trusted genes with low dropout probabilities in the similar cells. Non-negative weights \({\boldsymbol {B}}^{\top }_{j}\) are extra contributions of all other cells learned from regression.
For j-th cell, the objective is:
$$\begin{array}{*{20}l} \begin{aligned} \underset{{\boldsymbol{B}}^{\top}_{j}}{\min} &\sum\limits_{j = 1}^{N}||\left(1-{\boldsymbol{D}}^{\top}_{j}\right)\circ{\boldsymbol{X}}^{\top}_{j} \\ &- \left[\left(1-{\boldsymbol{D}}_{\bar{j}}^{\top}\right)\circ\left({\boldsymbol{A}}_{j\bar{j}} \odot {\boldsymbol{X}}^{\top}_{\overline{j}}\right)\right]{\boldsymbol{B}}^{\top}_{j}||_{F}^{2} \\ &+ \lambda \left|\left|{\boldsymbol{B}}^{\top}_{j}\right|\right|_{1},\\ \text{ sub}&\text{ject to }{\boldsymbol{B}}^{\top}_{j} \geq 0 \\ \end{aligned} \end{array} $$
(7)
\(\mathcal {L}\)1 is applied to avoid over-fitting and further ensure that the imputation borrow information from the cell’s most similar neighbors.
Assume \(y \in \mathbb {R}^{M} = \left (1-{\boldsymbol {D}}^{\top }_{j}\right)\circ {\boldsymbol {X}}^{\top }_{j},\beta \in \mathbb {R}^{N} = {\boldsymbol {B}}_{j}^{\top }, X \in \mathbb {R}^{M \times N} = \left (1-{\boldsymbol {D}}_{\bar {j}}^{\top }\right)\circ \left ({\boldsymbol {A}}_{j\bar {j}} \odot {\boldsymbol {X}}^{\top }_{\overline {j}}\right)\), for each j-th cell we can simplify the objective to non-negative least squares lasso regression \(\min _{\beta }\left |\left |y - X\beta \right |\right |_{2}^{2} + \lambda \left |\left |\beta \right |\right |_{1},\beta \geq 0\), and solve it by coordinate descent [31].
I-Impute
As mentioned, I-Impute performs a self-consistent imputation on scRNA-seq data. The method is as illustrated in Fig. 1b. I-Impute utilizes C-Impute to iteratively refine SAVER processed data. After a few iterations, the result converges to a self-consistent matrix (<θ) and is given as I-Impute’s output.
We define self-consistency of a functional mapping f:x→x given by input data X∈M×N:
$$\begin{array}{*{20}l} \begin{aligned} {\boldsymbol{X}}_{output} &= f({\boldsymbol{X}}) \\ \text{self-consistency}\,(f; {\boldsymbol{X}}) &= \frac{\left|\left|{\boldsymbol{X}}_{output} - f\left({\boldsymbol{X}}_{output}\right)\right|\right|^{2}_{F}}{M \times N} \\ \text{self-consistency}\,(f; {\boldsymbol{X}}) &< \theta \rightarrow f \text{ is self-consistent} \end{aligned} \end{array} $$
(8)
Evaluation metrics
Adjusted rand index and normalized mutual information
The adjusted Rand index (ARI) [32] and normalized mutual information (NMI) [33] are adopted as clustering accuracy. They measure the similarity between a clustering result and the actual clusters. A value close to 0 indicates random labeling or no mutual information, and a value of 1 demonstrates 100% consistency between the clustering and the actual clusters.
Silhouette width
The silhouette width (SW) measures the similarity of a sample to its class compared to other categories [34]. It ranges from -1 to 1. A higher silhouette value suggests a more appropriate clustering. A silhouette value near 0 indicates overlapping clusters and a negative value indicates that the clustering has been performed incorrectly. We adopted the silhouette width to evaluate the model’s imputation power. We used the ground-truth subtype classes as the input cluster labels.
Simulation and benchmark settings
Splatter are used to generate simulated scRNA-seq data. The parameters used for our simulation dataset are nGroups=3, nGenes=2000, batchCells=150, seeds=42, dropout.type=“experiment”, dropout.shape=-1 and dropout.mid=2, 3, 5 for three different dropout rate data.
SAVER and scImpute are the state-of-the-art tools which I-Impute is compared against. For the SAVER R package, we used the “saver” function with the parameters ncores=12 and estimates.only=TRUE to perform the imputation tasks. The parameters for scImpute are drop_thre=0.5, ncores=10, Kclusters=(number of true clusters in input data).
On synthetic data, I-Impute configuration is n=40, normalize=False, and iteration=True. On real data sets, I-Impute configuration is n=40, and iteration=True when tested with the mouse Bladder cell dataset and ES cell dataset, and is n=20, and iteration=True when tested with the mouse Aortic Leukocyte cell dataset.