 Methodology
 Open Access
IImpute: a self-consistent method to impute single cell RNA sequencing data
BMC Genomics volume 21, Article number: 618 (2020)
Abstract
Background
Single-cell RNA-sequencing (scRNA-seq) is becoming indispensable in the study of cell-specific transcriptomes. However, in scRNA-seq techniques, only a small fraction of the genes are captured due to “dropout” events. These dropout events require intensive treatment when analyzing scRNA-seq data. For example, imputation tools have been proposed to estimate dropout events and denoise data. The performance of these imputation tools is often evaluated, or fine-tuned, using various clustering criteria based on ground-truth cell subgroup labels. This limits their effectiveness in cases where cell subgroup knowledge is lacking. We consider an alternative strategy which requires the imputation to follow a “self-consistency” principle; that is, the imputation process refines its results until no internal inconsistency or dropout remains in the data.
Results
We propose the use of “self-consistency” as a main criterion in performing imputation. To demonstrate this principle we devised IImpute, a “self-consistent” method, to impute scRNA-seq data. IImpute optimizes continuous similarities and dropout probabilities in iterative refinements until a self-consistent imputation is reached. On the in silico data sets, IImpute consistently exhibited the highest Pearson correlations across different dropout rates compared with the state-of-the-art methods SAVER and scImpute. Furthermore, we collected three wet-lab datasets (a mouse bladder cell dataset, an embryonic stem cell dataset, and an aortic leukocyte cell dataset) to evaluate the tools. IImpute supported cell subpopulation discovery on all three datasets and achieved the highest clustering accuracy compared with SAVER and scImpute.
Conclusions
A strategy based on “self-consistency”, captured through our method, IImpute, gave imputation results better than the state-of-the-art tools. Source code of IImpute can be accessed at https://github.com/xikanfeng2/IImpute.
Background
Single-cell RNA-sequencing (scRNA-seq) is becoming indispensable in studying the landscapes of cell-specific transcriptomes [1]. It demonstrates robust efficacy in capturing transcriptome-wide cell-to-cell heterogeneity with high resolution [2–5]. With meta information such as time series or patient histology, scRNA-seq has the potential to decipher the underlying patterns in cell cycles [6–8], complex diseases [9–11], and cancers [8, 12–16].
As with other sequencing techniques, scRNA-seq produces a count matrix which captures expression profiles, with genes as rows, cells as columns, and the gene counts as the matrix elements. scRNA-seq only captures a small fraction of the genes due to “dropout” events. That is, it produces a zero-inflated count matrix in which only about 10% of the entries are non-zero [17]. This is mainly because truly expressed transcripts are missed in some cells during sequencing. The dropout rate is protocol-dependent [18]. When analyzing scRNA-seq data, the excess zero counts from dropout events need to be remedied. Otherwise, the differing zero-count distributions of different protocols will affect downstream analyses [18], such as clustering, cell type recognition, dimension reduction, differential gene expression analysis, identification of cell-specific genes, and reconstruction of differentiation trajectories on zero-inflated single-cell gene expression data [18]. The correctness of all these analyses is contingent on the correctness of the expression profile.
As a remedy, downstream scRNA-seq-based analyses such as clustering, cell type recognition, and dimension reduction can be adapted to implicitly account for dropout events [19–22]. Alternatively, dropout events can be treated prior to downstream analysis with scRNA-seq imputation tools. Two such leading tools are SAVER and scImpute. SAVER [23] imputes by borrowing information across genes using a Bayesian approach which estimates the expression levels. It aims to reduce meaningless biological variation and retain valuable biological variation. One caveat is that SAVER adjusts all gene expression levels, including genes that are truly not expressed, and may hence introduce new biases and erase real biological signal. scImpute [18] is designed to first identify dropout values with a Gamma-Normal mixture model, and then impute the dropout events by borrowing information from similar cells, leaving the expression levels of non-dropout entries unchanged. It automatically excludes the outlier cells and their gene information, which are likely to influence the original imputation values. While scImpute avoids the problem which SAVER faces, it performs poorly on extremely sparse datasets.
On in silico data where the ground truth counts are known, the root mean square error (RMSE) between imputed and ground truth entries is the most common metric for imputation evaluation [24]. For wet-lab data sets, the ground truth counts for missing events are unknown. One common practice is to randomly remove non-zero entries and employ an imputation tool to impute these removed entries. Then, the RMSE for the removed entries is calculated as a criterion to evaluate the performance of the imputation [24, 25]. Another common practice is to implicitly validate imputation efficacy by checking whether the imputation improves the downstream analysis result. This check, however, typically requires additional knowledge. For instance, clustering measurements such as adjusted Rand index (ARI), normalized mutual information (NMI), silhouette width (SW), and within-cluster sum of squares are commonly adopted for scRNA-seq imputation evaluation [18, 26], but these evaluations all require the true cluster labels, which are often hard to obtain.
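The mask-and-impute evaluation described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names (`mask_nonzero_entries`, `masked_rmse`) and the toy `row_mean_imputer` are hypothetical stand-ins, not part of any of the tools discussed.

```python
import math
import random


def mask_nonzero_entries(matrix, fraction, seed=0):
    """Copy `matrix` and set a random `fraction` of its non-zero entries to 0.

    Returns the masked copy and the list of (row, column) positions removed.
    """
    rng = random.Random(seed)
    nonzero = [(i, j) for i, row in enumerate(matrix)
               for j, value in enumerate(row) if value != 0]
    n_mask = int(round(fraction * len(nonzero)))
    positions = rng.sample(nonzero, n_mask)
    masked = [list(row) for row in matrix]
    for i, j in positions:
        masked[i][j] = 0
    return masked, positions


def masked_rmse(truth, imputed, positions):
    """RMSE restricted to the entries that were deliberately removed."""
    if not positions:
        raise ValueError("no masked positions to evaluate")
    squared = sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in positions)
    return math.sqrt(squared / len(positions))


def row_mean_imputer(matrix):
    """Hypothetical toy imputer: fill zeros with the mean of the row's non-zero values."""
    result = []
    for row in matrix:
        nonzero = [v for v in row if v != 0]
        fill = sum(nonzero) / len(nonzero) if nonzero else 0.0
        result.append([v if v != 0 else fill for v in row])
    return result


counts = [[2, 2, 2, 0], [4, 4, 0, 4]]          # genes x cells
masked, positions = mask_nonzero_entries(counts, fraction=0.5, seed=1)
score = masked_rmse(counts, row_mean_imputer(masked), positions)
```

Any real imputation tool could take the place of `row_mean_imputer`; the score is only meaningful on the held-out positions, since the true values of the original zeros remain unknown.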
As an explicit measurement, imputation consistency has been discussed in several studies. Van Buuren et al. [27] stated that the imputed entries should remain internally consistent with the non-missing data. Liang et al. [28] adopt a consistent estimate after the imputation step for high-dimensional data. Here, we propose a new interpretation of imputation consistency. A reliable imputation tool should regard its own output as free of dropouts and errors, so we want the tool to be consistent in its output: if we feed the output to the imputation tool again after eliminating a number of entries, the tool should be able to reproduce these entries. We refer to this property as self-consistency.
Therefore, to study the effects of the new criterion, we developed a self-consistent method called IImpute for scRNA-seq data imputation. We compared IImpute with the state-of-the-art imputation tools by evaluating their imputation performance as well as their self-consistency. On the in silico data sets, IImpute consistently exhibited the highest Pearson correlations across different dropout rates compared to SAVER and scImpute. Furthermore, several discrete cell subpopulations have been reported in scRNA-seq data collected from the wet lab; the identification of subpopulations of cells is crucial [29]. Here, we collected three wet-lab datasets (a mouse bladder cell dataset, an embryonic stem (ES) cell dataset, and an aortic leukocyte cell dataset) to evaluate the tools. IImpute supported cell subpopulation discovery on all three datasets and achieved the highest clustering accuracy compared to SAVER and scImpute.
Results
Evaluating the self-consistency of existing imputation tools in synthetic data
To evaluate the imputation tools, we applied the R package Splatter [30] to generate scRNA-seq read count data. We simulated 150 cells in three groups, each with 2,000 genes. Then we generated three sparse matrices by setting the dropout rates to 88.45%, 63.29%, and 45.16%; their corresponding zero rates are 90.87%, 70.98%, and 56.65%, respectively.
We first validated whether the existing imputation tools are self-consistent. We consider the imputation process as a complex function f:x→x that maps the zero-inflated matrix into an output matrix of the same shape. We say that f is self-consistent if and only if the root mean square error (RMSE) between x and f(x) is no greater than a predetermined threshold θ, that is, ‖x−f(x)‖_{2}≤θ. The results are shown in Table 1. We found that SAVER and scImpute are not self-consistent. scImpute has RMSE values of 7.346 on the 88.45% dropout data, 0.2392 on the 63.29% dropout data, and 0.2677 on the 45.16% dropout data. For these data sets, SAVER has RMSE values of 0.5613, 1.0245, and 1.3561, respectively. Nevertheless, when ground truth group labels are incorporated, traditional evaluation metrics show that SAVER outperformed scImpute with respect to adjusted Rand index (ARI), normalized mutual information (NMI), and silhouette width (SW) (see Additional file 1, Table S1).
Our tool, IImpute, is built both to satisfy the principle of self-consistency and to optimize the existing imputation metrics (ARI, NMI, and SW). As illustrated in Fig. 1a, IImpute first calls an internal subroutine (called CImpute), which uses continuous similarities and dropout probabilities to infer missing entries. Then, IImpute invokes SAVER as a subroutine to preprocess the data. Finally, it deploys CImpute iteratively on the processed data (see Fig. 1b). As illustrated in Additional file 1, Fig. S1, after some number of iterations, the RMSE of IImpute falls below 0.1, which is much smaller than that of SAVER and scImpute. That is, assuming θ=0.1, the imputed result converges to a self-consistent matrix, with RMSE values of 0.0936, 0.0806, and 0.0381 on the three synthetic datasets, respectively (see Table 1).
IImpute recovers gene expression affected by dropouts in synthetic data
To validate the performance of IImpute, we plotted heatmaps of the raw matrix, the 88.45% dropout matrix, and the imputed matrices, respectively (see Fig. 2a–f). IImpute’s output is closest to the raw matrix, compared to SAVER, scImpute, and CImpute. As illustrated in Fig. 2g, SAVER failed to reproduce many entries of the raw matrix, leading to the lowest Pearson correlation, 0.58, between its output and the ground truth. scImpute and CImpute changed some highly expressed elements into zero, hence introducing new bias after imputation (see Fig. 2h–i). With no extreme pull-down or pull-up predictions, IImpute exhibited the most robust recovery power, with the highest Pearson correlation of 0.78 (see Fig. 2j). On data with 63.29% and 45.16% dropout rates, IImpute also gave the highest Pearson correlations of 0.90 and 0.94, respectively (see Additional file 1, Table S4).
The t-SNE embedding plots of the raw matrix, the 88.45% dropout matrix, and the recovered matrices show that SAVER, CImpute, and IImpute recover the missing entries while preserving the cell subgroup structure well (see Fig. 3a–f). Silhouette width (SW) further validated that the in-group similarity and out-group separation were enhanced after imputation by SAVER, CImpute, and IImpute. That is, the average silhouette value increased from 0.0862 (dropout data) to 0.1075 (SAVER), 0.1705 (CImpute), and 0.2429 (IImpute), respectively (see Additional file 1, Table S1). Figure 3g demonstrates that IImpute achieves the most noticeable improvement, while scImpute yields lower SW values than the dropout data. Next, we applied hierarchical clustering to all matrices, and computed the adjusted Rand index (ARI) and normalized mutual information (NMI) to evaluate the clustering accuracy. ARI and NMI measure the overlap between the inferred groups and ground-truth clusters; a score of 0 implies random labeling while 1 indicates perfect inference. In Fig. 3g, IImpute outperforms all other tools and exhibits the best subpopulation identification strength, with the highest clustering accuracy (ARI: 0.8721, NMI: 0.8521, see Additional file 1, Table S1). Experiments on the data sets with 63.29% and 45.16% dropout rates also showed that IImpute produced the best recovered matrices, with ARI 1.0, NMI 1.0, and SW 0.3908 for the 63.29% dropout rate, and ARI 0.9801, NMI 0.9710, and SW 0.4123 for the 45.16% dropout rate (see Additional file 1, Tables S2–S3).
Overall, the synthetic experiment demonstrates that by incorporating CImpute to iteratively refine the SAVER-processed data, IImpute is able to mitigate the inconsistency in SAVER’s result, which leads to improved imputation.
IImpute promotes cell subpopulation identification in real data sets
To examine the effects of IImpute on the identification of cell subpopulations, we performed tests on three real scRNA-seq datasets. The first test involves a dataset of mouse Bladder cells which contains 162 cells of three cell types. Due to dropout events, 73.5% of the read counts in the raw count matrix are zeros. We evaluated the imputation power by reviewing the t-SNE embedding result and silhouette width (SW). scImpute mixes part of the Unknown-type cells (purple dots) with the Fibroblasts1 cells (blue dots) and Fibroblasts2 cells (yellow dots); SAVER, CImpute, and IImpute distinguish the Unknown-type cells from the Fibroblasts1 and Fibroblasts2 cells well. Compared with the raw and other imputed data, IImpute produced the most compact clusters, with the highest silhouette width of 0.1758 (Fig. 4a). We then compared the hierarchical clustering accuracy in terms of ARI and NMI. Both measurements show that IImpute resulted in the best clustering (ARI: 0.6054, NMI: 0.7892), compared to those based on the imputations by SAVER (ARI: 0.5253, NMI: 0.7085), scImpute (ARI: 0.1937, NMI: 0.45), or CImpute (ARI: 0.1664, NMI: 0.4317) (Fig. 4a, Additional file 1, Table S5).
We next tested the tools on a mouse embryonic stem (ES) cell dataset. This dataset contains 2717 cells of four cell types (mouse ES cells sample 1, mouse ES cells LIF 2 days, mouse ES cells LIF 4 days, and mouse ES cells LIF 7 days). Because scImpute has a long running time on datasets with many cells, we randomly selected 200 cells; no subpopulations or genes were excluded during this process. Due to dropout events, 67.0% of the read counts in the raw count matrix are zeros. Figure 4b shows that SAVER and IImpute achieved markedly better imputation power than the other tools. In the 2D t-SNE embedding space, the results from SAVER and IImpute both separate the 2 days cells (the yellow dots) from the 4 days cells (the green dots) and the 7 days cells (the blue dots) well. From the silhouette width, adjusted Rand index, and normalized mutual information, we found that IImpute (ARI: 0.7047, NMI: 0.7444, SW: 0.2275) produced a tighter and more accurate in-cluster structure than SAVER (ARI: 0.692, NMI: 0.7329, SW: 0.2235) (Additional file 1, Table S6). Hence IImpute allowed identification of the cell subpopulations in spite of the 67.0% missing rate.
Finally, we performed a test with a mouse Aortic Leukocyte cell dataset. This dataset contains 378 cells of six cell types (B cells, T cells, T memory cells, Macrophages, Nuocytes, and Neutrophils). Due to dropout events, 91.2% of the read counts in the raw count matrix are zeros. Both SAVER and IImpute grouped the T memory cells (the yellow dots) into one large cluster, while in the raw data and the other imputed matrices, T memory cells are separated into different clusters (see Fig. 4c). In this test, IImpute gave a silhouette width of 0.0711, which is poorer than the result from SAVER. Nevertheless, IImpute outperformed all other tools in hierarchical clustering tasks with the highest ARI (0.522) and NMI (0.7728) (Additional file 1, Table S7).
Discussion
In this paper, we introduced IImpute, which is designed to impute missing entries in scRNA-seq data iteratively. Experiments using synthetic and real data demonstrated IImpute to be particularly suited for cell subpopulation discovery.
IImpute has several advantages over scImpute and SAVER. First, IImpute produces results that are treated consistently when they are given back as input, and the imputed matrix has a tighter hierarchical structure. Second, scImpute requires the user to choose the number of cell groups K and assigns cells in the same group equal weights during imputation, whereas IImpute does not require such a hyperparameter K but instead builds a continuous affinity matrix using the Gaussian kernel. Last but not least, Lasso regression sets unimportant weights to zero, which helps to filter out distant cells from the regression.
Concerning hyperparameter tuning, the parameter t denotes the threshold on dropout probabilities. We have conducted experiments to guide the tuning. The result in Additional file 1, Fig. S2 suggests that the value of parameter t should not be too small, and t=0.5 is adequate as the default setting.
Conclusions
Imputation is an essential step in the use of scRNA-seq. In this work we introduced an imputation criterion called self-consistency and demonstrated the effectiveness of this criterion with an iterative imputation tool called IImpute. Experiments on simulated and real data sets showed IImpute to be effective in imputation and in the discovery of cell subpopulations.
Methods
CImpute
IImpute utilizes a subroutine called CImpute, which performs imputation with an objective function based on continuous similarity and Lasso penalty (see Fig. 1a). The following describes this subroutine.
Data preprocessing
The input of CImpute is a count matrix \({{{\dot {\boldsymbol {X}}}^{C}} \in M \times N_{total}}\) which contains rows as genes and columns as cells, where M and N_{total} represent the total number of genes and cells correspondingly. The dropout values are replaced by zero counts.
First, CImpute performs normalization, dimension reduction, and outlier removal as in scImpute [18]. This results in matrices X∈M×N and Z∈K×N, where K is the reduced dimensionality of metagenes and N is the number of remaining cells.
Affinity matrix construction
From Z, a cell affinity matrix A∈N×N is computed with Euclidean distance and Gaussian Kernel:
where i,j represent two different cell indices, \({\boldsymbol {Z}}^{\top }_{i}\) and \({\boldsymbol {Z}}^{\top }_{j}\) indicate the principal components of the ith and jth cell respectively, and ‖·‖_{F} is the Frobenius norm. For the ith cell, the kernel width is set to the distance between it and its nth nearest neighbor, cell k, that is, the cell whose distance to cell i is the nth smallest among all other cells, where n is a hyperparameter.
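As a rough illustration of the description above, the following sketch builds an affinity matrix with a per-cell kernel width. The exact kernel expression is not reproduced in the text here, so the form exp(−(d_ij/σ_i)²), with σ_i the distance from cell i to its nth nearest neighbor, is an assumption, as is the function name `affinity_matrix`.

```python
import math


def affinity_matrix(cells, n):
    """Gaussian-kernel affinities with an adaptive width per cell.

    cells: list of N coordinate vectors (one per cell, e.g. principal components).
    n:     which nearest neighbor sets the kernel width of each cell.
    Assumed kernel: A[i][j] = exp(-(d_ij / sigma_i) ** 2).
    """
    count = len(cells)
    if not 1 <= n <= count - 1:
        raise ValueError("n must be between 1 and N - 1")
    dist = [[math.dist(cells[i], cells[j]) for j in range(count)]
            for i in range(count)]
    affinity = [[0.0] * count for _ in range(count)]
    for i in range(count):
        # Distance to the n-th nearest other cell is the kernel width of cell i.
        width = sorted(dist[i][j] for j in range(count) if j != i)[n - 1]
        for j in range(count):
            if width > 0:
                affinity[i][j] = math.exp(-(dist[i][j] / width) ** 2)
            else:
                # Degenerate case: duplicates of cell i get affinity 1, others 0.
                affinity[i][j] = 1.0 if dist[i][j] == 0 else 0.0
    return affinity


cells = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
A = affinity_matrix(cells, n=1)
```

Because each row uses its own width, the resulting matrix is generally not symmetric: a cell in a sparse region assigns larger affinities to distant cells than a cell in a dense region does.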
Identification of dropout values and calculating dropout rate
With the preprocessed gene expression matrix X, we utilize a statistical model to infer which entries are influenced by dropout effects. Instead of treating all zero values as missing entries, we use the Gamma-Normal mixture model to learn whether a zero observation originates from dropout or not. We use the Normal distribution to represent the actual gene expression level and the Gamma distribution to account for the dropout events. Since the preprocessed matrix X no longer has integer values, we cannot adopt the zero-inflated negative binomial (ZINB) distribution.
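For the ith gene and an observed value x in its preprocessed expression profile X_{i}, the Gamma-Normal mixture model is: \(f_{X_{i}}(x) = \pi_{i}\,\mathrm{Gamma}\left(x; \alpha_{i}, \beta_{i}\right) + \left(1-\pi_{i}\right)\mathrm{Normal}\left(x; \mu_{i}, \sigma_{i}\right)\)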
where π_{i} is the dropout rate of gene i, α_{i} and β_{i} are the shape and rate parameters of the Gamma distribution respectively, and μ_{i} and σ_{i} are the mean and standard deviation of the Normal distribution. The estimated model parameters \(\hat {{\boldsymbol {\pi }}}, \hat {{\boldsymbol {\alpha }}}, \hat {{\boldsymbol {\beta }}}, \hat {{\boldsymbol {\mu }}}\), and \(\hat {{\boldsymbol {\sigma }}}\) are obtained by the Expectation-Maximization (EM) algorithm. Then, we can calculate the dropout probability matrix D∈M×N.
This mixture model enables the identification of whether an observed value is a dropout value or not, since a zero value can either be caused by a technical error or reflect the actual expression value. If a gene has high expression and low variation in most of its similar cells, a zero count will have a high dropout probability and is more likely to be a dropout value; otherwise, the zero value may reflect real biological variability [18].
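To make the role of the mixture components concrete, the sketch below computes a dropout probability for a single entry from already-estimated parameters. It assumes the usual Bayes-rule posterior of the Gamma component, π·Gamma / (π·Gamma + (1−π)·Normal); that formula is not shown in the text here, so treat it, the function names, and the parameter values as illustrative assumptions. The sketch also assumes strictly positive preprocessed values.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a Gamma distribution with the given shape and rate, for x > 0."""
    if x <= 0:
        raise ValueError("this sketch assumes strictly positive values")
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1) * math.log(x) - rate * x)
    return math.exp(log_pdf)


def normal_pdf(x, mean, sd):
    """Density of a Normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def dropout_probability(x, pi, shape, rate, mean, sd):
    """Posterior probability that x was generated by the Gamma (dropout) component."""
    dropout_part = pi * gamma_pdf(x, shape, rate)
    expressed_part = (1 - pi) * normal_pdf(x, mean, sd)
    total = dropout_part + expressed_part
    # If both densities underflow to zero, fall back to "not a dropout".
    return dropout_part / total if total > 0 else 0.0


# Hypothetical fitted parameters for one gene.
params = dict(pi=0.5, shape=0.5, rate=2.0, mean=3.0, sd=0.5)
near_zero = dropout_probability(0.01, **params)   # value close to zero
expressed = dropout_probability(3.0, **params)    # value at the Normal mean
```

With these illustrative parameters, a near-zero value is attributed almost entirely to the dropout component, while a value at the Normal mean is not.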
Imputation of dropout values
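To impute the gene expression levels, we first define a hyperparameter t which is used as the threshold to determine whether X_{ij} is a dropout event. An entry with dropout probability less than t is considered a real observation, in which case its value is retained. Otherwise, an entry with dropout probability higher than t is replaced by the imputation result. We perform imputation by linear regression weighted by dropout probability and cell affinity: \(\left (1-{\boldsymbol {D}}^{\top }_{j}\right)\circ {\boldsymbol {X}}^{\top }_{j} \approx \left (\left (1-{\boldsymbol {D}}_{\bar {j}}^{\top }\right)\circ \left ({\boldsymbol {A}}_{j\bar {j}} \odot {\boldsymbol {X}}^{\top }_{\bar {j}}\right)\right){\boldsymbol {B}}^{\top }_{j}\)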
where \({\boldsymbol {D}}^{\top }_{j}\) and \({\boldsymbol {X}}^{\top }_{j}\) are the jth columns of D and X respectively. The ∘ operator is the Hadamard product, (P∘Q)_{ij}=P_{ij}Q_{ij}. \(\bar {j}\) denotes all indices except index j, thus \({\boldsymbol {D}}^{\top }_{\bar {j}}\) and \({\boldsymbol {X}}^{\top }_{\bar {j}}\) denote the submatrices of D and X which contain all cells except the jth cell. \({\boldsymbol {A}}_{j\bar {j}}\) stores the pairwise affinities between the jth cell and all other cells. The ⊙ operator represents the vector-matrix multiplication (p⊙Q)_{ij}=p_{i}Q_{ij}. Using \(\left (1-{\boldsymbol {D}}^{\top }_{j}\right)\circ {\boldsymbol {X}}^{\top }_{j}\) as the target means that genes with high dropout probability in the jth cell do not contribute to the optimization. Furthermore, multiplying by \(\left (1-{\boldsymbol {D}}_{\bar {j}}^{\top }\right)\) and \({\boldsymbol {A}}_{j\bar {j}}\) ensures that information is only borrowed from trusted genes with low dropout probabilities in similar cells. The nonnegative weights \({\boldsymbol {B}}^{\top }_{j}\) are the contributions of all other cells learned from the regression.
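For the jth cell, the objective is: \(\min _{{\boldsymbol {B}}^{\top }_{j} \geq 0}\left \|\left (1-{\boldsymbol {D}}^{\top }_{j}\right)\circ {\boldsymbol {X}}^{\top }_{j} - \left (\left (1-{\boldsymbol {D}}_{\bar {j}}^{\top }\right)\circ \left ({\boldsymbol {A}}_{j\bar {j}} \odot {\boldsymbol {X}}^{\top }_{\bar {j}}\right)\right){\boldsymbol {B}}^{\top }_{j}\right \|_{2}^{2} + \lambda \left \|{\boldsymbol {B}}^{\top }_{j}\right \|_{1}\)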
An \(\mathcal {L}_{1}\) penalty is applied to avoid overfitting and to further ensure that the imputation borrows information from the cell’s most similar neighbors.
Let \(y = \left (1-{\boldsymbol {D}}^{\top }_{j}\right)\circ {\boldsymbol {X}}^{\top }_{j} \in \mathbb {R}^{M}\), \(\beta = {\boldsymbol {B}}_{j}^{\top } \in \mathbb {R}^{N}\), and \(X = \left (1-{\boldsymbol {D}}_{\bar {j}}^{\top }\right)\circ \left ({\boldsymbol {A}}_{j\bar {j}} \odot {\boldsymbol {X}}^{\top }_{\overline {j}}\right) \in \mathbb {R}^{M \times N}\). For each cell j, the objective then simplifies to the nonnegative lasso regression \(\min _{\beta }\left \|y - X\beta \right \|_{2}^{2} + \lambda \left \|\beta \right \|_{1},\ \beta \geq 0\), which we solve by coordinate descent [31].
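The nonnegative lasso problem above can be solved with a few lines of cyclic coordinate descent. The sketch below is a plain-Python illustration of that technique for the objective ‖y − Xβ‖² + λ‖β‖₁ with β ≥ 0; it is not the package's own solver, and the function name `nonnegative_lasso` and its defaults are assumptions. For a single coordinate k with the others fixed, the minimizer is max(0, (x_k·r_k − λ/2) / ‖x_k‖²), where r_k is the residual without column k.

```python
def nonnegative_lasso(X, y, lam, n_iter=200, tol=1e-10):
    """Minimize ||y - X b||_2^2 + lam * ||b||_1 subject to b >= 0.

    X: list of m rows, each with p values; y: list of m targets.
    Uses cyclic coordinate descent with a running residual.
    """
    m, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta for the initial beta = 0
    col_sq = [sum(X[i][k] ** 2 for i in range(m)) for k in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for k in range(p):
            if col_sq[k] == 0.0:
                continue  # an all-zero column cannot explain anything
            # Inner product of column k with the residual that excludes column k.
            rho = sum(X[i][k] * (residual[i] + X[i][k] * beta[k])
                      for i in range(m))
            updated = max(0.0, (rho - lam / 2.0) / col_sq[k])
            delta = updated - beta[k]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= X[i][k] * delta
                beta[k] = updated
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Orthogonal toy design: each coefficient can be checked by hand.
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [3.0, -1.0, 0.0]
beta = nonnegative_lasso(X, y, lam=2.0)
```

In this toy case the first weight is shrunk from 3 to 2 by the penalty, and the second is clipped to zero by the nonnegativity constraint, which is the mechanism that removes dissimilar cells from the regression.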
IImpute
As mentioned, IImpute performs a self-consistent imputation on scRNA-seq data. The method is illustrated in Fig. 1b. IImpute utilizes CImpute to iteratively refine SAVER-processed data. After a few iterations, the result converges to a self-consistent matrix (RMSE below θ), which is given as IImpute’s output.
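We define self-consistency of a functional mapping f:x→x given input data X∈M×N as: \(\mathrm {RMSE}\left ({\boldsymbol {X}}, f({\boldsymbol {X}})\right) \leq \theta \), where the RMSE is taken over all M×N entries.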
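The iterate-until-consistent idea can be sketched generically as a loop that reapplies an imputation function until the RMSE between its input and output drops to θ. In the sketch below, `iterate_until_self_consistent` and the toy `halfway_to_row_mean` imputer are hypothetical names introduced for illustration; a real run would pass a CImpute-like function and SAVER-processed data instead.

```python
import math


def rmse(a, b):
    """Root mean square error over all entries of two equally shaped matrices."""
    n_entries = sum(len(row) for row in a)
    squared = sum((x - y) ** 2
                  for row_a, row_b in zip(a, b)
                  for x, y in zip(row_a, row_b))
    return math.sqrt(squared / n_entries)


def iterate_until_self_consistent(matrix, impute, theta=0.1, max_iter=50):
    """Reapply `impute` until RMSE(x, impute(x)) <= theta or max_iter is reached.

    Returns the final matrix and the RMSE recorded at every iteration.
    """
    current = [list(row) for row in matrix]
    history = []
    for _ in range(max_iter):
        refined = impute(current)
        error = rmse(current, refined)
        history.append(error)
        current = refined
        if error <= theta:
            break
    return current, history


def halfway_to_row_mean(matrix):
    """Hypothetical toy imputer: move every entry halfway toward its row mean."""
    result = []
    for row in matrix:
        mean = sum(row) / len(row)
        result.append([(value + mean) / 2 for value in row])
    return result


start = [[0.0, 4.0], [2.0, 2.0]]
final, history = iterate_until_self_consistent(start, halfway_to_row_mean, theta=0.1)
```

Recording the per-iteration RMSE makes it easy to plot a convergence curve; the `max_iter` cap guards against imputers that never reach the threshold.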
Evaluation metrics
Adjusted Rand index and normalized mutual information
The adjusted Rand index (ARI) [32] and normalized mutual information (NMI) [33] are adopted as measures of clustering accuracy. They measure the similarity between a clustering result and the actual clusters. A value close to 0 indicates random labeling or no mutual information, and a value of 1 indicates 100% consistency between the clustering and the actual clusters.
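For readers who want to see how these two scores are obtained from a pair of labelings, here is a small standard-library sketch. The ARI follows the usual pair-counting formula on the contingency table; for NMI, the normalization by the arithmetic mean of the two entropies is an assumption, since the text does not state which normalization was used.

```python
import math
from collections import Counter


def _pairs(count):
    """Number of unordered pairs that can be drawn from `count` items."""
    return count * (count - 1) // 2


def adjusted_rand_index(true_labels, pred_labels):
    n = len(true_labels)
    if n < 2:
        raise ValueError("need at least two samples")
    joint = Counter(zip(true_labels, pred_labels))
    index = sum(_pairs(c) for c in joint.values())
    sum_true = sum(_pairs(c) for c in Counter(true_labels).values())
    sum_pred = sum(_pairs(c) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / _pairs(n)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # both labelings are trivial (one cluster or all singletons)
    return (index - expected) / (maximum - expected)


def normalized_mutual_information(true_labels, pred_labels):
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts, pred_counts = Counter(true_labels), Counter(pred_labels)
    mutual = sum((c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
                 for (t, p), c in joint.items())
    h_true = -sum((c / n) * math.log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in pred_counts.values())
    denominator = (h_true + h_pred) / 2  # assumed arithmetic-mean normalization
    return mutual / denominator if denominator > 0 else 1.0


truth = [0, 0, 1, 1, 2, 2]
relabelled = [1, 1, 0, 0, 2, 2]      # same partition, different label names
ari_same = adjusted_rand_index(truth, relabelled)
nmi_same = normalized_mutual_information(truth, relabelled)
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Both scores are invariant to renaming the clusters, which is why the relabelled partition still scores as a perfect match. Note that the ARI can also take negative values when agreement is worse than expected by chance.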
Silhouette width
The silhouette width (SW) measures the similarity of a sample to its own class compared to other classes [34]. It ranges from −1 to 1. A higher silhouette value suggests a more appropriate clustering. A silhouette value near 0 indicates overlapping clusters, and a negative value indicates that the clustering has been performed incorrectly. We adopted the silhouette width to evaluate the model’s imputation power. We used the ground-truth subtype classes as the input cluster labels.
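The average silhouette width can be computed directly from its definition, as in the following sketch. It uses Euclidean distance and assigns a score of 0 to samples that are alone in their cluster; both choices are assumptions made for this illustration, as is the function name `silhouette_width`.

```python
import math


def silhouette_width(points, labels):
    """Average silhouette value of `points` under the given cluster `labels`."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette_width(points, [0, 0, 1, 1])
badly_assigned = silhouette_width(points, [0, 1, 0, 1])
```

Labels that match the two spatial groups give a value close to 1, while labels that cut across the groups give a negative value, matching the interpretation given above.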
Simulation and benchmark settings
Splatter is used to generate simulated scRNA-seq data. The parameters used for our simulation dataset are nGroups=3, nGenes=2000, batchCells=150, seeds=42, dropout.type=“experiment”, dropout.shape=1 and dropout.mid=2, 3, 5 for the three different dropout rates.
SAVER and scImpute are the state-of-the-art tools which IImpute is compared against. For the SAVER R package, we used the “saver” function with the parameters ncores=12 and estimates.only=TRUE to perform the imputation tasks. The parameters for scImpute are drop_thre=0.5, ncores=10, Kclusters=(number of true clusters in input data).
On synthetic data, IImpute configuration is n=40, normalize=False, and iteration=True. On real data sets, IImpute configuration is n=40, and iteration=True when tested with the mouse Bladder cell dataset and ES cell dataset, and is n=20, and iteration=True when tested with the mouse Aortic Leukocyte cell dataset.
Availability of data and materials
The real scRNA-seq datasets analysed during the current study are all publicly available. The mouse ES cell dataset [35] was downloaded from the Gene Expression Omnibus (GEO) with the accession code GSE65525. The mouse Bladder cell dataset and Aortic Leukocyte cell dataset were downloaded from PanglaoDB [36] with the accession codes SRS3044239 and SRS2747908, respectively. The Python package IImpute is freely available at https://github.com/xikanfeng2/IImpute.
Abbreviations
scRNA-seq: Single-cell RNA-sequencing
ARI: Adjusted Rand Index
NMI: Normalized Mutual Information
SW: Silhouette Width
RMSE: Root Mean Square Error
References
1. McDavid A, Finak G, Chattopadyay PK, Dominguez M, Lamoreaux L, Ma SS, Roederer M, Gottardo R. Data exploration, quality control and testing in single-cell qpcr-based gene expression experiments. Bioinformatics. 2012; 29(4):461–7.
2. Saliba AE, Westermann AJ, Gorski SA, Vogel J. Single-cell rna-seq: advances and future challenges. Nucleic Acids Res. 2014; 42(14):8845–60.
3. Vallejos CA, Marioni JC, Richardson S. Basics: Bayesian analysis of single-cell sequencing data. PLoS Comput Biol. 2015; 11(6):1004333.
4. Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA. The technology and biology of single-cell rna sequencing. Mol Cell. 2015; 58(4):610–20.
5. Liu S, Trapnell C. Single-cell transcriptome sequencing: recent advances and remaining challenges. F1000Research. 2016; 5:182.
6. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014; 32(4):381.
7. Liu Z, Lou H, Xie K, Wang H, Chen N, Aparicio OM, Zhang MQ, Jiang R, Chen T. Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nat Commun. 2017; 8(1):22.
8. Horning AM, Wang Y, Lin CK, Louie AD, Jadhav RR, Hung CN, Wang CM, Lin CL, Kirma NB, Liss MA, et al. Single-cell rna-seq reveals a subpopulation of prostate cancer cells with enhanced cell-cycle–related transcription and attenuated androgen response. Cancer Res. 2018; 78(4):853–64.
9. Baruch K, Deczkowska A, Rosenzweig N, Tsitsou-Kampeli A, Sharif AM, Matcovitch-Natan O, Kertser A, David E, Amit I, Schwartz M. Pd-1 immune checkpoint blockade reduces pathology and improves memory in mouse models of alzheimer’s disease. Nat Med. 2016; 22(2):135.
10. Segerstolpe Å, Palasantza A, Eliasson P, Andersson EM, Andréasson AC, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016; 24(4):593–607.
11. Lawlor N, George J, Bolisetty M, Kursawe R, Sun L, Sivakamasundari V, Kycia I, Robson P, Stitzel ML. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. Genome Res. 2017; 27(2):208–22.
12. Chung W, Eum HH, Lee HO, Lee KM, Lee HB, Kim KT, Ryu HS, Kim S, Lee JE, Park YH, et al. Single-cell rna-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun. 2017; 8:15081.
13. Karaayvaz M, Cristea S, Gillespie SM, Patel AP, Mylvaganam R, Luo CC, Specht MC, Bernstein BE, Michor F, Ellisen LW. Unravelling subclonal heterogeneity and aggressive disease states in tnbc through single-cell rna-seq. Nat Commun. 2018; 9(1):3588.
14. Guo X, Zhang Y, Zheng L, Zheng C, Song J, Zhang Q, Kang B, Liu Z, Jin L, Xing R, et al. Global characterization of t cells in non-small-cell lung cancer by single-cell sequencing. Nat Med. 2018; 24(7):978.
15. Kim C, Gao R, Sei E, Brandt R, Hartman J, Hatschek T, Crosetto N, Foukakis T, Navin NE. Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing. Cell. 2018; 173(4):879–93.
16. Bartoschek M, Oskolkov N, Bocci M, Lövrot J, Larsson C, Sommarin M, Madsen CD, Lindgren D, Pekar G, Karlsson G, et al. Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell rna sequencing. Nat Commun. 2018; 9(1):5150.
17. Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Methods. 2014; 11(7):740.
18. Li WV, Li JJ. An accurate and robust imputation method scimpute for single-cell rna-seq data. Nat Commun. 2018; 9(1):997.
19. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015; 31(12):1974–80.
20. Lin P, Troup M, Ho JW. Cidr: Ultrafast and accurate clustering through imputation for single-cell rna-seq data. Genome Biol. 2017; 18(1):59.
21. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015; 33(5):495.
22. Pierson E, Yau C. Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015; 16(1):241.
23. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, Zhang NR. Saver: gene expression recovery for single-cell rna sequencing. Nat Methods. 2018; 15(7):539.
24. Deng Y, Bao F, Dai Q, Wu LF, Altschuler SJ. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods. 2019; 16(4):311.
25. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018; 15(12):1053.
26. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell rna-seq denoising using a deep count autoencoder. Nat Commun. 2019; 10(1):390.
27. Van Buuren S, Van Rijckevorsel JL. Imputation of missing categorical data by maximizing internal consistency. Psychometrika. 1992; 57(4):567–80.
28. Liang F, Jia B, Xue J, Li Q, Luo Y. An imputation–regularized optimization algorithm for high dimensional missing data problems and beyond. J R Stat Soc Ser B Stat Methodol. 2018; 80(5):899–926.
29. Wang Y, Hoinka J, Przytycka TM. Subpopulation detection and their comparative analysis across single-cell experiments with scpopcorn. Cell Syst. 2019; 8:506–13.
30. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell rna sequencing data. Genome Biol. 2017; 18(1):174.
31. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33(1):1.
32. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971; 66(336):846–50.
33. Cover TM, Thomas JA. Elements of Information Theory, vol. 68. New York: Wiley; 1991, pp. 69–73.
34. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987; 20:53–65.
35. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015; 161(5):1187–201.
36. Franzén O, Gan LM, Björkegren JL. Panglaodb: a web server for exploration of mouse and human single-cell rna sequencing data. Database. 2019; 2019:baz046.
Acknowledgements
We would like to express sincere gratitude to Yen Kaow Ng from Kotai Biotechnologies, Japan for manuscript revision.
About this supplement
This article has been published as part of BMC Genomics Volume 21 Supplement 10, 2020: Selected articles from the 18th Asia Pacific Bioinformatics Conference (APBC 2020): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume21supplement10.
Funding
This work and publication costs are funded by the GRF Research Projects 9042348 (CityU 11257316). The funding body did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Contributions
SCL conceived the idea and supervised the project. XF, LC, ZW, and SCL discussed the algorithm and designed the experiments. XF implemented the code and conducted the analysis. LC and XF drafted the manuscript. ZW and SCL revised the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
The PDF file includes all the supporting materials for the manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Feng, X., Chen, L., Wang, Z. et al. IImpute: a selfconsistent method to impute single cell RNA sequencing data. BMC Genomics 21, 618 (2020). https://doi.org/10.1186/s1286402007007w
Keywords
 scRNA-seq
 Imputation
 Self-consistency
 Cell subpopulation identification