Selected articles from the 16th International Symposium on Bioinformatics Research and Applications (ISBRA20): genomics
A new and effective two-step clustering approach for single cell RNA sequencing data
BMC Genomics volume 23, Article number: 864 (2022)
Abstract
Background
The rapid development of single cell RNA sequencing (scRNA-seq) technology has produced huge amounts of scRNA-seq data, which greatly advance research in many biomedical fields involving tissue heterogeneity, pathogenesis of disease and drug resistance. One major task in scRNA-seq data analysis is to cluster cells according to their expression characteristics. A number of methods have been proposed to infer cell clusters, yet there is still much room to improve their performance.
Results
In this paper, we develop a new two-step clustering approach for scRNA-seq data, called TSC (short for Two-Step Clustering). Specifically, we divide all cells into two types: core cells (those possibly lying around the centers of clusters) and non-core cells (those located in the boundary areas of clusters). We first cluster the core cells by hierarchical clustering (the first step) and then assign the non-core cells to their nearest clusters (the second step). Extensive experiments on 12 real scRNA-seq datasets show that TSC outperforms the state-of-the-art methods.
Conclusion
TSC is an effective clustering method owing to its two-step clustering strategy, and it is a useful tool for scRNA-seq data analysis.
Background
As the basic structural and functional units of all known organisms, cells vary broadly in types and states [1]. Assessing cell-to-cell variability in expression is crucial for disentangling heterogeneous tissues and understanding dynamic biological processes [2]. In traditional sequencing, gene expression is measured over a bulk of cells, so it is hard to study the heterogeneity of cells and characterize rare cell types such as stem cells and cancer cells [3]. Encouragingly, the recent breakthrough in single cell RNA sequencing (scRNA-seq) enables us to screen heterogeneous cells [4, 5].
One important task in scRNA-seq data analysis is to infer the categories of cells, which is crucial for elucidating cell types and understanding cell functions. Clustering is a widely used solution to this task. However, the high noise level, dropout events (i.e. expressed genes that fail to be detected) and high dimensionality of scRNA-seq data complicate this task [6]. So far, a number of clustering methods have been developed for scRNA-seq data. For example, Prabhakaran et al. proposed the BISCUIT method, which clusters scRNA-seq data by incorporating parameters of technical variation into a Hierarchical Dirichlet Process mixture model [7]. Lin et al. developed an ultrafast algorithm, CIDR, that takes dropout events into account with a simple implicit imputation approach [8]. By combining multiple clustering solutions, the consensus clustering approach SC3 was designed to cluster scRNA-seq data [9]. DIMM-SC, which is based on the Dirichlet mixture model, was specifically proposed for processing droplet-based scRNA-seq data [10]. To handle the challenge of high dimensionality in scRNA-seq data, dimension reduction techniques have been widely used. For example, pcaReduce integrates principal component analysis (PCA) with an agglomerative clustering method [11]. Shao et al. adapted non-negative matrix factorization (NMF) to identify subpopulations in scRNA-seq data and showed that NMF outperforms PCA in accuracy and robustness [12]. CellTree applies latent Dirichlet allocation (LDA) and produces the tree structure of single cells [13]. As shared nearest neighbor (SNN) similarity has been shown to be more stable and robust for high-dimensional data than traditional distance metrics, Xu and Su proposed SNN-Cliq, which identifies clusters by a quasi-clique-based clustering algorithm on an SNN graph [14], while the Seurat method finds clusters of cells by a modularity optimization-based clustering algorithm on an SNN graph [15].
Other methods such as GiniClust and RaceID were developed for the specific clustering task of rare cell type detection [16, 17]. Recently, deep learning-based methods such as scVI and SAUCIE were proposed to analyze scRNA-seq data [18, 19].
Although significant progress has been made in clustering scRNA-seq data, existing clustering methods still suffer from various limitations and there is much room to improve clustering accuracy. Most existing methods require the number of output clusters to be specified in advance, which is impractical or even impossible for complex and large-scale datasets. Some methods, such as probability model-based or deep learning-based methods, are sensitive to parameters and difficult to apply in practice. Graph theory-based approaches usually use sparse SNN graphs, which tend to produce an excessive number of subgraphs, resulting in low clustering accuracy. In summary, the rapid growth of scRNA-seq data and the drawbacks of existing methods call for novel scRNA-seq data clustering solutions.
In this paper, we propose a new and effective approach for scRNA-seq data clustering. It is a two-step clustering method called TSC (short for Two-Step Clustering). We first split all cells into core cells, which are closely connected with their neighbors and possibly lie around the centers of the underlying clusters, and non-core cells, which are less closely connected with their neighbors and are possibly located in the boundary areas of the clusters. We then group the core cells by hierarchical clustering (the first step) and assign the non-core cells to their nearest clusters (the second step).
Technically, our method has the following features: 1) We employ a “two-step clustering” strategy, which clusters core cells and non-core cells separately and thus alleviates the negative impact of non-core cells (or boundary cells) on clustering accuracy. 2) In data preprocessing, we propose the right-skewed coefficient (RSC) to measure the degree of right-skewness in scRNA-seq data, and use RSC to decide whether or not to apply Log-transformation to the data. 3) We apply random walk to represent the relationship between cells and define the random walk distance, which is used in hierarchical clustering of scRNA-seq data. 4) To generate a reliable cell graph, we consider five similarity/distance metrics, including three distance metrics and two correlation metrics. 5) We adopt an effective criterion to automatically determine the number of clusters to generate.
To evaluate the proposed method, we conduct extensive experiments on 12 real scRNA-seq datasets. Our experimental results show that the proposed method outperforms several state-of-the-art methods in clustering scRNA-seq data.
Results
In this section, we evaluate TSC on clustering scRNA-seq data. First, we introduce 12 publicly available scRNA-seq datasets and the clustering evaluation metric. Second, we compare the effects of the similarity/distance metrics used in TSC on clustering accuracy. Third, we compare the clustering results of TSC with those of other methods. Fourth, we present the advantage of two-step clustering. Finally, we discuss the effectiveness of Log-transformation.
Datasets and performance metric
We collected twelve real and publicly available scRNA-seq datasets from published papers. These datasets mainly contain scRNA-seq data about different cell types of mouse embryos, mouse cortex and mouse distal lung epithelium. The datasets have been widely used in evaluating existing scRNA-seq data clustering methods.
Table 1 presents the statistical information of these datasets, including the numbers of cells, clusters and genes and their sequencing protocols. Datasets are named by the accession numbers provided in the original publications. These datasets range in size from dozens to thousands of cells, with more than 14,000 genes/transcripts. The number of cell types varies from 3 to 14. Units of gene/transcript levels include FPKM (Fragments Per Kilobase of exon model per Million mapped reads), CPM (Counts of exon model per Million mapped reads) and UMI (Unique Molecule Identifier). Specifically, UMI uses a direct measurement of transcript copies for each transcript [20], while FPKM and CPM normalize the raw read counts based on sequencing depth and gene length. In addition, these scRNA-seq data were generated on several representative sequencing platforms, such as Smart-seq [21], SMARTer [22], Smart-Seq2 [23, 24] and inDrop [25].
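In our experiments, we use the Adjusted Rand Index (ARI) to measure the clustering performance. Given the ground truth class assignments \(labels\_true\) and the predicted class assignments \(labels\_predict\), ARI measures the similarity of these two assignments [32]. Concretely, the overlap between two assignments can be summarized as a contingency table, which reports the intersection cardinality of each true-predicted cluster pair. ARI is calculated as follows:
\[ARI = \frac{\sum_{ij}\binom{t_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{m}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2}+\sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{m}{2}}\]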
where m is the total number of cells in the dataset, \(t_{ij}\) is the value in the \(i^{th}\) row and the \(j^{th}\) column of the contingency table, \(a_i\) is the sum of the \(i^{th}\) row of the contingency table, \(b_j\) is the sum of the \(j^{th}\) column of the contingency table, and \(\binom{\cdot }{\cdot }\) denotes a binomial coefficient. ARI ranges from \(-1\) to 1, where a negative value means mismatch and ‘1’ indicates a perfect match. Three other commonly used clustering performance evaluation metrics are also applied in this paper: Normalized Mutual Information (NMI) [33], Adjusted Mutual Information (AMI) [34] and Accuracy (Acc) [35].
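For illustration, ARI can be computed directly from the contingency counts defined above. The following is a minimal Python sketch using only the standard library; the function name `adjusted_rand_index` and the handling of degenerate inputs (fewer than two cells, or identical trivial partitions) are our own assumptions and are not taken from the TSC source code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_predict):
    """ARI from the contingency table of two label assignments."""
    m = len(labels_true)
    if m != len(labels_predict):
        raise ValueError("both assignments must cover the same cells")
    if m < 2:
        return 1.0  # assumption: degenerate input counts as a perfect match

    # t_ij: contingency counts; a_i: row sums; b_j: column sums
    t = Counter(zip(labels_true, labels_predict))
    a = Counter(labels_true)
    b = Counter(labels_predict)

    sum_t = sum(comb(v, 2) for v in t.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())

    expected = sum_a * sum_b / comb(m, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # assumption: both partitions are trivial and identical
    return (sum_t - expected) / (max_index - expected)
```

Because ARI depends only on co-membership of cell pairs, relabelling the clusters (for example swapping the labels 0 and 1) leaves the score unchanged.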
Comparison among different similarity/distance metrics
Here we compare the performance when using the five different similarity/distance metrics: ED (Euclidean distance), MD (Manhattan distance), PCC (Pearson correlation coefficient), SCC (Spearman correlation coefficient) and SNN (shared nearest neighbors). We denote the methods using these metrics as TSC\(_{ED}\), TSC\(_{MD}\), TSC\(_{PCC}\), TSC\(_{SCC}\) and TSC\(_{SNN}\), respectively.
Figure 1 shows the ARI results on the 12 datasets. We can see that TSC\(_{SCC}\) achieves the best results on the first four datasets, and TSC\(_{PCC}\) performs best on the last eight datasets. Their average ARI values over the 12 datasets are 0.62 and 0.79 respectively, larger than those of the other three metrics. Overall, TSC\(_{ED}\) and TSC\(_{MD}\) are in the middle, and TSC\(_{SNN}\) performs worst. So in the remaining experiments, we consider only TSC\(_{SCC}\) and TSC\(_{PCC}\).
Comparison with existing methods
Here, we compare our method with six existing methods: SC3 [9], CIDR [8], SINCERA [36], pcaReduce [11], Seurat [15] and SNN-Cliq [14]. They represent the state of the art of scRNA-seq data clustering [37, 38]. In addition, we also applied spectral clustering (a classical graph-based clustering method) to the scRNA-seq data. The ARI results are illustrated in Fig. 2, where the value in the parentheses following each method’s name in the legend is the average ARI over the 12 datasets.
From Fig. 2, we can see that TSC\(_{PCC}\) outperforms the others on 8 of the 12 datasets, and TSC\(_{SCC}\) performs best on 4 of the 12 datasets. They achieve average ARI values of 0.79 and 0.62 over the 12 datasets respectively, which are much higher than those of the 6 existing methods. This result validates the advantage of our method over the existing ones. Among the existing methods, SINCERA performs best on average, followed by Seurat, CIDR, SC3, pcaReduce and spectral clustering. SNN-Cliq performs worst. The results of the other three clustering performance metrics show trends similar to those of ARI and are presented in Additional file 1: Table S1.
Advantage of two-step clustering
Our TSC method adopts a “two-step clustering” strategy. To further demonstrate its advantage, here we compare the performance of the “two-step clustering” strategy with that of a “one-step clustering” strategy. In the “one-step clustering” strategy, we do not split cells into core cells and non-core cells; instead, we directly cluster all cells. Note that the “one-step clustering” strategy uses similar data processing, random walk distance and hierarchical clustering as the “two-step clustering” strategy. Both use PCC in graph construction for random walk. The results are presented in Table 2. Here, the 2nd column (“ARI1Step”) presents the ARI results of “one-step clustering”. The 3rd and 4th columns give the ARI results of TSC\(_{PCC}\): the former (“ARI2Stepscore”) is the ARI computed only on core cells, and the latter (“ARI2Steps”) is the ARI computed on all cells.
From Table 2, we can see that our “two-step clustering” strategy is more effective than the “one-step clustering” strategy on 10 of the 12 datasets. On average, the ARI of our method is 28% higher than that of the “one-step clustering” strategy. Furthermore, by comparing the “ARI2Stepscore” and “ARI2Steps” results over the 12 datasets, we find that “ARI2Stepscore” is higher than “ARI2Steps” on all 12 datasets. This is consistent with our expectation that core cells are easier to cluster than non-core cells.
Effectiveness of Log-transformation
TSC decides whether or not to perform Log-transformation during data preprocessing, using RSC as the criterion. To evaluate the effectiveness of RSC, in Table 3 we present the RSC values and the corresponding ARI values of TSC\(_{PCC}\) on the 12 scRNA-seq datasets. The 3rd and 4th columns give the ARI values of TSC\(_{PCC}\) without and with Log-transformation, respectively.
As shown in Table 3, the first five datasets (from E-MTAB-3321 to GSE71585) have relatively large RSC (\(> 0.80\)), and their ARI values with Log-transformation are much larger than those without Log-transformation. In contrast, the other seven datasets have relatively small RSC (\(<0.5\)), and their ARI values without Log-transformation are much larger than those with Log-transformation.
In summary, from Table 3 we can conclude that 1) RSC is effective in correctly deciding whether or not to perform Log-transformation; 2) when Log-transformation is properly performed according to our RSC criterion, a significant improvement in ARI can be achieved.
Effects of parameters in TSC
To select core cells, we adopt a threshold to filter the edges of the fully connected graph. Here, we check the clustering performance of TSC under four cases, i.e., keeping 25%, 50%, 75% and 100% of the edges in the fully connected graph. From the results shown in Additional file 2: Fig. S1, we can see that TSC achieves the best clustering accuracy on the twelve datasets when keeping 25% of the edges in the fully connected graph.
To calculate the distance between cells, we perform random walk on the cell graph, in which the walk length (parameter t, the number of steps) plays a key role in evaluating the similarity between cells. Here, we analyze the effect of parameter t on the clustering performance of TSC. Concretely, we evaluate the robustness of TSC to t by varying t from 2 to 15 and measuring the clustering performance by ARI; the results are shown in Additional file 3: Fig. S2. We can see that TSC has relatively stable ARI as t increases from 2 to 15 on most of the datasets, and that setting t to 4 or 6 gives better performance.
Discussion
scRNA-seq clustering is one of the most direct and effective ways to identify novel cell types and characterize heterogeneous cell populations. Here, we introduce TSC, a novel two-step clustering method, to improve the clustering accuracy. To create a graph for core cells, we considered five different similarity/distance metrics. However, each metric has its own advantages, and a single metric is not sufficient to measure the similarity between cells. In future work, we will try to improve cell graph construction by integrating multiple similarity/distance measurements to make the graphs more reliable and thus further boost clustering performance. In addition, considering that deep learning is effective in processing big data, we will explore new deep learning models for effectively clustering scRNA-seq data. Last but not least, considering that annotated scRNA-seq data are much scarcer than raw data without annotations, we also intend to extend our TSC framework to large datasets by exploring semi-supervised strategies.
Conclusion
In this paper, we develop a new and effective scRNA-seq data clustering method, TSC, which adopts a two-step clustering strategy. All cells are first split into core cells and non-core cells. Then, the core cells are clustered by hierarchical clustering with random walk distance, and the non-core cells are finally assigned to the clusters according to their distances to these clusters. With the two-step clustering strategy, TSC first ensures accurate clustering of the core cells and thereby improves the overall accuracy. In addition, TSC does not require the number of clusters to be specified, but determines it automatically. Moreover, we design the RSC criterion to determine whether or not to perform Log-transformation on the data before clustering. Extensive experiments on 12 real datasets show that the proposed method outperforms the state-of-the-art methods in scRNA-seq data clustering analysis. Our experiments also show that 1) the two-step clustering strategy is much better than the one-step clustering strategy (directly clustering all cells); 2) the proposed RSC criterion is effective in deciding whether or not to perform Log-transformation on scRNA-seq data; 3) PCC and SCC are more effective in constructing cell graphs for clustering than the other three metrics ED, MD and SNN.
Methods
In this section, we describe the TSC method in detail. Figure 3 illustrates the pipeline of TSC, which consists of five major steps: 1) Data preprocessing; 2) Selecting core cells; 3) Calculating distance between core cells by random walk; 4) Grouping core cells by hierarchical clustering (the first clustering step); 5) Assigning the remaining non-core cells to the corresponding nearest clusters (the second clustering step).
In what follows, we give the technical details of each step above.
Data preprocessing
Since features with an excessive number of zero values are not informative for clustering, we first remove genes/transcripts that are expressed (expression value > 0) in less than 2% of cells. A small change to this percentage threshold does not significantly affect the clustering result [9].
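The gene filter described above can be sketched as follows. This is a minimal illustration under our own assumptions: the expression matrix is a list of cells, each a list of per-gene values, and the names `filter_genes` and `min_fraction` are hypothetical rather than taken from the TSC code.

```python
def filter_genes(expr, min_fraction=0.02):
    """Keep genes expressed (value > 0) in at least `min_fraction` of cells.

    expr: list of cells, each a list of expression values (one per gene).
    Returns the indices of the kept genes and the filtered matrix.
    """
    n_cells = len(expr)
    n_genes = len(expr[0]) if n_cells else 0
    keep = []
    for j in range(n_genes):
        n_expressed = sum(1 for cell in expr if cell[j] > 0)
        # genes expressed in less than 2% of cells are removed
        if n_expressed >= min_fraction * n_cells:
            keep.append(j)
    filtered = [[cell[j] for j in keep] for cell in expr]
    return keep, filtered
```

Returning the kept indices alongside the filtered matrix makes it easy to map downstream results back to gene names.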
In scRNA-seq data, the expression levels of different genes vary greatly, which leads to a right-skewed distribution, i.e., the mean is greater than the median. Thus, the similarity or distance between cells would be largely determined by the genes with large values. Many scRNA-seq clustering approaches employ Log-transformation to handle right-skewed distributions. However, it is improper to perform Log-transformation on data that are not right-skewed, because the differences between genes would be distorted. To solve this problem, we define a right-skewed coefficient (RSC) to measure the degree of right-skewness of the data as follows:
where \(g_i^{max}\) is the maximum expression value of gene i, \(\mu\) is the average of all genes’ maximum values, and l is the number of genes whose maximum expression values are larger than \(\mu\). RSC indicates the average deviation of the data points that lie to the right of the mean. The larger the RSC, the more right-skewed the data. In this paper, when RSC is greater than 0.8, we consider the data heavily right-skewed and perform Log-transformation. To eliminate the effect of outliers, we remove genes whose maximum values do not fall in [Q1 - 1.5*IQR, Q3 + 1.5*IQR] before computing RSC [39]. Here, Q1 and Q3 are the first and the third quartiles of all genes’ maximum values, and the interquartile range (IQR) is Q3 - Q1.
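The outlier-removal step applied to the per-gene maximum values can be illustrated with a short Python sketch. The quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption on our part, since the text does not state how Q1 and Q3 are estimated, and the function name `iqr_filter` is hypothetical.

```python
from statistics import quantiles


def iqr_filter(gene_max_values):
    """Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    gene_max_values: one maximum expression value per gene.
    """
    # assumption: quartiles estimated with the "inclusive" interpolation method
    q1, _, q3 = quantiles(gene_max_values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in gene_max_values if lower <= v <= upper]
```

For example, with maxima `[1, 2, 3, 4, 5, 100]` the fences are -1.5 and 8.5, so only the value 100 is discarded before RSC is evaluated on the remaining genes.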
Selecting core cells
Given an scRNA-seq dataset, we find the core cells by first constructing a fully-connected weighted graph \(G_c\), where each node corresponds to a cell and each edge weight represents the similarity between the two respective cells.
The similarity between two cells can also be evaluated as 1 minus their distance when the distance is normalized into [0, 1], so we can treat similarity and distance equally. We consider five similarity/distance measures: Euclidean distance (ED), Manhattan distance (MD), Pearson correlation coefficient (PCC), Spearman correlation coefficient (SCC) and shared nearest neighbors (SNN) [40]. ED and MD are commonly used distance measurements. PCC and SCC range from \(-1\) to 1; we use only the positive values. SNN is also called second-order distance, which measures the similarity between two samples based on their shared neighbors.
Then, we set a similarity threshold \(s_c\). In the graph \(G_c\), we discard all the edges whose weights are less than \(s_c\). The remaining edges and the nodes connected by any of these remaining edges form a new graph \(G_{cc}\). We call the nodes in \(G_{cc}\) core nodes, as they are relatively close to their neighbors and possibly lie around the centers of the underlying cell clusters. Thus, the cells corresponding to the core nodes are core cells, and we call \(G_{cc}\) the core-cell graph. We call the remaining nodes non-core nodes, and the corresponding cells non-core cells. Non-core nodes are not close to their neighbors, as the similarity values between them and their neighbors are less than \(s_c\), so they may be located in the boundary areas of the underlying clusters.
As a rule of thumb, we choose \(s_c\) such that the number of edges in \(G_{cc}\) is around 25% of the total number of edges in \(G_c\).
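The thresholding rule above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: `sim` is a symmetric similarity matrix given as nested lists, ties at the threshold are all kept (so slightly more than 25% of the edges may survive), and the names `select_core_cells` and `keep_fraction` are ours, not those of the TSC implementation.

```python
def select_core_cells(sim, keep_fraction=0.25):
    """Pick s_c so that about `keep_fraction` of the edges of the fully
    connected graph survive, then split cells into core and non-core.

    sim: symmetric similarity matrix (list of lists); the diagonal is ignored.
    Returns (s_c, core_cells, noncore_cells) with cells given by index.
    """
    n = len(sim)
    edges = [(sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    if not edges:
        return None, [], list(range(n))

    weights = sorted((w for w, _, _ in edges), reverse=True)
    n_keep = max(1, int(keep_fraction * len(weights)))
    s_c = weights[n_keep - 1]  # smallest weight among the kept edges

    core = set()
    for w, i, j in edges:
        if w >= s_c:  # edges with weight below s_c are discarded
            core.update((i, j))

    core_cells = sorted(core)
    noncore_cells = [i for i in range(n) if i not in core]
    return s_c, core_cells, noncore_cells
```

A cell becomes a core cell as soon as one of its edges survives the threshold, which matches the definition of \(G_{cc}\) as the graph formed by the remaining edges and their endpoints.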
Calculating distance between core cells by random walk
To calculate the distance between any two core cells, we perform random walk on the core-cell graph \(G_{cc}\) constructed above. The random walk process is as follows. The transition matrix M is given by \(M_{ij}=\frac{w_{ij}}{Deg(i)}\) with \({Deg(i)}=\sum _{j=1}^{n_i}w_{ij}\), where \(n_i\) is the number of neighbors of cell i and \(w_{ij}\) is the similarity between cell i and cell j. Suppose there are n nodes in \(G_{cc}\). If a walker starts from node (or cell) i, the initial probability \(P_{i.}^0\) is set as an n-dimensional vector with only the \(i^{th}\) entry being 1 and the others being 0. As the walker moves on the graph, the probability vector is updated according to \(P^{t+1}={M^T}*P^t\), where \(P_{ij}^{t}\) is the probability of the walker going from node i to node j in t steps. It has been shown that as t tends to infinity, the probability \(P_{ij}^{t}\) depends only on the degree of node j. Therefore, the choice of t is crucial: a walk that is too short does not capture enough of the graph’s topological information, while one that is too long converges to the stationary distribution. In our experiments, we set t = 4, as empirically advised by a previous study [41].
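The update rule above can be made concrete with a small Python sketch that computes the t-step walking probabilities for every starting cell. It assumes a dense weight matrix `w` for \(G_{cc}\) (with 0 for discarded edges and on the diagonal); the function names are hypothetical and the pure-Python loops are chosen for clarity, not speed.

```python
def transition_matrix(w):
    """M_ij = w_ij / Deg(i), where Deg(i) is the sum of row i of w."""
    n = len(w)
    m = []
    for i in range(n):
        deg = sum(w[i])
        m.append([w[i][j] / deg if deg > 0 else 0.0 for j in range(n)])
    return m


def walk_probabilities(w, t=4):
    """Return P where P[i][j] is the probability of reaching j from i in t steps."""
    n = len(w)
    m = transition_matrix(w)
    probs = []
    for i in range(n):
        # P^0: all probability mass on the starting node i
        p = [1.0 if k == i else 0.0 for k in range(n)]
        for _ in range(t):
            # P^{t+1} = M^T * P^t, i.e. p_j <- sum_k M_kj * p_k
            p = [sum(m[k][j] * p[k] for k in range(n)) for j in range(n)]
        probs.append(p)
    return probs
```

Each row of the result is the walking-probability vector of one core cell; every row sums to 1 as long as every node has at least one edge.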
For cell i, we can obtain a vector of walking probability starting from it. The random walk distance \(d_{ij}\) between cell i and cell j is defined as below:
Grouping core cells by hierarchical clustering
We employ bottom-up hierarchical clustering to cluster the core cells. That is, we first treat each core cell as a cluster, and then merge the nearest cluster pairs iteratively. The distance between two cells is calculated by Eq. (3). The distance \(d_{Ck}\) between cell k and cluster C and the distance \(d_{C_iC_j}\) between cluster \(C_i\) and cluster \(C_j\) are defined as follows:
where \(\left| C \right|\) indicates the number of cells in cluster C. One important issue in hierarchical clustering is the criterion for selecting the two clusters to merge each time. Here, we adopt the strategy from Ward’s method [42]. The change of the average intra-cluster distance before and after the merging of cluster \(C_i\) and cluster \(C_j\) is evaluated as follows:
where \(C_u=C_i \cup C_j\). We select the two clusters with the smallest value of \(\Delta \sigma\) to merge each time.
Another important issue is to determine the number of clusters to be generated, for which we use the criterion introduced in [41]. First, we evaluate the average intra-cluster distance \(\sigma _K\) of K clusters as follows:
where \(C_{k}\) denotes the \(k^{th}\) cluster. Then, we calculate the change of the average intra-cluster distance when the number of clusters increases from K to \(K+1\) by
The optimal number K of clusters is that with the maximum value of \(\eta _{K}\).
Assigning the non-core cells
After clustering the core cells, we get K clusters. To assign the non-core cells to the generated clusters, we first evaluate the center of each cluster as follows:
\[c_{kj} = \frac{1}{\left| \chi _k \right|}\sum _{c \in \chi _k} x_{cj}\]
where \(c_{kj}\) is the value in the \(j^{th}\) dimension of the center vector of cluster k, \(x_{cj}\) is the expression value of the \(j^{th}\) gene of core cell c, \(\chi _k\) is the set of core cells in cluster k and \(\left| \chi _k \right|\) indicates the number of core cells in cluster k.
For each non-core cell, we then calculate its distance to the center of each cluster, and assign it to the cluster whose center is nearest to the cell.
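The second clustering step can be sketched as follows. The text does not specify which distance is used between a non-core cell and a cluster center, so the Euclidean distance used here is an assumption, as are the function names `cluster_centers` and `assign_noncore`.

```python
from math import dist


def cluster_centers(core_expr, core_labels):
    """Center of each cluster: per-gene mean over its core cells."""
    groups = {}
    for x, k in zip(core_expr, core_labels):
        groups.setdefault(k, []).append(x)
    return {
        k: [sum(col) / len(rows) for col in zip(*rows)]
        for k, rows in groups.items()
    }


def assign_noncore(noncore_expr, centers):
    """Assign each non-core cell to the cluster with the nearest center.

    Assumption: "nearest" is measured by Euclidean distance.
    """
    return [
        min(centers, key=lambda k: dist(x, centers[k]))
        for x in noncore_expr
    ]
```

For instance, with core cells `[[0, 0], [0, 2], [10, 10], [10, 12]]` labelled `[0, 0, 1, 1]`, the centers are `[0.0, 1.0]` and `[10.0, 11.0]`, and the non-core cells `[1, 1]` and `[9, 9]` are assigned to clusters 0 and 1 respectively.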
Availability of data and materials
The datasets used and/or analysed in this study are available from the corresponding articles. Ten datasets are available in the GEO repository with accession numbers GSE59892, GSE52583, GSE71585, GSE65525 and GSE84133 (including the datasets from GSM2230757 to GSM2230762) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59892, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52583, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71585, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65525, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133). Two datasets are available in the ArrayExpress repository with accession numbers E-MTAB-3321 and E-MTAB-2600 (https://www.ebi.ac.uk/arrayexpress/experiments/EMTAB3321/, https://www.ebi.ac.uk/arrayexpress/experiments/EMTAB2600/).
The source code of TSC is available at https://github.com/LiRuiyiraptor/TSC_Project.
Abbreviations
scRNA-seq: Single cell RNA-sequencing
TSC: Two-Step Clustering
SC3: Single-cell consensus clustering
SINCERA: Single cell RNA-seq profiling analysis
CIDR: Clustering through imputation and dimensionality reduction
RaceID: Rare cell type identification
DIMM-SC: Dirichlet mixture model for clustering droplet-based single cell transcriptomic data
NMF: Non-negative matrix factorization
ED: Euclidean distance
MD: Manhattan distance
PCC: Pearson correlation coefficient
SCC: Spearman correlation coefficient
SNN: Shared nearest neighbors
FPKM: Fragments per kilobase of exon model per million mapped reads
CPM: Counts of exon model per million mapped reads
UMI: Unique molecule identifier
ARI: Adjusted rand index
References
Pavlovic M. Cell physiology: Liaison between structure and function. Springer; 2015.
Chen H, Albergante L, Hsu JY, Lareau CA, Bosco GL, Guan J, et al. Singlecell Trajectories Reconstruction, Exploration and Mapping of omics data with STREAM. Nat Commun. 2019;10(1):1903.
Kalisky T, Blainey P, Quake SR. Genomic analysis at the singlecell level. Annu Rev Genet. 2011;45:431–45.
Shapiro E, Biezuner T, Linnarsson S. Singlecell sequencingbased technologies will revolutionize wholeorganism science. Nat Rev Genet. 2013;14(9):618–30.
Biase F, Wu Q, Calandrelli R, RivasAstroza M, Zhou S, Chen Z, et al. Rainbowseq: combining cell lineage tracking with singlecell RNA sequencing in preimplantation embryos. iScience. 2018;7:16–29.
Kalisky T, Quake SR. Singlecell genomics. Nat Methods. 2011;8(4):311–4.
Prabhakaran S, Azizi E, Carr A, Pe’er D. Dirichlet process mixture model for correcting technical variation in singlecell gene expression data. JMLR Workshop and Conference Proceedings. NY: Curran Associates, Inc.; 2016. p. 1070–1079.
Lin P, Troup M, Ho JW. CIDR: Ultrafast and accurate clustering through imputation for singlecell RNAseq data. Genome Biol. 2017;18(1):1–11.
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of singlecell RNAseq data. Nat Methods. 2017;14(5):483–6.
Sun Z, Wang T, Deng K, Wang XF, Lafyatis R, Ding Y, et al. DIMMSC: a Dirichlet mixture model for clustering dropletbased single cell transcriptomic data. Bioinformatics. 2018;34(1):139–46.
Yau C, et al. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016;17(1):1–11.
Shao C, Höfer T. Robust classification of singlecell transcriptome data by nonnegative matrix factorization. Bioinformatics. 2017;33(2):235–42.
Yotsukura S, Nomura S, Aburatani H, Tsuda K, et al. Cell Tree: an R/bioconductor package to infer the hierarchical structure of cell populations from singlecell RNAseq data. BMC Bioinformatics. 2016;17(1):1–17.
Xu C, Su Z. Identification of cell types from singlecell transcriptomes using a novel clustering method. Bioinformatics. 2015;31(12):1974–80.
Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of singlecell gene expression data. Nat Biotechnol. 2015;33(5):495–502.
Jiang L, Chen H, Pinello L, Yuan GC. GiniClust: detecting rare cell types from singlecell gene expression data with Gini index. Genome Biol. 2016;17(1):1–13.
Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, et al. Singlecell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251–5.
Amodio M, Van Dijk D, Srinivasan K, Chen WS, Mohsen H, Moon KR, et al. Exploring singlecell data with deep multitasking neural networks. Nat Methods. 2019;16(11):1139–45.
Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for singlecell transcriptomics. Nat Methods. 2018;15(12):1053–8.
Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, et al. Quantitative singlecell RNAseq with unique molecular identifiers. Nat Methods. 2014;11(2):163–6.
Goetz JJ, Trimarchi JM. Transcriptome sequencing of single cells with SmartSeq. Nat Biotechnol. 2012;30(8):763–5.
Verboom K, Everaert C, Bolduc N, Livak KJ, Yigit N, Rombaut D, et al. SMARTer single cell total RNA sequencing. Nucleic Acids Res. 2019;47(16):e93–e93.
Picelli S, Björklund ÅK, Faridani OR, Sagasser S, Winberg G, Sandberg R. Smartseq2 for sensitive fulllength transcriptome profiling in single cells. Nat Methods. 2013;10(11):1096–8.
Picelli S, Faridani OR, Björklund ÅK, Winberg G, Sagasser S, Sandberg R. Fulllength RNAseq from single cells using Smartseq2. Nat Protoc. 2014;9(1):171–81.
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.
Biase FH, Cao X, Zhong S. Cell fate inclination within 2cell and 4cell mouse embryos revealed by singlecell RNA sequencing. Genome Res. 2014;24(11):1787–96.
Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, et al. Reconstructing lineage hierarchies of the distal lung epithelium using singlecell RNAseq. Nature. 2014;509(7500):371–5.
Goolam M, Scialdone A, Graham SJ, Macaulay IC, Jedrusik A, Hupalowska A, et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4cell mouse embryos. Cell. 2016;165(1):61–74.
Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, et al. Single cell RNAsequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell. 2015;17(4):471–85.
Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci. 2016;19(2):335–46.
Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, et al. A singlecell transcriptomic map of the human and mouse pancreas reveals interand intracell population structure. Cell Syst. 2016;3(4):346–60.
Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218.
Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 2010;11(Oct):2837–54.
Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. J Am Stat Assoc. 1983;78(383):553–69.
Lopez R, Regier J, Cole MB, et al. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15:1053–8.
Guo M, Wang H, Potter SS, Whitsett JA, Xu Y. SINCERA: a pipeline for single-cell RNA-Seq profiling analysis. PLoS Comput Biol. 2015;11(11):e1004575.
Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research. 2018;7:1141.
Li R, Guan J, Zhou S. Single-cell RNA-seq data clustering: A survey with performance comparison study. J Bioinforma Comput Biol. 2020;18(04):2040005.
Hubert M, Van der Veeken S. Outlier detection for skewed data. J Chemom. 2008;22(3–4):235–46.
Jarvis RA, Patrick EA. Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput. 1973;100(11):1025–34.
Pons P, Latapy M. Computing communities in large networks using random walks. J Graph Algorithms Appl. 2006;10(2):191–218.
Ward JH Jr. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58(301):236–44.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Genomics Volume 23 Supplement 6, 2022: Selected articles from the 16th International Symposium on Bioinformatics Research and Applications (ISBRA20): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-23-supplement-6.
Funding
Shuigeng Zhou was supported by the National Natural Science Foundation of China (NSFC) under grant No. 61972100. RuiYi Li and Jihong Guan were supported by the NSFC under grant No. 61772367. NSFC funded the design of the study, the collection, analysis and interpretation of data, and the writing of the manuscript. Publication costs are funded by NSFC grant No. 61972100.
Author information
Contributions
RYL and SGZ conceived this work and designed the experiments. RYL carried out the experiments and drafted the manuscript. RYL, ZYW and JHG collected the data and analyzed the results. SGZ revised the manuscript. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1: Table S1.
Clustering performance evaluation with four metrics.
Additional file 2: Figure S1.
ARI vs. edge filtering threshold. In the subfigure for each dataset, the horizontal axis corresponds to four cases: the number of edges in the graph is N_{e}, 3/4 N_{e}, 1/2 N_{e} and 1/4 N_{e}, where N_{e} indicates the number of edges in the fully connected graph. Curves of different colors represent results of TSC with different similarity/distance measurements.
Additional file 3: Figure S2.
ARI of TSC_{PCC} vs. parameter t. The horizontal axis corresponds to the value of parameter t, and curves of different colors correspond to the results on different datasets.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Li, R., Guan, J., Wang, Z. et al. A new and effective two-step clustering approach for single cell RNA sequencing data. BMC Genomics 23 (Suppl 6), 864 (2022). https://doi.org/10.1186/s12864-023-09577-x
DOI: https://doi.org/10.1186/s12864-023-09577-x