The scReClassify framework
Here we summarise the proposed framework of scReClassify. Briefly, scReClassify (Fig. 1) expects an N×M expression matrix (denoted as X), where N is the number of cells and M is the number of genes. Importantly, it also expects that an initial cell type annotation of the cells (denoted as y) is available. This initial annotation may be derived from prior biological knowledge such as cell functions, morphologies, physiologies and marker genes, from computational techniques such as PCA, tSNE, clustering and SOMs, or from combinations of these approaches. Given X and y for a scRNA-seq dataset, scReClassify performs post hoc cell type classification by first using PCA (“The PCA dimension reduction procedure” section) to reduce the dimensionality of the expression matrix and then applying a semi-supervised learning procedure, AdaSampling (“AdaSampling and ensemble learning” section), to learn and adjust cell type labels for cells that are likely to be mislabelled in the initial annotation.
The PCA dimension reduction procedure
Because scRNA-seq data are of high feature-dimensionality (i.e. a large number of genes is measured in each cell) and only a small fraction of these genes are cell type-specific and therefore informative for cell type identification, it is often necessary to apply dimension reduction techniques prior to downstream cell type identification and analysis [16].
Starting from the original scRNA-seq expression dataset, denoted as an N×M matrix as described above, we apply PCA for dimension reduction. We select the number of PCs to use (M′) by first determining d, the number of PCs required to capture at least 70% of the overall variability in the dataset. If d falls within the range of 10 to 20, we set M′=d; otherwise M′ is set to the nearer bound of 10 or 20 (summarised in Eq. 1). After the PCA procedure, the dimension of the original expression matrix is reduced to N×M′.
$$ M' = \left\{ \begin{array}{ll} 10, & d < 10 \\ d, & 10 \leq d \leq 20 \\ 20, & d > 20 \end{array} \right. $$
(1)
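As an illustration, the PC-selection rule in Eq. (1) can be sketched in R as follows. This is a minimal sketch under our own assumptions (the function and variable names are illustrative, not those of the scReClassify package), with `X` a log-transformed expression matrix with cells in rows and genes in columns:

```r
# X: N x M log-transformed expression matrix (cells in rows, genes in columns)
select_pcs <- function(X, var.explained = 0.7, lower = 10, upper = 20) {
  pca <- prcomp(X, center = TRUE, scale. = FALSE)
  # proportion of overall variability captured by each PC
  prop.var <- pca$sdev^2 / sum(pca$sdev^2)
  # d: smallest number of PCs capturing at least 70% of the variability
  d <- which(cumsum(prop.var) >= var.explained)[1]
  M.prime <- min(max(d, lower), upper)      # clamp d to [10, 20] as in Eq. (1)
  pca$x[, seq_len(M.prime), drop = FALSE]   # N x M' matrix of PC scores
}
```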
AdaSampling and ensemble learning
Most scRNA-seq studies annotate each cell profiled in an experiment by its most likely cell type, classifying all cells into a finite set of cell types. Due to technical limitations and/or limitations in current biological knowledge, the cell type annotation for a given scRNA-seq dataset may contain a proportion of mislabelled cells (also known as label noise, denoted as ρ). AdaSampling is a wrapper approach that couples a weighted sampling procedure with a probabilistic classification model for learning and correcting mislabelled instances in a dataset [27, 31]. Here we extend AdaSampling to multi-class classification and tailor its application to scRNA-seq datasets.
Let us denote the initial observed cell type label of the ith cell as yi=k, where k∈{1,...,K}, K is the total number of cell types in a dataset, i∈{1,...,N} is the cell index, and N is the total number of cells. In a multi-class classification problem (K≥3), the probability εi of a cell being mislabelled can be estimated by a probabilistic classification model as follows:
$$ \varepsilon_{i} = P(\hat{y}_{i} \neq k | \mathbf{\mathrm{x}}_{i},y_{i} = k, D_{\rho}) = 1 - P(\hat{y}_{i} = k | \mathbf{\mathrm{x}}_{i},y_{i} = k, D_{\rho}) $$
(2)
where \(P(\hat {y}_{i} \neq k |\mathbf {\mathrm {x}}_{i},y_{i} = k, D_{\rho })\) is the probability of the ith cell, with a pre-assigned type of k, being classified as a cell type other than k by a given classifier trained on the dataset Dρ. Here xi∈X is the expression profile of the ith cell.
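Given a matrix of classifier-estimated class probabilities (rows for cells, columns for cell types), Eq. (2) amounts to one minus the probability assigned to each cell's current label. A minimal illustrative sketch in R, with `prob.matrix` and `y` being our own hypothetical names:

```r
# prob.matrix: N x K matrix of classifier-estimated class probabilities
# y: factor of initially assigned cell type labels (length N)
mislabel_prob <- function(prob.matrix, y) {
  # epsilon_i = 1 - P(y_hat = assigned label | x_i)
  1 - prob.matrix[cbind(seq_along(y), match(y, colnames(prob.matrix)))]
}
```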
To identify these mislabelled cells and subsequently reclassify them to their correct cell types, AdaSampling starts by treating all cells with a uniform probability \(\frac {1}{N}\) of being selected and training a given classification model on Dρ that estimates the mislabelling probability of each cell using Eq. (2). Subsequently, AdaSampling applies a weighted sampling from Dρ in which each cell i has a probability of 1−εi of being included in the updated training data Dρ,l (where l=1,...,L indexes the iterations). In other words, cells that are likely to be mislabelled (i.e. those with large εi) have a lower chance of being included in the next iteration of model training. This procedure iteratively improves the accuracy of the trained classification model by improving the quality of the cell labels used to train it. Our previous evaluation suggests that only a few iterations are needed to achieve good performance [31]. Our implemented package allows users to specify this parameter, and a default of 3 iterations was used in this study.
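The iterative re-weighting can be sketched as follows. This is one possible variant, not the scReClassify implementation itself; we use an e1071 SVM purely as an example of a probabilistic base classifier, and all object names are our own:

```r
library(e1071)

# X.pc: N x M' matrix of PC scores; y: factor of initial cell type labels
ada_sampling <- function(X.pc, y, L = 3) {
  N <- nrow(X.pc)
  weights <- rep(1 / N, N)       # start with uniform selection probability
  for (l in seq_len(L)) {
    # weighted sampling: cells likely to be mislabelled are selected less often
    idx <- sample(N, size = N, replace = TRUE, prob = weights)
    fit <- svm(X.pc[idx, ], y[idx], kernel = "radial", probability = TRUE)
    pred <- predict(fit, X.pc, probability = TRUE)
    prob <- attr(pred, "probabilities")
    # epsilon_i = 1 - P(assigned label); update selection weights to 1 - epsilon_i
    eps <- 1 - prob[cbind(seq_len(N), match(y, colnames(prob)))]
    weights <- 1 - eps
  }
  weights   # final selection probabilities, i.e. 1 - epsilon_{i,L}
}
```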
The final estimate of the mislabelling probability for each cell is εi,L (i=1,...,N). An εi,L-weighted sampling from Dρ can be applied to generate the final training dataset Dρ,L+1, and the classification model trained on Dρ,L+1 can then be used to reclassify all cells in the original dataset Dρ. Alternatively, the εi,L-weighted sampling can be performed multiple times, resulting in B training datasets \(D^{b}_{\rho,L+1}\) (b=1,...,B). Each of these datasets can then be used to train a base classification model, and an ensemble classification of all cells in Dρ can be obtained by combining the classification probabilities of each cell from each of the base models.
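A hedged sketch of this ensemble step, reusing the final weights from the loop above (again with an e1071 SVM as the example base classifier and with illustrative names of our own choosing):

```r
# X.pc: PC-score matrix; y: initial labels; weights: 1 - epsilon_{i,L} from AdaSampling
ensemble_reclassify <- function(X.pc, y, weights, B = 10) {
  N <- nrow(X.pc)
  prob.sum <- matrix(0, N, nlevels(y), dimnames = list(NULL, levels(y)))
  for (b in seq_len(B)) {
    idx <- sample(N, size = N, replace = TRUE, prob = weights)
    yb <- droplevels(y[idx])
    fit <- e1071::svm(X.pc[idx, ], yb, kernel = "radial", probability = TRUE)
    prob <- attr(predict(fit, X.pc, probability = TRUE), "probabilities")
    prob.sum[, colnames(prob)] <- prob.sum[, colnames(prob)] + prob
  }
  # final reclassification: the class with the highest averaged probability
  factor(levels(y)[max.col(prob.sum / B)], levels = levels(y))
}
```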
Base classifiers
The AdaSampling framework, essentially a wrapper procedure, relies on base classification models to estimate the mislabelling probability of each cell in a scRNA-seq dataset. In this work, we used either support vector machine (SVM) [32] or random forest (RF) [33] as base classifiers to provide such probabilistic estimates, but any other classifiers capable of providing probabilistic estimates can be used. Specifically, for SVM, the prediction probability of each cell is estimated using Platt’s method [34] as follows:
$$ P(\hat{y} = k|\mathbf{\mathrm{x}},D) = \frac{1}{1 + \text{exp}(A \times f(\mathbf{\mathrm{x}}) + B)} $$
$$ f(\mathbf{\mathrm{x}}) = \beta + \sum_{\tau \in S} \alpha_{\tau} \text{exp}(-\gamma ||\mathbf{\mathrm{x}} - \mathbf{\mathrm{x}}_\tau||^{2}_{2}) $$
where S is the set of support vectors, and A and B are the parameters (estimated by maximum likelihood) of the sigmoid link function that converts the SVM output f(x) into a probability. ατ and β represent the coefficient of the τth support vector and the intercept term, respectively, and exp(\(-\gamma ||\mathbf{\mathrm{x}}-\mathbf{\mathrm{x}}_{\tau }||^{2}_{2}\)) is the radial basis function kernel. Note that we used a one-against-one approach for multi-class learning with SVM.
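In practice, Platt-scaled class probabilities from an RBF-kernel SVM can be obtained directly, for example with the e1071 R package (shown here only as one possible implementation route; `X.pc` and `y` are our own placeholder names):

```r
library(e1071)

# X.pc: N x M' PC-score matrix; y: factor of cell type labels
fit <- svm(X.pc, y, kernel = "radial", probability = TRUE)  # one-against-one multi-class SVM
pred <- predict(fit, X.pc, probability = TRUE)
prob <- attr(pred, "probabilities")   # N x K matrix of Platt-scaled class probabilities
```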
For RF, the prediction probability of each cell is estimated from a collection of trees as follows:
$$ P(\hat{y} = k|\mathbf{\mathrm{x}},D) = \frac{1}{B}\sum_{b=1}^{B} \text{round}(h(\mathbf{\mathrm{x}}|\theta_{b})) $$
(3)
where h(x|θb) denotes the bth classification tree, parameterised by θb and trained on the bth bootstrap sample of the data with a subset of features randomly sampled from the M′ dimensions. The round(.) function rounds the probability output of each tree to the nearest integer (0 or 1). That is, each tree casts a class prediction, and the probability of a class is estimated as the fraction of trees voting for that class. B is set to a default value of 100.
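Such vote-fraction probabilities are available from standard random forest implementations; for example, with the randomForest R package (an illustrative sketch with our own variable names):

```r
library(randomForest)

# X.pc: N x M' PC-score matrix; y: factor of cell type labels
rf <- randomForest(x = X.pc, y = y, ntree = 100)   # B = 100 trees
prob <- predict(rf, X.pc, type = "prob")           # fraction of trees voting for each class
```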
Simulated and experimental scRNA-seq datasets
We used both simulated and real-world experimental scRNA-seq datasets to evaluate the proposed method. To simulate scRNA-seq datasets, we used the Splatter R package [35]. Specifically, datasets with 3, 5, 7 and 9 cell types were simulated. The number of cells in each type was set to 100, the number of genes in each transcriptome was set to 10,000 and the probability of a gene being differentially expressed was set to 0.1. To simulate mislabelling, the cell type labels of 10%, 20%, 30%, 40% and 50% of randomly selected cells in each cell type group were flipped to the next cell type group. This was done for all cell types in each dataset and therefore introduced overall rates of 10%, 20%, 30%, 40% and 50% mislabelled cells in each dataset with known ground-truth labels.
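The label-flipping scheme can be sketched in R as follows. This is a minimal illustration assuming `y` is a factor of ground-truth cell type labels produced by the simulation; the function name and the wrap-around choice for the last cell type are our own assumptions:

```r
# y: factor of ground-truth cell type labels; rho: mislabelling proportion (e.g. 0.1 to 0.5)
flip_labels <- function(y, rho) {
  y.noisy <- y
  types <- levels(y)
  K <- length(types)
  for (k in seq_len(K)) {
    cells.k <- which(y == types[k])
    flip <- sample(cells.k, size = round(rho * length(cells.k)))
    # flip to the "next" cell type group (wrapping around after the last type)
    y.noisy[flip] <- types[if (k == K) 1 else k + 1]
  }
  y.noisy
}
```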
For the experimental scRNA-seq datasets (Table 1), the cell type annotation from the original studies was treated as the ‘gold standard’ for method evaluation. The same procedure, in which 10%, 20%, 30%, 40% and 50% of randomly selected cells in each cell type group were flipped to the next cell type group, was used to introduce mislabelled cells into each dataset. Both the simulated and experimental scRNA-seq datasets were in units of fragments per kilobase of transcript per million mapped reads (FPKM) and were converted to log2(FPKM+1). Genes with more than 80% missing values across cells were removed. This is because, while all genes are typically included in a scRNA-seq dataset, only a subset of them are expressed; many genes are therefore expected to have no quantification, and removing non-expressed genes is a common pre-processing step [36]. After filtering, these datasets were used to assess whether, and to what degree, a method can correctly recover the mislabelled cells in each dataset.
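A sketch of this pre-processing, assuming `fpkm` is a genes-by-cells FPKM matrix in which unquantified genes are recorded as zeros (the variable names are our own):

```r
# fpkm: M x N matrix of FPKM values (genes in rows, cells in columns)
logged <- log2(fpkm + 1)
# keep genes with at most 80% missing (zero) values across cells
keep <- rowMeans(logged == 0) <= 0.8
filtered <- logged[keep, , drop = FALSE]
```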
Evaluation metrics
The mean classification accuracy and adjusted Rand index (ARI) were used to assess the performance of cell type identification for each dataset. For each cell i in a scRNA-seq dataset, let us denote the observed cell type label as yi and the ground-truth label (or gold standard label in the case of the experimental datasets) as si; the mean classification accuracy is then defined as follows:
$$ \mathrm{mean~accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_{k}}\sum_{s_{i} = k} I\left(\hat{y}_{i}=s_{i}\right) $$
(4)
where \(\hat {y}_{i}\) is the cell type given by the classification model, k∈{1,...,K} is the index of cell types in a dataset with K cell types, I(.) is the indicator function, and Nk is the number of cells in the kth cell type.
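Equation (4) averages the per-cell-type accuracies, which can be computed directly; a small illustrative R sketch (function and variable names are our own):

```r
# y.hat: predicted cell type labels; s: ground-truth (gold standard) labels, both factors
mean_accuracy <- function(y.hat, s) {
  per.class <- tapply(y.hat == s, s, mean)   # accuracy within each cell type
  mean(per.class)                            # average over the K cell types, as in Eq. (4)
}
```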
ARI is a measure of the concordance between two lists of groupings [37] and is widely used to benchmark the performance of clustering methods for scRNA-seq data. One advantage of ARI over the mean classification accuracy is that ARI adjusts for the size of each cell type (i.e. the number of cells in each cell type group), whereas the mean classification accuracy does not distinguish between the different numbers of cells across cell type groups. Let \(y_{i}, \hat {y}_{i}\), and si be the lists of cell type labels for each cell in a scRNA-seq dataset. For any two of these lists, we define a as the number of pairs of cells labelled as the same cell type in both lists, b as the number of pairs labelled as the same cell type in the first list but as different cell types in the second, c as the number of pairs labelled as different cell types in the first list but as the same cell type in the second, and d as the number of pairs labelled as different cell types in both lists. ARI is then defined as follows:
$$ \text{ARI}=\frac{2(ad-bc)}{(a+b)(b+d)+(a+c)(c+d)} $$
(5)
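Equation (5) can be evaluated directly from the pair counts, for example with the illustrative R sketch below (naming is our own; in practice, functions such as mclust::adjustedRandIndex() compute the same index):

```r
# ARI from pair counts, following Eq. (5); x and y are two label vectors of equal length
ari_pairs <- function(x, y) {
  n <- length(x)
  pairs <- combn(n, 2)                         # all pairs of cell indices
  same.x <- x[pairs[1, ]] == x[pairs[2, ]]
  same.y <- y[pairs[1, ]] == y[pairs[2, ]]
  n.a <- sum(same.x & same.y)                  # a: same type in both lists
  n.b <- sum(same.x & !same.y)                 # b: same in first, different in second
  n.c <- sum(!same.x & same.y)                 # c: different in first, same in second
  n.d <- sum(!same.x & !same.y)                # d: different in both lists
  2 * (n.a * n.d - n.b * n.c) /
    ((n.a + n.b) * (n.b + n.d) + (n.a + n.c) * (n.c + n.d))
}
```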