Here we summarized the framework of gCAnno. gCAnno adopts graph structure for cell type specific gene set detection and accurate cell type annotation. Firstly, gCAnno builds cell type-gene bipartite graph based on gene expression abundances and intensities, in which gene expression abundance is the proportion of cells expressing the gene in a given cell type while intensity is the average expression in cells expressing the gene. Then, graph embedding is adopted to obtain the embedding vectors of gene nodes and cell type nodes. Next, gCAnno selects a set of genes for each cell type with similar profiles in the embedding space. Finally, based on the detected cell type specific genes, gCAnno trains naïve Bayes and SVM classifiers. The workflow of gCAnno is depicted in Fig. 1.
Cell type-gene bipartite graph construction
Starting from the well-annotated reference scRNA-seq data, we constructed a weighted cell type-gene bipartite graph (wCGBG) containing both cell type nodes (CTN) and gene nodes (GN). Edges between CTN and GN indicate the correlation of a gene and a cell type while weight W measures significance of correlation. The weight is calculated by:
$$ {w}_{k,j}=\Big\{{\displaystyle \begin{array}{l}\frac{m_{k,j}}{n_k}\times mean\left(\overrightarrow{g_{k,j}}\right),\kern2em if\kern1em {n}_k\ne 0\\ {}0\kern7.8em ,\kern2em others\end{array}}\operatorname{} $$
(1)
where nk is the cell count of cell type k, mj, k is the number of cells expressed gene j in cell type k. \( \overrightarrow{g_{j,k}} \) is the expression vector of gene j in cell type k. W is the product of the gene expression abundance and intensity. We use gene expression abundance and intensity to establish a relationship between cell types and genes in the form of proportion to reduce the impact of individual gene loss (dropout) or cell number imbalance.
Graph embedding and cell type-gene specific relation detection
After wCGBG construction, we used node2vec to obtain the low dimensional vectors (the embedding vectors) of gene nodes and cell type nodes. The first step is construction of a neighborhood set N(u) of each node u (either gene or cell type node) by a probability walk [22]. Then, we optimized the following objective function f(u) by maximizing the log-probability of observing a neighborhood set.
$$ {\max}_f\sum \limits_{u\in V}\log P\left(N(u)|f(u)\right) $$
(2)
This optimization step enables the embedding vectors to capture the specificity and strength of interactions between cell node and gene node, e.g. if one gene is specific and highly expressed in one cell type, the corresponding two embedding vectors are similar. Then, we calculated Euclidean distance between the vector of genes and cell types. We selected top n (a user defined parameter, default n = 65, Additional file 1: Figure S1) closest genes for each cell type as the cell type specific gene set based on the overall performance on the five datasets we used [3,4,5,6, 18].
Classifier construction
After obtaining the cell type specific gene set, we build naïve Bayes (gCAnno-Bayes) and SVM (gCAnno-SVM) classifiers for annotation. For gCAnno-SVM, we directly use the expression of cell type specific genes as features to train an SVM classifier. For gCAnno-Bayes, we build a binary matrix to presents cell type and its corresponding specific genes, e.g. the element bij = 1 indicates gene j is one of the specific genes in cell type i. We train a Bernoulli Naïve Bayes to get genes’ conditional probability in each cell type and the prior probability of cell types. The query dataset is binarized and the annotation is based on maximum posterior probability of single cell’s cell type specific genes expression.
Performance measurement and dataset
Performance assessment and comparison
Cell type annotation is a typical multi-classification problem. We applied kappa coefficient as the performance measurement of classification, defined as Eq. (3).
$$ \kappa =\frac{p_o-{p}_e}{1-{p}_e},\kern0.5em {p}_o=\frac{N_{corr}}{N_t},\kern0.5em {p}_e=\frac{\sum \limits_{i=1}^K{a}_i\times {b}_i}{N_t\times {N}_t} $$
(3)
where Ncorr is the ratio of total number of cells with corrected cell type annotation, Nt is the total number of cells in the dataset, K is the number of truly cell types, ai is the number of corrected annotated cells in the i-th cell type, and bi is the number of cells in the i-th cell type, po is the accuracy, ai × bi is the product of the actual and predicted quantity, pe punishes bias for unbalance evaluation.
To evaluate the performance of gCAnno, we performed both cross-validation test and independent heterogeneous test (cross-platform test). First, we adopted the five-fold cross-validation strategy following recent single cell analysis comparison published earlier [15, 17] on four published datasets and simulated noise datasets to evaluate the overall and robustness performance (Additional file 2: File S1). Then, we performed independent test on datasets from different sequencing platforms (the cross-platform testing) to evaluate the generalization capability of gCAnno.
Tools in comparison
The calculation results of Scmap, Chetah and scPred were obtained from the corresponding publications [13,14,15]. For SVM, we followed the previous report [17] which is using drop-based method [23] for feature selection.
Datasets used in basic overall performance test
To illustrate the stable performance of gCAnno across various species and tissue types, we compared gCAnno with other methods using four published datasets, including liver, pancreas, Arabidopsis thaliana root (AT root), hepatocellular carcinoma and intrahepatic cholangiocarcinoma (HCC and ICCA) datasets (Table 1; Additional file 2: File S1; Additional file 3: Figure S2; Additional file 4: Table S1). The true labels of the cells in each dataset are obtained from the corresponding publications.
Large dataset with deep annotation level
To demonstrate the performance of gCAnno in large dataset (cell number more than 50,000) with deep annotation level (more than 20 cell types). We compared gCAnno with other methods in 20 mouse organs dataset with 54,246 cells, 29 cell types and 23,433 genes. The true labels of the cells in each dataset are also obtained from the original publications [18] (Additional file 2: File S1; Additional file 5: Figure S3; Additional file 6: Table S2).
Simulated dropout and imbalance datasets
To evaluate the robustness of gCAnno in the presence of dropout noise, we simulated different dropout rates in four above datasets (Table 1), by modifying the expression level of a random gene subset (10, 20, 30, 40 and 50% of all genes) to zero (Additional file 2: File S1). Similarly, we used five-fold cross validation to evaluate its performance. In each validation, we simulated the dropout noise in either training group (reference dropout) or test group (query dropout), and calculated the kappa coefficient for each method.
To simulate the cell number imbalance noise, we randomly sampled different proportions (0.1:1, 0.3:1, 0.5:1, 0.7:1, 0.9:1, 1:0.9, 1:0.7, 1:0.5, 1:0.3 and 1:0.1) of cell count in two cell types (Hepatocyte and GamaDetaT) in liver dataset as the reference data for classifier constructing. To get more accuracy testing, this simulation was repeated five times (Additional file 2: File S1).
Cross platform datasets
To compare cross platform performance (various studies using different sequencing platforms), we searched and identified four datasets suitable for this purpose, including two liver datasets from 10x and mCel-seq2 platforms and two pancreas datasets from drop-seq and smart-seq2 platforms (Table 2). We noticed that the cell type annotation labels of the same tissue from different platforms are not identical. Thus, we unified the labels by removing cell types absent in either of the datasets (Additional file 7: Figure S4; Additional file 8: Table S3; Additional file 2: File S1).