gCAnno: a graph-based single cell type annotation method

Background Current single cell analysis methods annotate cell types at the cluster level rather than, ideally, at the single cell level. Multiple interchangeable clustering methods and many tunable parameters have a substantial impact on the clustering outcome, often leading to incorrect cluster-level annotation or multiple runs of subsequent clustering steps. To address these limitations, methods based on well-annotated reference atlases have been proposed. However, these methods are currently not robust enough to handle datasets with different noise levels or from different platforms. Results Here, we present gCAnno, a graph-based Cell type Annotation method. First, gCAnno constructs a cell type-gene bipartite graph and adopts graph embedding to obtain cell type specific genes. Then, naïve Bayes (gCAnno-Bayes) and SVM (gCAnno-SVM) classifiers are built for annotation. We compared the performance of gCAnno to other state-of-the-art methods on multiple single cell datasets, either with various noise levels or from different platforms. The results showed that gCAnno outperforms other state-of-the-art methods with higher accuracy and robustness. Conclusions gCAnno is a robust and accurate cell type annotation tool for single cell RNA analysis. The source code of gCAnno is publicly available at https://github.com/xjtu-omics/gCAnno. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-020-07223-4.

to establish mappings between the query dataset and the well-studied biomarkers. In particular, Garnett trains a classifier based on a user-defined markup language. CellAssign builds a probabilistic model that leverages prior knowledge of cell-type marker genes for annotation. However, collecting a comprehensive biomarker set of different cell types is cumbersome, time-consuming and subjective. Thus, reference-based approaches, such as Scmap [13], Chetah [14] and scPred [15], have recently been developed and are gaining popularity after a number of well-annotated single cell datasets were published, especially the datasets released by the Human Cell Atlas (HCA) [16]. The reference-based methods follow a data-driven strategy and construct mappings between the query dataset and the well-annotated reference datasets. For example, Scmap uses a drop-based method to select feature genes as variables and constructs the mapping by distance and correlation coefficient. Another method, scPred, selects principal components (PCs) of the gene expression values that differ between cell types and trains an SVM model with these PCs. Recently, a comprehensive benchmark study [17] of 22 cell type classification methods indicated that the SVM classifier has the best overall performance. However, these methods are sensitive to experimental batches, sequencing platforms and noise, all of which are intrinsic properties of single cell datasets.
Here, we propose a reference-based method, gCAnno, which uses a graph representation feature selection strategy to capture a global view of the associations between cell types and genes for robust and highly accurate single cell-level annotation. gCAnno starts with the construction of a weighted cell type-gene bipartite graph. Then, graph embedding is applied to capture the cell type specific genes, and naïve Bayes (gCAnno-Bayes) and SVM (gCAnno-SVM) classifiers are built for annotation (Fig. 1). We compared gCAnno with state-of-the-art methods on four published datasets as the basic test [3][4][5][6]. As the advanced test, we also report the performance comparison on a large dataset with a deep annotation level [18], on different single cell platforms, and on simulated datasets with either various cell type imbalance situations or different dropout noise levels. Finally, runtime is summarized to demonstrate the efficiency of gCAnno.

Results
To evaluate the performance of gCAnno, we first evaluated the cell type-gene specific relation, and then compared gCAnno with five state-of-the-art methods, including Scmap-cell, Scmap-cluster, Chetah, scPred and SVM, in the following four aspects: 1) cell type specificity of gCAnno detected genes, 2) overall performance on different scRNA-seq datasets, 3) robustness test on simulated dropout and imbalance noise data, 4) cross platform annotation.
Fig. 1 Overview of gCAnno. a Cell type-gene graph building. The graph contains gene nodes (gray circles) and cell type nodes (other color circles). b Graph embedding converts graphs into low dimensional vectors. Genes are selected based on the distance between the two types of vectors. c Training Naïve Bayes and SVM classifiers for annotation. d Cell type annotation for new query dataset

Cell type specificity of gene sets detected by gCAnno
After the graph embedding step, gCAnno selects cell type specific gene sets, which largely determine the performance of our approach. Thus, we first evaluated the cell type specificity of the gene sets detected in the four datasets. We observed clear cell type specific expression patterns for these selected genes (Fig. 2; Additional file 9: Figure S5; Additional file 10: Figure S6). Among the reported marker genes from the corresponding publications, gCAnno is able to capture an average of 57% of them, indicating the effectiveness of gCAnno in identifying cell type specific genes (Additional file 11: Figure S7; Additional file 12: Table S4).

Robustness on dropout and imbalance noisy data
Besides basic accuracy, we examined the robustness of gCAnno in the presence of different types of noise. Dropout and cell count imbalance are two major types of noise and the most challenging in scRNA-seq data. Dropout is technical noise in the form of missing values in gene expression [10], while cell number imbalance among cell types arises from the biology itself. We found that gCAnno achieved the highest and rather stable kappa coefficients for both reference dropout and query dropout tests in four datasets (Fig. 4; Additional file 15: Figure S9; Additional file 16: Table S6; Additional file 17: Figure S10). Remarkably, gCAnno achieved average kappa coefficients of 0.88 (gCAnno-SVM) and 0.79 (gCAnno-Bayes) even when the dropout rate was as high as 50%, while the other methods achieved 0 (Scmap-cluster), 0.44 (Scmap-cell), 0.37 (Chetah), 0.25 (scPred) and 0.79 (SVM), respectively. Moreover, we found that gCAnno, SVM and Scmap-cell achieved the highest and most stable kappa coefficients (average values of about 0.99) for different cell count imbalance ratios (Additional file 15: Figure S9; Additional file 18: Table S7). All of these results show that gCAnno is better than other methods under dropout and cell count imbalance noise and achieved the best performance on highly noisy data (e.g. 50% dropout rate and 1:0.1 imbalance rate), suggesting the effectiveness of the weighted cell type-gene bipartite graph (wCGBG) in selecting accurate features in the presence of high noise.

Cross platform annotation
Different single cell sequencing platforms have platform specific features or bias [19], limiting cross platform cell type annotation. We evaluated the platform compatibility of gCAnno on two liver datasets [4,20] and two pancreas datasets [3,21] from four platforms (10x, mCel-seq2, Drop-seq, and Smart-seq2) (Table 2). We used the dataset from one platform as the training data and the dataset from the other platform as the testing data. In the performance comparison, gCAnno achieved consistently high kappa coefficient values for the liver dataset tests (Fig. 5a and b) and for the pancreas dataset tests (Fig. 5c and d) (Additional file 19: Table S8). These results show that gCAnno is able to maintain high annotation accuracy for real heterogeneous and cross platform data in the presence of systematic platform specific bias.

Runtime evaluation
Finally, we evaluated the runtime of gCAnno on the datasets used in the above tests (Additional file 20: Table S9; Additional file 21: Figure S11). We found that the time taken by the model building step (including graph construction and embedding) is positively correlated with the number of graph nodes (Pearson's correlation is 0.94). Once the model has been built, the annotation step takes less than 1 min (e.g. 48 s for the mCel-seq2 platform liver dataset with 8103 cells).

Discussion
In this study, we present gCAnno, a novel graph-based cell type identification method for scRNA-seq data. The most significant feature of gCAnno is the construction of the wCGBG, which enables gCAnno to capture the global characteristics of the association between cell types and genes. This feature allows gCAnno to detect accurate feature genes for each cell type, leading to accurate annotation results and robustness to different noise types and rates. In addition, gCAnno is able to annotate not only human scRNA-seq data, but also plant scRNA-seq data (e.g. Arabidopsis data), and it shows stable and high performance across different platforms.
gCAnno contains an SVM version (gCAnno-SVM) and a naïve Bayes version (gCAnno-Bayes). The SVM version takes into account the effect of the expression value, while the naïve Bayes version only considers the presence of cell type specific genes. From the evaluation result, the SVM version seems suitable for the dataset with deep
Since gCAnno is a reference-based cell type annotation method, it lacks the ability to identify novel cell types. For cells of a novel type, gCAnno assigns the closest cell type with the most similar expression profile, which might be reasonable in most applications but probably requires further improvement. We think that integrating the biomarker-based method for novel cell type annotation with the reference-based method for accurate annotation of predefined cell types will be one direction to explore.

Conclusion
We have implemented a stable and high-performance automated cell type annotation tool, gCAnno, for scRNA-seq datasets. With an easy-to-use Python script provided as an example, we hope gCAnno will be useful for scRNA-seq data analysis.

Methods
Here we summarize the framework of gCAnno. gCAnno adopts a graph structure for cell type specific gene set detection and accurate cell type annotation. First, gCAnno builds a cell type-gene bipartite graph based on gene expression abundances and intensities, where the gene expression abundance is the proportion of cells expressing the gene in a given cell type and the intensity is the average expression in the cells expressing the gene. Then, graph embedding is adopted to obtain the embedding vectors of gene nodes and cell type nodes. Next, gCAnno selects, for each cell type, a set of genes with similar profiles in the embedding space. Finally, based on the detected cell type specific genes, gCAnno trains naïve Bayes and SVM classifiers. The workflow of gCAnno is depicted in Fig. 1.

Cell type-gene bipartite graph construction
Starting from the well-annotated reference scRNA-seq data, we constructed a weighted cell type-gene bipartite graph (wCGBG) containing both cell type nodes (CTN) and gene nodes (GN). Edges between CTN and GN indicate the correlation of a gene and a cell type, while the weight W measures the significance of the correlation. The weight is calculated by:
$$W_{j,k} = \frac{m_{j,k}}{n_k} \times \frac{\sum_{i=1}^{n_k} g_{j,k,i}}{m_{j,k}} \qquad (1)$$
where $n_k$ is the cell count of cell type k, $m_{j,k}$ is the number of cells expressing gene j in cell type k, and $\vec{g}_{j,k}$ is the expression vector of gene j in cell type k, with entries $g_{j,k,i}$. W is the product of the gene expression abundance (first factor) and intensity (second factor). We use gene expression abundance and intensity to establish a relationship between cell types and genes in the form of proportions, which reduces the impact of individual gene loss (dropout) or cell number imbalance.
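The weight described above, the product of abundance and intensity, can be illustrated with a minimal pure-Python sketch. The function name `edge_weights`, the toy matrix, the rule that a gene counts as expressed when its value is greater than zero, and the choice to omit edges for genes never expressed in a cell type are all assumptions made for illustration, not details taken from the gCAnno source code.

```python
def edge_weights(expr, labels):
    """Return {(cell_type, gene_index): weight} for a cells-by-genes matrix.

    expr   : list of per-cell expression lists (non-negative values)
    labels : list of cell type labels, one per cell
    """
    # Group the expression rows by cell type.
    cells_by_type = {}
    for row, label in zip(expr, labels):
        cells_by_type.setdefault(label, []).append(row)

    weights = {}
    for cell_type, rows in cells_by_type.items():
        n_k = len(rows)  # number of cells of this type
        for j in range(len(rows[0])):
            values = [row[j] for row in rows]
            # Assumption: a gene is "expressed" in a cell when its value is > 0.
            m_jk = sum(1 for v in values if v > 0)
            if m_jk == 0:
                continue  # assumption: no edge when the gene is never expressed
            abundance = m_jk / n_k           # proportion of expressing cells
            intensity = sum(values) / m_jk   # mean over expressing cells only
            weights[(cell_type, j)] = abundance * intensity
    return weights


# Toy data: four cells, two genes, two cell types (hypothetical values).
expr = [
    [4.0, 0.0],
    [0.0, 0.0],
    [0.0, 3.0],
    [0.0, 2.0],
]
labels = ["A", "A", "B", "B"]
weights = edge_weights(expr, labels)
```

In this toy example, gene 0 is expressed in one of the two type-A cells, so its abundance is 0.5 and its intensity is 4.0.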

Graph embedding and cell type-gene specific relation detection
After wCGBG construction, we used node2vec to obtain the low dimensional vectors (the embedding vectors) of gene nodes and cell type nodes. The first step is the construction of a neighborhood set N(u) for each node u (either a gene or a cell type node) by a biased random walk [22]. Then, we learned the embedding function f by maximizing the log-probability of observing the neighborhood set of each node given its embedding:
$$\max_{f} \sum_{u \in V} \log \Pr\left(N(u) \mid f(u)\right) \qquad (2)$$
where V is the set of all nodes in the wCGBG and f(u) is the embedding vector of node u.
This optimization step enables the embedding vectors to capture the specificity and strength of the interactions between cell type nodes and gene nodes, e.g. if a gene is specific to and highly expressed in one cell type, the two corresponding embedding vectors are similar. Then, we calculated the Euclidean distance between the vectors of genes and cell types. We selected the top n closest genes for each cell type as the cell type specific gene set, where n is a user-defined parameter (default n = 65, Additional file 1: Figure S1) chosen based on the overall performance on the five datasets we used [3][4][5][6][18].
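The selection of the n closest genes by Euclidean distance can be sketched as follows. The embedding vectors are taken as given (in gCAnno they come from node2vec); the function name `select_specific_genes`, the node names and the two-dimensional toy vectors are hypothetical.

```python
import math


def select_specific_genes(cell_type_vecs, gene_vecs, n=65):
    """For each cell type, return the n genes whose embedding is closest.

    cell_type_vecs : {cell_type: embedding vector}
    gene_vecs      : {gene: embedding vector}
    """
    selected = {}
    for cell_type, ct_vec in cell_type_vecs.items():
        # Rank all genes by Euclidean distance to this cell type's vector.
        ranked = sorted(gene_vecs, key=lambda g: math.dist(ct_vec, gene_vecs[g]))
        selected[cell_type] = ranked[:n]
    return selected


# Toy embeddings (hypothetical values, two dimensions for readability).
cell_type_vecs = {"type_1": [0.0, 0.0], "type_2": [5.0, 5.0]}
gene_vecs = {
    "gene_a": [0.1, 0.0],
    "gene_b": [0.0, 0.5],
    "gene_c": [5.0, 4.8],
    "gene_d": [4.0, 5.0],
}
specific = select_specific_genes(cell_type_vecs, gene_vecs, n=2)
```

With the default n = 65 and fewer than 65 genes available, the slice simply returns all genes, so a caller working with small gene panels may want to pass a smaller n.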

Classifier construction
After obtaining the cell type specific gene set, we build naïve Bayes (gCAnno-Bayes) and SVM (gCAnno-SVM) classifiers for annotation. For gCAnno-SVM, we directly use the expression of cell type specific genes as features to train an SVM classifier. For gCAnno-Bayes, we build a binary matrix to represent each cell type and its corresponding specific genes, e.g. the element b_ij = 1 indicates that gene j is one of the specific genes of cell type i. We train a Bernoulli naïve Bayes classifier to obtain the conditional probability of each gene in each cell type and the prior probability of each cell type. The query dataset is binarized, and each single cell is annotated with the cell type that has the maximum posterior probability given the expression of its cell type specific genes.
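As a rough illustration of the naïve Bayes step, the sketch below implements a small Bernoulli naïve Bayes classifier in pure Python. It assumes that training uses binarized reference cells restricted to the selected genes, that binarization means "value greater than zero", and that Laplace smoothing with alpha = 1 is applied; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math


def binarize(cell, threshold=0.0):
    """Turn an expression vector into 0/1 detection flags (assumed rule)."""
    return [1 if v > threshold else 0 for v in cell]


def train_bernoulli_nb(binary_cells, labels, alpha=1.0):
    """Estimate log priors and per-gene log detection probabilities."""
    classes = sorted(set(labels))
    n_genes = len(binary_cells[0])
    log_prior, log_p, log_q = {}, {}, {}
    for c in classes:
        rows = [r for r, l in zip(binary_cells, labels) if l == c]
        log_prior[c] = math.log(len(rows) / len(labels))
        # Laplace-smoothed probability that gene j is detected in class c.
        p = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
             for j in range(n_genes)]
        log_p[c] = [math.log(x) for x in p]
        log_q[c] = [math.log(1.0 - x) for x in p]
    return classes, log_prior, log_p, log_q


def predict(model, binary_cell):
    """Return the class with the maximum posterior probability."""
    classes, log_prior, log_p, log_q = model

    def log_posterior(c):
        return log_prior[c] + sum(
            lp if x else lq
            for x, lp, lq in zip(binary_cell, log_p[c], log_q[c])
        )

    return max(classes, key=log_posterior)


# Toy reference: four cells, four selected genes (hypothetical values).
reference = [[5.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 4.0, 1.0], [0.0, 0.0, 2.0, 0.0]]
ref_labels = ["A", "A", "B", "B"]
model = train_bernoulli_nb([binarize(c) for c in reference], ref_labels)
prediction_1 = predict(model, binarize([2.0, 1.0, 0.0, 0.0]))
prediction_2 = predict(model, binarize([0.0, 0.0, 0.0, 3.0]))
```

Working in log space avoids numerical underflow when the number of selected genes is large, which is the usual reason for this design.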

Performance measurement and dataset
Performance assessment and comparison
Cell type annotation is a typical multi-class classification problem. We used the kappa coefficient as the performance measure of classification, defined as Eq. (3):
$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \frac{N_{corr}}{N_t}, \qquad p_e = \frac{\sum_{i=1}^{K} a_i \times b_i}{N_t^{2}} \qquad (3)$$
where $N_{corr}$ is the total number of cells with a correct cell type annotation, $N_t$ is the total number of cells in the dataset, K is the number of true cell types, $a_i$ is the number of cells annotated as the i-th cell type, and $b_i$ is the number of cells in the i-th cell type. $p_o$ is the accuracy, $a_i \times b_i$ is the product of the actual and predicted quantities, and $p_e$ penalizes bias when evaluating unbalanced data.
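The kappa coefficient can be computed from two label lists as in the sketch below. It follows the standard Cohen's kappa form, with p_o as the observed accuracy and p_e as the chance agreement built from the per-class products of actual and predicted counts; the function name `kappa` and the toy labels are illustrative.

```python
from collections import Counter


def kappa(true_labels, predicted_labels):
    """Cohen's kappa between true and predicted cell type labels."""
    n_total = len(true_labels)
    n_correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    p_o = n_correct / n_total  # observed accuracy

    actual = Counter(true_labels)         # cells truly in each type
    predicted = Counter(predicted_labels)  # cells annotated as each type
    # Chance agreement: sum over types of actual count times predicted count.
    p_e = sum(actual[c] * predicted[c] for c in actual) / n_total ** 2

    # Note: undefined (division by zero) when p_e == 1, i.e. a single class.
    return (p_o - p_e) / (1 - p_e)


true_labels = ["A", "A", "B", "B"]
predicted_labels = ["A", "A", "B", "A"]
score = kappa(true_labels, predicted_labels)
```

Here three of four cells are correct (p_o = 0.75) and the chance agreement is (2 × 3 + 2 × 1) / 16 = 0.5, so the resulting kappa is 0.5.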
To evaluate the performance of gCAnno, we performed both a cross-validation test and an independent heterogeneous test (cross-platform test). First, following recently published single cell analysis comparisons [15,17], we adopted a five-fold cross-validation strategy on four published datasets and simulated noise datasets to evaluate overall performance and robustness (Additional file 2: File S1). Then, we performed an independent test on datasets from different sequencing platforms (the cross-platform test) to evaluate the generalization capability of gCAnno.
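A five-fold split of cell indices can be sketched as below. This is a plain shuffled split with a fixed seed; whether the original evaluation stratified the folds by cell type is not stated, so the absence of stratification here is an assumption, and `k_fold_indices` is a hypothetical helper name.

```python
import random


def k_fold_indices(n_cells, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(23, k=5))
```

Each cell appears in exactly one test fold, so averaging a metric such as the kappa coefficient over the five folds uses every cell once for testing.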

Tools in comparison
The calculation results of Scmap, Chetah and scPred were obtained from the corresponding publications [13][14][15].
For SVM, we followed the previous report [17], which uses a drop-based method [23] for feature selection.

Datasets used in basic overall performance test
To illustrate the stable performance of gCAnno across various species and tissue types, we compared gCAnno with other methods using four published datasets, including liver, pancreas, Arabidopsis thaliana root (AT root), hepatocellular carcinoma and intrahepatic cholangiocarcinoma (HCC and ICCA) datasets (Table 1; Additional file 2: File S1; Additional file 3: Figure S2; Additional file 4: Table S1). The true labels of the cells in each dataset are obtained from the corresponding publications.

Large dataset with deep annotation level
To demonstrate the performance of gCAnno on a large dataset (more than 50,000 cells) with a deep annotation level (more than 20 cell types), we compared gCAnno with other methods on a dataset of 20 mouse organs with 54,246 cells, 29 cell types and 23,433 genes. The true labels of the cells are also obtained from the original publication [18] (Additional file 2: File S1; Additional file 5: Figure S3; Additional file 6: Table S2).
Fig. 5 Platform compatibility evaluation. Performance comparisons of gCAnno with Scmap-Cluster, Scmap-Cell, scPred, Chetah and SVM on cross platform datasets. a Liver datasets, where the reference is mCel-seq2 and the query is 10x; b liver datasets, where the reference is 10x and the query is mCel-seq2; c pancreas datasets, where the reference is Drop-seq and the query is Smart-seq2; d pancreas datasets, where the reference is Smart-seq2 and the query is Drop-seq. The reference is the training data and the query is the testing data

Simulated dropout and imbalance datasets
To evaluate the robustness of gCAnno in the presence of dropout noise, we simulated different dropout rates in the four datasets above (Table 1) by setting the expression level of a random gene subset (10, 20, 30, 40 and 50% of all genes) to zero (Additional file 2: File S1). As before, we used five-fold cross-validation to evaluate performance. In each validation, we simulated the dropout noise in either the training group (reference dropout) or the test group (query dropout), and calculated the kappa coefficient for each method.
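The dropout simulation can be sketched as follows. The text does not say whether the random gene subset is drawn once for the whole group or separately for each cell; this sketch assumes a single subset zeroed across all cells, and the function name `simulate_dropout`, the seed and the toy matrix are illustrative.

```python
import random


def simulate_dropout(expr, rate, seed=0):
    """Zero out a random subset of genes in a cells-by-genes matrix.

    Returns a new matrix and the set of dropped gene indices; the input
    matrix is left unchanged.
    """
    rng = random.Random(seed)
    n_genes = len(expr[0])
    n_drop = round(rate * n_genes)
    # Assumption: one gene subset is dropped in every cell of the group.
    dropped = set(rng.sample(range(n_genes), n_drop))
    noisy = [[0.0 if j in dropped else v for j, v in enumerate(row)]
             for row in expr]
    return noisy, dropped


# Toy matrix: three cells, ten genes, all values non-zero.
expr = [[float(i + j + 1) for j in range(10)] for i in range(3)]
noisy, dropped = simulate_dropout(expr, rate=0.5)
```

Applying the function to the training group gives the reference dropout setting, and applying it to the test group gives the query dropout setting.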

Cross platform datasets
To compare cross platform performance (across studies using different sequencing platforms), we searched for and identified four datasets suitable for this purpose, including two liver datasets from the 10x and mCel-seq2 platforms and two pancreas datasets from the Drop-seq and Smart-seq2 platforms (Table 2). We noticed that the cell type annotation labels of the same tissue from different platforms are not identical. Thus, we unified the labels by removing cell types absent from either of the datasets (Additional file 7: Figure S4; Additional file 8: Table S3; Additional file 2: File S1).
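The label unification step, keeping only cell types present in both datasets, can be sketched as below. The sketch assumes that the label strings of the two datasets already use the same spelling for the same cell type; `unify_labels` and the toy labels are hypothetical.

```python
def unify_labels(ref_cells, ref_labels, query_cells, query_labels):
    """Drop cells whose cell type is absent from the other dataset."""
    shared = set(ref_labels) & set(query_labels)
    ref = [(c, l) for c, l in zip(ref_cells, ref_labels) if l in shared]
    query = [(c, l) for c, l in zip(query_cells, query_labels) if l in shared]
    return ref, query, shared


# Toy example with cell identifiers in place of expression vectors.
ref_cells = ["r1", "r2", "r3"]
ref_labels = ["alpha", "beta", "ductal"]
query_cells = ["q1", "q2", "q3"]
query_labels = ["alpha", "beta", "acinar"]
ref, query, shared = unify_labels(ref_cells, ref_labels,
                                  query_cells, query_labels)
```

In this toy case the "ductal" and "acinar" cells are removed because each type appears in only one of the two datasets.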