LightCpG: a multi-view CpG sites detection on single-cell whole genome sequence data

Background DNA methylation plays an important role in multiple biological processes that are closely related to human health. The study of DNA methylation can provide an insight into the mechanism behind human health and can also have a positive effect on the assessment of human health status. However, the available sequencing technology is limited by incomplete CpG coverage. Therefore, it is crucial to discover an efficient and convenient method capable of distinguishing between the states of CpG sites. Previous studies focused on identifying methylation states of the CpG sites in single cell, which only evaluated sequence information or structural information. Results In this paper, we propose a novel model, LightCpG, which combines the positional features with the sequence and structural features to provide information on the CpG sites at two stages. Next, we used the LightGBM model for training of the CpG site identification, and further utilized sample extraction and merged features to reduce the training time. Our results indicate that our method achieves outstanding performance in recognition of DNA methylation. The average AUC values of our method using the 25 human hepatocellular carcinoma cells (HCC) cell datasets and six human heptoplastoma-derived (HepG2) cell datasets were 0.9616 and 0.9213, respectively. Moreover, the average training times for our method on the HCC and HepG2 datasets were 8.3 and 5.06 s, respectively. Furthermore, the computational complexity of our model was much lower compared with other available methods that detect methylation states of the CpG sites. Conclusions In summary, LightCpG is an accurate model for identifying the DNA methylation status of CpG sites in single cells. Furthermore, three types of feature extraction methods and two strategies used in LightCpG are helpful for other prediction problems. Electronic supplementary material The online version of this article (10.1186/s12864-019-5654-9) contains supplementary material, which is available to authorized users.


Background
DNA methylation is a topic of much debate in the epigenetic world, but understanding of DNA methylation has great room to upgrade [1,2]. One of the most common ways for identifying DNA methylation is identifying the cytosine-5 methylation within the CpG dinucleotides [3]. DNA methylation can affect the functional state of regulatory regions and affect DNA replication and gene transcription. These functions are closely related to many human diseases, including malignant tumors, efficient computational methods to identify DNA methylation is very important and is also critical to making methylation predictions more reliable [12].
The characteristics of methylation sites are inevitably digitized when using computational methods that identify them. Many previous studies [13][14][15][16] have demonstrated that the sequence of neighboring nucleotides of one methylation site is specific and that the methylation state is closely related to the sequence information, which allows for the prediction of the methylation state only based on the sequence composition. The Methylator method [14] proposed by Bhasin et al. used conventional binary sparse encoding to directly convert sequences into a feature vector. The method described by Das et al. [17] involves extracting a sequence with the window size of 800 bp, counting the methylation propensity, and using the principal component analysis (PCA) with recursive feature elimination for feature selection. Recently, Pan et al. [18] employed an n-gram, multivariate mutual information [19], Discrete Wavelet Transform [20] and Pseudo Amino Acid Composition [21] to extract DNA sequence features with a window size of 100 bp.
With the discovery of various biological processes, methylation is found to be closely related to many proteins [22,23]. Therefore, the structural information of the protein can be used for the identification and better profiling of methylation sites. Structural features discussed by Bock et al. [24] included the frequency and distribution of CpG islands (CGIs), exon distribution, transcription factor binding sites (TFBS), and single nucleotide polymorphisms (SNP), with a total of 918 features representing the properties of the CpG sites. Zhang et al. [25] extracted a total of 841 features including histone modification features and then used PCA to select features for downstream CpG site identification. Fan et al. [26] extracted four histone methylation marks to identify CpG sites. Zhang et al. [1] extracted genomic positional features, neighbor features, sequence properties, and cisregulatory elements to identify CpG sites. Saif et al. [27] identified highly methylated regions using promoter region information.
Following the feature extraction at CpG sites, it is important to select an appropriate model for CpG site identification. Most of the previous methods used support vector machine (SVM) as the classification model, which resulted in an excellent performance and creation of tools such as The Methylator [14] and HDFINDER [17]. Moreover, additional methods using random forest (RF) [28] have also achieved excellent results. Furthermore, the method by Pan et al. used sparse Bayesian learning model [29]and also achieved good performance results. Therefore, the selection of the appropriate classifier can directly affect the model performance.
In the methods discussed above, the performance of single-cell methylation state prediction can be affected by the density of the sites measured in the dataset. Recently, researchers have developed several singlecell DNA methylation group sequencing methods, such as single-cell bisulfite sequencing (scBS-seq) [30] and single-cell reduced-representation bisulfite sequencing (scRRBS-seq). Smallwood et al. [31] discovered that the CpG coverage of the scBS-seq method is only 20 − 40% and that of the scRRBS-seq method is only 1 − 10% [32][33][34]. It is important to note that the decrease in coverage may result in a loss of information. Therefore, the key focus is to determine the state of the missing CpG sites in the entire genome. The methods cited above, which use sequence and structural features can only resolve methylation state prediction at different sites within a single cell and cannot account for associations between multiple cells. Therefore, these methods are not suitable for the examination of methylation states in multiple cells. The DeepCpG model, proposed by Christof et al. [35], used 25 CpG sites upstream and downstream of different sites in different cells, and used the site state, distance between each site and target site as features. This method allowed for the connection between various cells through the use of the deep learning model gated recurrent network (GRU), and also extracted features from the DNA sequence by convolutional neural network (CNN) and a fully connected hidden layer. Next, the use of the Deep-CpG fully connected the deep learning to identify CpG sites and achieved an impeccable accuracy. However, the DeepCpG model utilizes a large amount of time during the training process.
Inspired by the DeepCpG model, we posit that some of the same CpG sites with unknown methylation states can be detected in multiple cells, and that the states of these sites can vary between different cells. We extracted the CpG site information as novel positional features to build the model. Importantly, we used three-part feature approach (sequence features, structural features, and novel positional features) to identify the multi-cell CpG sites. Moreover, we produced the sparse binary features, such as most of the structural features and half of the positional features. Finally, we constructed the CpG recognition model using the LightGBM model [36]. Experiments demonstrate that our method can predict the states of missing CpG sites in multiple cells with high precision and efficiency.

Methods
In this paper, we propose a novel method to resolve the issue of methylation identification, as shown in Fig. 1. First, we extracted sequence features, structural features and positional features of known CpG sites. Then, we applied the LightGBM model to train the classifier for Fig. 1 The flow chart of LightCpG. CpG profiles are obtained from scTrio-seq. Dataset includes multiple single-cell CpG profiles. Feature extraction: positional feature includes methylation state and the distance between the sites; structural feature includes CpG islands (CGIs) status (CGIs , CGIs shore, CGIs shelf), cis-regulatory elements (TFBS, DNase, chromatin states, histone modification), and DNA properties (integrated haplotype score (iHS), constrain score); sequence feature includes 84 dimension features that are extracted using DNA sequence and n-gram method. Training: LightGBM is used to construct a model for each single-cell CpG data; sample selection is used to reduce the number of samples; feature merging is used to reduce the number of features. Testing: the trained LightCpG model can be used for prediction of the new CpG sites each cell and also adjusted the model parameters to get the best performance. Finally, we used our trained model, called LightCpG, to predict the methylation states of unknown CpG sites.

Dataset
We downloaded two benchmark datasets Homo sapiens GM12878 (ENCFF001TLS) and heart left ventricle (ENCFF001TKC), which were extracted by reduced representation bisulfite sequencing (RRBS) from ENCODE [37,38]. Single-cell triple omics sequencing (scTrio-seq) is a sparse single-cell CpG profile. We downloaded two datasets of scTrio-seq profiled cells, 26 human hepatocellular carcinoma cells (HCCs) cells and six human heptoplastoma-derived (HepG2) cells, from the gene expression omnibus (GSE65364). Based on the study by Hou et al. [34], the Ca26 was excluded because the distribution of methylation state is seriously abnormal, so there were only 25 cells in the HCCs cell dataset. Every position of CpG sites was mapped to hg19 by using the liftOver tool (http://www.genome.ucsc.edu/cgi-bin/ hgLiftOver) from the UCSC Genome Browser [39]. In this paper, we examined these sites as research objects, which were covered by at least four reads.
Inspired by the DeepCpG [35], we adopted the same validation method for all datasets. In the experimental part, the CpG sites in the training set were from chromosomes 1, 3, 5, 7, 9, and 11, and those in the test set were from chromosomes 2, 4, 6, 8, 10, and 12, and finally those in the validation set were from chromosomes 13, 14, 15, 16, 17, 18, and 19. All datasets are described in detail in Additional file 1 and Additional file 2.

Structural feature
In the mammalian genome, the CGIs are specific regions where the density of unmethylated CpG sites is greater compared to other regions. Work by Zhang et al. [1] demonstrated that the methylation level in CGIs is below 50%, the methylation level in CGIs shores ranges between 20% and 80% and the methylation level in CGIs shelves is much higher compared to the average level. Therefore, all samples in this paper were assigned three binary features. Specifically, the value of CGIs feature was 1 if the sample sites were within CGIs regions, otherwise it was 0. We applied the same principles to the CGIs shore feature and the CGIs shelf feature. These parts of the data were downloaded from the UCSC genome browser [40].
Studies indicate that CGIs have been shown to colocalize with the DNA regulatory elements, including TFBS, histone modification marks, chromatin states and DNase I hypersensitive sites (DHSs) [1,41]. Moreover, many studies have found that the methylation states of CGIs are closely related to TFBSs [42][43][44]. DNA methylation and histone modifications are involved in regulating gene repression patterns during cell development as demonstrated by traditional experiments [45]. Some studies found that chromatin modification and DNA methylation are mutually dependent in the aspect of gene regulation [46]. Moreover, DHSs are linked with a strong enrichment of CpG methylation [47,48]. These feature data were downloaded from the ENCODE [49]. All of the above-referenced features were binary. Importantly, DNA methylation, an important epigenetic modification, is one of the major mechanisms regulating gene expression during cell differentiation. Although it does not change the genetic sequence, it can be inherited by offspring. Therefore, we used the integrated haplotype score (iHS) http://hgdp.uchicago. edu/Browser_tracks/iHS and the GERP++ constraint score on hg19 [50] http://mendel.stanford.edu/SidowLab/ downloads/gerp/ to recognize the CpG sites in the DNA sequence.

Sequence feature
According to the position of the CpG sites in the raw data files, we extracted the sequence from the reference hg19, including the extracted DNA sequence of 101 bp with 50 bp before and 50 bp after the CpG site. The DNA primary sequence is composed of adenine (A), thymine (T), cytosine (C) and guanine (G). Some studies [51] suggest that the primary sequence composition is critical for the methylation recognition.
In this paper, we use DNA sequence information extracted by n-gram to identify the CpG sites. Each feature of n-gram can be denoted as a pair of value (v i , f i ), where v i is one feature that can be recorded as a combination of n nucleotides and f i represents the frequency of v i in the DNA sequence. The equation for f i is shown below: where N(v i ) represents the number of v i in the DNA sequence; L represents the length of the DNA sequence and n represents the number of nucleotides in the v i . In this paper, we use 1/2/3-gram to represent one DNA sequence, denoted as Finally, we used 84 sequence features to identify the CpG sites

Positional feature
In this section, we extracted positional features by using the information on the CpG states and the distance between the adjacent CpG sites. We proposed a novel skip-k method to analyze the correlation between two adjacent CpG sites for DNA methylation recognition. The skip-k method is composed of two parts: skip-k 1 and skip-k 2 . The sketch map of skip-k is shown in Fig. 2.
The correlation between two adjacent CpG sites can be verified by skip-k 1 . The method skip-k 1 separately extracts one CpG site at the k 1 -th distance upstream and downstream of the target CpG site, and then extracts the states of these two CpG sites and calculates the distances between them and the target CpG site. These two states and two distances are used to predict the target CpG sites. Next, we analyzed the change in prediction accuracy in the different k 1 values. Specifically, we employed two benchmark datasets (GM12878 and heart left ventricle) to discuss the skip-k 1 method. The experimental results demonstrate that the correlation between two CpG sites became weaker as the distance between them increased.
The correlation between CpG sites in a specific window can be analyzed by skip-k 2 . The skip-k 2 method separately extracts the k 2 nearest the CpG sites from upstream and downstream of the target CpG site, and then separately extracts the states of these CpG sites and calculates the distances between them and the target CpG site. These 2k 2 states and 2k 2 distances are used to distinguish the state of the target CpG site. We analyzed the change in prediction accuracy with the different windows. Next, we used the 25 cells of the human HCCs dataset to analyze the skip-k 2 method. The experimental results demonstrate that the prediction accuracy became smaller as the length of window increased. Therefore, in this paper, both the k 1 and k 2 values were set as 1. We separately extracted one CpG site in the vicinity (upstream and downstream) of the target CpG site and then extracted the states of these two CpG sites and calculated the distances between them and the target CpG site. These four features were used to predict the CpG sites. For multiple human cell line cells, Christof et al. [35] established a bidirectional GRU model that builds the connection between multiple cells using the window length k 2 = 25. This model achieved excellent performance based on five datasets. However, in the modeling process, the model only considered the methylation states and the distances of adjacent sites in the window for each cell, ignoring some information about the methylation states of the same sites in different cells.
In this section, we extracted the features of the same CpG sites in other cells. In different cells, some of the same CpG sites have unknown methylation states. We solved the problem of feature extraction at the same CpG sites in different cells. Assuming that there are m cells in the data, we defined a set G i that contains n CpG sites on each chromosome in the i-th cell, denoted as follows: where p c i and s c i represent the position and methylation state of the c-th CpG site in the i-th cell, respectively.
To maximize the features for the same sites in different cells, we use G i and G j to establish the following feature extraction method. If one CpG site existed in the j-th cell satisfying p c i = p l j , we denoted F c i,j as the distance and the methylation states of the nearest CpG sites on both sides of the l-th CpG site in the j-th cell, as shown below: where If one CpG site is unknown methylation status in the jth cell, we selected two neighboring CpG sites in the j-th cell satisfying p l j < p c i < p l+1 j , following which F c i,j was denoted as shown below: where Furthermore, we represented the features for the same CpG sites in different cells and established the following feature extraction method. If one CpG site existed in the j-th cell satisfying p c i = p l j and i = j, we denoted D c i,j as the distance and the methylation states of the nearest CpG sites on both sides of the (l −1)-th and (l +1)-th CpG sites in the j-th cell, as shown below: where P These features included the methylation states of the same sites in different cells. If one CpG site is unknown methylation status in the j-th cell, we selected two neighboring CpG sites in the j-th cell satisfying p l j < p c i < p l+1 j , and then D c i,j was denoted as shown below: where , which contains the methylation state of the p c i site. Therefore, D c i,i must avoid the condition of i = j. Finally, we extracted 4m + 8(m − 1) features to solve the problem of multi-cell methylation identification. The sketch maps of feature F and feature D are shown in Fig. 3. All features are described in detail in Additional file 3.

LightGBM model
In this paper, we used the LightGBM method [36] to distinguish the states of various CpG sites. The model uses a novel gradient boosting decision tree (GBDT) algorithm, including gradient-based one-side sampling (GOSS) to extract relatively small number of samples according to gradient values and exclusive feature bundling (EFB) to reduce the number of features. The GOSS and EFB approaches are described in Fig. 4.

Sample selection
For the GOSS approach, we first sorted all the data samples according to the gradient values. Then, the top a% samples were drawn and b% of the remaining samples were randomly selected. Finally, the two parts were combined for further analysis. Following the GOSS process, the dataset samples were reduced.

Feature merging
The EFB method was mainly used to handle sparse feature sets. First, we constructed a graph based on the number of non-zero values in each feature. The feature F i was marked as a node in the graph and the connection weight of two features F i and F j was calculated as shown below: where L represents the number of samples, and N i,j represents the total number of F i and F j not equal to 0 at the same time. Next, we sorted features according to the degree value in the graph, and feature with the largest degree value was used as an initial set. We defined d as the maximum conflict value for a feature set. Each feature was included in the existing set when the total connection weight was less than d. Otherwise, this feature was included as a new set when the total connection weight in all sets was greater than d.
Finally, we obtained a composite feature for each set using the histogram method, including multiple feature (2) EFB was used to reduce the number of features. First, we bundle multiple sparse features into one set and then combined a set into one feature with the help pf a histogram information. As the number of features decreased, the time complexity of the training process also decreased.

Training model
The input training set was recorded as where m is the number of samples, x m is the features of m-th sample and y m is the real output of m-th sample. We defined the number of trees as T. The loss function was denoted as L(y, c), where y represents the expected output and c is the real output.
First, we used a weak classifier f 0 (x). Before training the t-th tree, the gradient of each sample was calculated separately as shown below: where g t (x i ) represents the gradient value of the i-th sample inputted into the t-th tree and F t−1 (x) represents the strong classifier of the linear combination of t-1 weak classifiers. Gradient values were used to train the t-th tree, and the learning model equation is denoted as follows: where w t represents the parameter in the training tree process, and w * t represents the value of the w t when the loss function takes the minimum.
Next, we solve the coefficient as shown below: , the model becomes updated as shown below: Finally, the output model was summarized as follows: In this work, we examined whether a CpG site was covered by the reads which were mostly methylated in the raw files. If this was true, then the CpG site was considered to be a methylation site and the state of site was denoted as 1. If a CpG site was covered by the reads which are mostly unmethylated, then the site was considered as unmethylation site and the state of CpG site was denoted as 0. Therefore, the methylation recognition is a dual classification problem. Our method utilized (259+4m+8(m−1))-d features to train the LightGBM classifier and predict the states of new CpG sites.

Results
In this section, we analyzed the performance of our method from different aspects. First, we analyzed the skip-k 1 and skip-k 2 methods using three datasets. Second, we examined the effectiveness of our positional feature extraction approach. Third, we analyzed the performance of our feature extraction method via comparison with other two methods. Fourth, we analyzed the importance score of each feature to select the most important features. Fifth, we compared the LightGBM with four other classifiers to demonstrate which one has the most accurate performance. Sixth, we compared our LightCpG model with two other methods of methylation recognition using two databases. Finally, we compared the running time of five machine learning methods, including the LightCpG, to evaluate which method was the most efficient.

Evaluation criteria
To establish the evaluation criteria for the prediction, we took eight mathematical measurements, including Acc, area under curve (AUC), area under the precision-recall curve (AUPR), Fscore, precision, Matthews correlation coefficient (MCC), sensitivity (SE), and specificity (SP). The formulas for each of the parameters are summarized by the following equations: where TP represents the number of the true methylation state, TN represents the number of the true unmethylation state, FP represents the number of false methylation state, and FN represents the number of false unmethylation state. These evaluation indicators are often used to evaluate the performance of the classifier, following which they can quantify the performance from different perspectives. AUC is the area present under the receiver operating characteristic (ROC) curve. The ROC curve is created by plotting true positive rate against false positive rate at various threshold settings. Here, we used AUC as the main criterion. We often encountered unbalanced DNA methylation datasets, which required us to use the AUPR measurement. Specifically, AUPR is the area present under the curve that is created by plotting precision against recall at various threshold settings. When dealing with unbalanced datasets, AUPR was better at evaluating the performance of the classifier.

Analysis of parameter k
In this section, we analyzed the value of parameter k 1 using two datasets (heart left ventricle and GM12878) in single cells. Heart left ventricle dataset was composed of 68,129 methylation sites and 280,683 unmethylation sites for training, with 56,691 methylation sites and 237,758 unmethylation sites available for testing. GM12878 contained 113,059 methylation sites and 276,808 unmethylation sites for training, with 95,348 methylation sites and 233,016 unmethylation sites available for testing. The CpG site was defined as 1 if the site was methylated, otherwise it was assigned 0. We ranged the value of parameter k 1 from 1 to 100 with a step length of 1. We trained the RF model with 500 trees and the results are summarized in Figs. 5 and 6 (details available in Additional file 4. The data indicate that most of the examined mathematical measurements gradually decreased with increasing distance in both datasets. This performance revealed that the correlation between two CpG sites became weaker when the distance between them increased, so k 1 was assigned a value of 1. Moreover, we analyzed the state of the CpG site using its upstream and downstream CpG sites in all cells. We  (Fig. 7 with additional information available in Additional file 5). Data presented in Fig. 7 indicate that the value of AUC decreased with the increasing value of k 2 . In 22 out of the 25 cells, the AUC value for k 2 = 1 was greater than that for k 2 = 2. Therefore, the value of k 2 was set as 1.

Analysis of position feature
To analyze the distribution of CpG sites at identical positions in different cells, we examined chromosomes 1-12   Table 1). The results suggest that the methylation states of the 56.21% CpG sites were known at the same position in at least two cells, while the methylation states of the 14.22% CpG sites were identical at the same position in at least in two cells. Moreover, the 0.63% CpG sites had the same methylation states in specific sites where methylation state was known. For proper analysis, it is important to refer to the methylation states at the same position in different cells. For the 25 cells of the human HCCs dataset, we Table 1 The distribution of CpG sites with known methylation status in different cells on the HCCs dataset  1 Chr represents the chromosome ID 2 N represents the total number of known CpG sites 3 N 1 represents the number of sites at the same position at least in two cells 4 P 1 (%) represents the proportion of N 1 in all sites 5 N 2 represents the number of sites with same states at the same position in at least two cells 6 P 2 (%) represents the proportion of N 2 in all sites 7 N 3 represents the number of sites with the same states in specific sites where methylation state was known 8 P 3 (%) represents the proportion of N 3 in all sites extracted (4×25+8×24) positional features to address the multi-cell methylation identification issue.

Performance of different features
Our method considers sequence features, structural features and positional features. We used RF classifier to analyze the performance of each feature in the 25 cells of the human HCCs dataset ( Table 2). Detailed information on each evaluation criteria can be found in Additional file 6. Based on the data presented in Table 2

Performance of feature selection
Each feature exerts a different impact on the performance of the CpG site recognition. To examine the importance of each feature, we set up an experiment to score each feature using the LightGBM model. The importance score is the sum of the number of times each feature participates in node splitting when building a tree model. If the feature participates in the node splitting many times, the importance score will be higher. We used specific steps in the experiment with the 25 cells of the human HCCs dataset. The first step was to train 25 LightGBM models using the 25 cell training sets and extract the importance score for each feature. In the second step, the importance score of each feature in the 25 cells was added and the final statistical results were presented in Fig. 8. It is critical to note that for our method the positional features were the most important among all available features. In addition, we obtained different features according to the importance score from top 10 to top 551 in 25 cells, and trained the LightGBM model separately. The accuracy of different dimension features is shown in Fig. 9, where the x-axis is the dimension of the features and y-axis is the accuracy of the prediction. The dimension of all features was equal to 551. When the feature dimension was lower than 110, the Acc value increased steadily. In the 110 -115 dimension range, the Acc value for the 25 cells increased rapidly. The five features from the top 110-115 dimensions were four positional features and the frequency of the combination of Adenine and Cytosine (AC) in the DNA sequence. Beyond the top 115 dimensions, the value of Acc continued to increase at a fast rate. When the dimension value reached 320, the accuracy reached the maximum in the 25 cells. Among the top 320 dimensions, we found that there were 233 positional features, eight structural features and 79 sequence features. The eight structural features included the constraint score, CGIs, histone H3 lysine 9 acetylation (H3K9ac), CGIs shelf, CCNT2 (Cyclin T2), iHS, CGIs shore, and HMGN3 (High mobility group nucleosome-binding domain-containing protein 3). The H3K9ac acetylation is very important and it can be easily silenced during DNA methylation. Some studies suggest that hypomethylated DNA is preferentially bound by the HMGN3 protein [52]. With more than 320 dimensions in our model, we determined the accuracy to be stable. The last 232 features included 167 structural features, five sequence features and 60 positional features. Overall, the accuracy of the LightGBM did not decrease with the increasing dimension but remained stable, indicating that LightGBM had a very stable performance.

Comparison of different feature extraction methods
To evaluate the performance of our method, we applied it to the human HCCs and human HepG2 datasets and compared it with the DeepCpG [35], and another method proposed by Zhang et al. [1], where RF was used as a benchmark classifier. The DeepCpG method examined 25 CpG sites upstream and downstream of the known CpG sites and used their methylation states and distances to train the GRU network model. Zhang's method used four aspects to train the RF classifier: genomic position feature, DNA sequence properties, cis-regulatory elements (CREs) and the states and distances between the  1 Seq represents the sequence features 2 Str represents the structural features 3 Pos represents the positional features 4 The boldface is the best value in the column Fig. 8 The importance score for all features on the HCCs dataset neighboring CpG sites. Figure 10 shows the AUC values of different feature extraction methods in the 25 cells in the human HCCs dataset, where the x-axis represents the cell number and the y-axis represents the predicted AUC value. Detailed information on other evaluation indicators in each cell can be found in Additional file 7 and Additional file 8. The AUC value for our method was the highest one among the 22 cells. In the remaining three cells, the difference between the AUC value for our method and that of the DeepCpG method was very small. These results suggest that our feature extraction method is likely more significant compared to the two other methods. To further demonstrate superiority and significance of our method in the feature extraction we analyzed the feature extraction method using the HepG2 dataset. Data presented in Fig. 11 indicate that the AUC value of our Fig. 9 The trends of accuracy on the feature dimension in all cells on the HCCs dataset

Comparison of different classifiers
To explore the performance of the LightGBM classifier, we compared various classifiers using the human HCCs dataset. Since the LightGBM classifier improved the Gradient Boost Decision Tree (GBDT) algorithm in terms of sample selection and feature mergence, we used the GBDT algorithm for comparison. Similarly, the XGBoost classifier [53] is also based on the GBDT algorithm, with additional improvements. Because the reference indicator is completely redefined when the tree leaf nodes split, the significant performance of XGBoost has been shown in many previous studies [54,55]. Then, the GBDT became an integrated tree model. One of its characteristics is the nonparallel training process, which uses the gradient of the previous tree as the input for the next tree. In addition, since RF is a very stable integrated tree model, we also used it as a comparator in our analysis. The DeepCpG model has an outstanding performance due to the use of the deep learning method, we used the Fully Connected Neural Network for comparison as well. The comparative results are summarized in Table 3, where each evaluation indicator represents the average value of all cells. Detailed information on each evaluation indicator for each cell can be found in Additional file 7 and Additional file 8. Based on the data presented in Table 3, in the HCCs dataset, the Acc value of LightGBM was higher compared to other classifiers, improving by at least 0.37%-1.79%. The AUC value of the LightGBM was higher compared to other classifiers, improving by at least 0.0062-0.0214. The Fscore, MCC, SP and SN values of the LightGBM classifier were 84.66%, 78.97%, 93.73% and 86.84%, respectively. The distribution of the evaluation criteria in the HepG2 dataset in each classifier was consistent with the distribution observed in the HCCs dataset.
The experimental results indicate that the LightGBM classifier was more suitable for the most efficient CpG site recognition.

Performance evaluation using different datasets
For the HCCs dataset and the HepG2 dataset, three methods achieved excellent performance. The data presented in Tables 4 and 5 list the values of best evaluation and average evaluation in all cells, respectively. In Table 4, we first select the best performance cell by the highest ACC. Then, we list all the evaluation criteria on the cell. Results presented in Table 4 indicate that the performance of our method was similar to the other two methods. In the HepG2 dataset, the AUC value of our method was 0.9246, while DeepCpG and RF Zhang reached the values of 0.9239 and 0.8954, respectively. Moreover, our method was also better at SP (90.33%). In the analysis of the HCCs dataset, our method also performed better compared to the other two methods. The MCC value of our method reached 82.10%, which was 0.43% higher compared with the DeepCpG method. Moreover, the Acc and SP values were increased by 0.87% and 1.8%, respectively, and the Fscore and AUC values were almost the same as those obtained with the DeepCpG method. Additionally, based on the results presented in

Feasibility Analysis
For large-scale data, the running time is used to evaluate the feasibility of the model. For the RF classifiers, we implemented it using MATLAB scripts and executed it using a Think Station P700 computer. For the GBDT classifiers, we implemented those using Python2.7 scripts and executed them using a Think Station P700 computer.  [28] is an ensemble learning model that uses the idea of bagging and the random selection of features to avoid data over-fitting 2 GBDT [60] is a non-parallel model that uses the gradient from previous tree as the input for the next tree 3 XGBoost [53] is an improved GBDT algorithm. The reference indicator of XGBoost is completely redefined when the tree leaf nodes split 4 LightGBM [36] is based on the GBDT algorithm and employs sample selection and feature mergence to reduce the running time 5 FCNN represents the Fully Connected Neural Network 6 The boldface is the best value in the column In this section, we used the HCCs and the HepG2 datasets to calculate the running time of each classifier in each cell for training and listed the average time consumption of the two datasets in Table 6. The detailed information for each evaluation indicator can be found in Additional file 9. The time consumption of Fully Connected Neural Network using the HCCs dataset was an average of 4138.88 s, and the time consumption of the Fully Connected Neural Network using the HepG2 dataset was an average of 1889.55 s. The time consumption of RF and GBDT in the HCCs dataset was 252 and 20 times longer than that of the LightGBM model, respectively. The time consumption of RF and GBDT in the HepG2 dataset was 88 and 9.4 times longer than that of the LightGBM model, respectively. These results suggest that traditional machine learning method is faster than the Fully Connected Neural Network method, but both methods have similar precision.

Discussion
The sequence features and structural features are predominantly used as the prime features in the methods focusing on the CpG site recognition. The DeepCpG model uses the connection between multiple cells to construct a deep learning model to achieve excellent accuracy. Inspired by this model, we extracted the sequence features and structural features of each site and also considered the same sites in different cells to construct the information vectors of the CpG sites. In addition, two methods of sample extraction and feature mergence were used to reduce feature redundancy and to speed up the training process.
In the beginning, we verified the correlation between the two CpG sites using the heart left ventricle and the GM12878 datasets. The experimental results indicated that as the distance between the two CpG sites increased, the majority of the evaluation indicators gradually decreased in all the datasets. In addition, we verified the correlation between the window of nearest sites and the target site, and extracted k 2 sites  . 12 The distribution of the evaluation values using the HCCs dataset. O represents our method, D represents the DeepCpG method, and Z represents the method of RF Zhang Fig. 13 The distribution of the evaluation values using the HepG2 dataset. O represents our method, D represents the DeepCpG method, and Z represents the method of RF Zhang from the upstream and downstream regions surrounding the target site. These results demonstrated that the AUC value decreased steadily as the window size increased. Next, we calculated the correlation between the CpG sites in the 25 cells of the HCCs dataset and discovered that up to 63% of the CpG site states were known in at least two cells. Moreover, up to 19% of the CpG sites had consistent state in at least two cells. Overall, this demonstrates the effectiveness and interpretability of our positional feature approach.
We then trained the RF model using sequence features, structural features, positional features, and their combinations. Our data showed that the positional features play a major role. In addition, we used RF as a classifier for our features, DeepCpG features, and Zhang's features, and compared the results of the three feature extraction methods. We found that when using the HCCs dataset containing 25 cells, our method achieved the best performance in 22 cells. Using the HepG2 dataset containing six cells, our method achieved the best performance in all of the cells.
Next, we used the LightGBM to rank features based on the importance scores, where we discovered that the sequence features and positional features had a positive effect on the performance of the model. We took 10-551 features with the highest importance score and established the correlation between feature dimension and accuracy. When the dimension of features was more than 320, the accuracy was stable. Therefore, feature mergence in the LightGBM model, which reduced feature redundancy, greatly improved the performance. Among the 320 features, there were eight structural features, including constraint score, CGIs, H3K9ac, CGIs shelf, CCNT2, iHS, CGIs shore and HMGN3. It demonstrates that these regions and proteins close tie with methylation sites.
Using the HCCs and the HepG2 datasets, we applied our features to the LightGBM, RF, XGBoost, GBDT, and the Fully Connected Neural Network model. Experimental results indicated that the LightGBM model performed its best in both the datasets, suggesting that LightGBM performed better in the recognition of methylation.
We then compared the LightCpG, DeepCpG, and Zhang methods using two datasets (HCCs and HepG2), and discovered that our method had excellent performance. In the cells with the best performance in the HCCs dataset, the Acc, MCC, and SP of our method were higher when compared with those of the DeepCpG model. Moreover, the AUC was only 0.0023 lower compared with that of the DeepCpG and the difference in Fscore was only 0.14%. In the average results from the 25 cells, the AUC of our method was 0.9616, which was lower by 0.0073, compared to that of the DeepCpG method, while RF Zhang reached the value of 0.9351. Using the HepG2 dataset, it is observed that the AUC of our method was higher compared with the DeepCpG in the cells with the best performance. On average, our method was 0.0035 lower than that in the DeepCpG in six cells. Based on both the datasets, our method was significantly better when compared with the RF Zhang method in all cells.
Finally, we used the same datasets and the same features to calculate the model training time of RF, GBDT, XGBoost, Fully Connected Neural Network and Light-GBM. The experimental results indicated that Fully Connected Neural Network took the longest, which were average of 4138.88 s and 1889.55 s on the HCCs and HepG2 datasets, respectively. Our method took only average of 8.3 s and 5.06 s on two datasets, respectively. Since the LightGBM model took a long time for feature mergence, it was 3 s longer compared to XGBoost. These data indicate that our method greatly shortened the training time.

Conclusion
In this paper, we presented the LightCpG model capable of distinguishing the CpG sites using the single-cell, wholegenome sequence data. Three types of features (positional feature, sequence feature, and structural feature) were extracted to identify the CpG sites. Two strategies (sample extraction and feature mergence) were used to reduce the training time. A comprehensive series of experiments with supporting data demonstrate that our model has a very effective feature extraction method. Two strategies significantly sped up the training of our model, making it more stable.