In this paper, we propose a novel method to resolve the issue of methylation identification, as shown in Fig. 1. First, we extracted sequence features, structural features and positional features of known CpG sites. Then, we applied the LightGBM model to train the classifier for each cell and also adjusted the model parameters to get the best performance. Finally, we used our trained model, called LightCpG, to predict the methylation states of unknown CpG sites.

### Dataset

We downloaded two benchmark datasets Homo sapiens GM12878 (ENCFF001TLS) and heart left ventricle (ENCFF001TKC), which were extracted by reduced representation bisulfite sequencing (RRBS) from ENCODE [37, 38]. Single-cell triple omics sequencing (scTrio-seq) is a sparse single-cell CpG profile. We downloaded two datasets of scTrio-seq profiled cells, 26 human hepatocellular carcinoma cells (HCCs) cells and six human heptoplastoma-derived (HepG2) cells, from the gene expression omnibus (GSE65364). Based on the study by Hou et al. [34], the Ca26 was excluded because the distribution of methylation state is seriously abnormal, so there were only 25 cells in the HCCs cell dataset. Every position of CpG sites was mapped to hg19 by using the liftOver tool (http://www.genome.ucsc.edu/cgi-bin/hgLiftOver) from the UCSC Genome Browser [39]. In this paper, we examined these sites as research objects, which were covered by at least four reads.

Inspired by the DeepCpG [35], we adopted the same validation method for all datasets. In the experimental part, the CpG sites in the training set were from chromosomes 1, 3, 5, 7, 9, and 11, and those in the test set were from chromosomes 2, 4, 6, 8, 10, and 12, and finally those in the validation set were from chromosomes 13, 14, 15, 16, 17, 18, and 19. All datasets are described in detail in Additional file 1 and Additional file 2.

### Feature extraction

#### Structural feature

In the mammalian genome, the CGIs are specific regions where the density of unmethylated CpG sites is greater compared to other regions. Work by Zhang et al. [1] demonstrated that the methylation level in CGIs is below 50%, the methylation level in CGIs shores ranges between 20% and 80% and the methylation level in CGIs shelves is much higher compared to the average level. Therefore, all samples in this paper were assigned three binary features. Specifically, the value of CGIs feature was 1 if the sample sites were within CGIs regions, otherwise it was 0. We applied the same principles to the CGIs shore feature and the CGIs shelf feature. These parts of the data were downloaded from the UCSC genome browser [40].

Studies indicate that CGIs have been shown to co-localize with the DNA regulatory elements, including TFBS, histone modification marks, chromatin states and DNase I hypersensitive sites (DHSs) [1, 41]. Moreover, many studies have found that the methylation states of CGIs are closely related to TFBSs [42–44]. DNA methylation and histone modifications are involved in regulating gene repression patterns during cell development as demonstrated by traditional experiments [45]. Some studies found that chromatin modification and DNA methylation are mutually dependent in the aspect of gene regulation [46]. Moreover, DHSs are linked with a strong enrichment of CpG methylation [47, 48]. These feature data were downloaded from the ENCODE [49]. All of the above-referenced features were binary.

Importantly, DNA methylation, an important epigenetic modification, is one of the major mechanisms regulating gene expression during cell differentiation. Although it does not change the genetic sequence, it can be inherited by offspring. Therefore, we used the integrated haplotype score (iHS) http://hgdp.uchicago.edu/Browser_tracks/iHS and the GERP++ constraint score on hg19 [50] http://mendel.stanford.edu/SidowLab/downloads/gerp/ to recognize the CpG sites in the DNA sequence.

Overall, we obtained 175 structural features for each CpG site, including 144 specific TFBSs, 15 chromatin states, 10 histone modification marks, CGIs, CGIs shores and shelves, DHSs, iHSs, and constraint scores.

#### Sequence feature

According to the position of the CpG sites in the raw data files, we extracted the sequence from the reference hg19, including the extracted DNA sequence of 101 bp with 50 bp before and 50 bp after the CpG site. The DNA primary sequence is composed of adenine (A), thymine (T), cytosine (C) and guanine (G). Some studies [51] suggest that the primary sequence composition is critical for the methylation recognition.

In this paper, we use DNA sequence information extracted by *n*-gram to identify the CpG sites. Each feature of *n*-gram can be denoted as a pair of value (*v*_{i},*f*_{i}), where *v*_{i} is one feature that can be recorded as a combination of *n* nucleotides and *f*_{i} represents the frequency of *v*_{i} in the DNA sequence. The equation for *f*_{i} is shown below:

$$ f_{i} = \frac{N(v_{i})}{L-(n-1)}, $$

(1)

where *N*(*v*_{i}) represents the number of *v*_{i} in the DNA sequence; *L* represents the length of the DNA sequence and *n* represents the number of nucleotides in the *v*_{i}.

In this paper, we use 1/2/3-gram to represent one DNA sequence, denoted as *v*_{i}∈{*A,C*,*T,G*}∪{*A**A,A**C*,…,*G**T,G**G*}∪{*AA**A,A**AC*,…,*GG**T,G**GG*}. Finally, we used 84 sequence features to identify the CpG sites

#### Positional feature

In this section, we extracted positional features by using the information on the CpG states and the distance between the adjacent CpG sites. We proposed a novel skip-*k* method to analyze the correlation between two adjacent CpG sites for DNA methylation recognition. The skip-*k* method is composed of two parts: skip- *k*_{1} and skip- *k*_{2}. The sketch map of skip-*k* is shown in Fig. 2.

The correlation between two adjacent CpG sites can be verified by skip- *k*_{1}. The method skip- *k*_{1} separately extracts one CpG site at the *k*_{1}-th distance upstream and downstream of the target CpG site, and then extracts the states of these two CpG sites and calculates the distances between them and the target CpG site. These two states and two distances are used to predict the target CpG sites. Next, we analyzed the change in prediction accuracy in the different *k*_{1} values. Specifically, we employed two benchmark datasets (GM12878 and heart left ventricle) to discuss the skip- *k*_{1} method. The experimental results demonstrate that the correlation between two CpG sites became weaker as the distance between them increased.

The correlation between CpG sites in a specific window can be analyzed by skip- *k*_{2}. The skip- *k*_{2} method separately extracts the *k*_{2} nearest the CpG sites from upstream and downstream of the target CpG site, and then separately extracts the states of these CpG sites and calculates the distances between them and the target CpG site. These 2*k*_{2} states and 2*k*_{2} distances are used to distinguish the state of the target CpG site. We analyzed the change in prediction accuracy with the different windows. Next, we used the 25 cells of the human HCCs dataset to analyze the skip- *k*_{2} method. The experimental results demonstrate that the prediction accuracy became smaller as the length of window increased.

Therefore, in this paper, both the *k*_{1} and *k*_{2} values were set as 1. We separately extracted one CpG site in the vicinity (upstream and downstream) of the target CpG site and then extracted the states of these two CpG sites and calculated the distances between them and the target CpG site. These four features were used to predict the CpG sites. For multiple human cell line cells, Christof et al. [35] established a bidirectional GRU model that builds the connection between multiple cells using the window length *k*_{2}=25. This model achieved excellent performance based on five datasets. However, in the modeling process, the model only considered the methylation states and the distances of adjacent sites in the window for each cell, ignoring some information about the methylation states of the same sites in different cells.

In this section, we extracted the features of the same CpG sites in other cells. In different cells, some of the same CpG sites have unknown methylation states. We solved the problem of feature extraction at the same CpG sites in different cells. Assuming that there are *m* cells in the data, we defined a set *G*_{i} that contains *n* CpG sites on each chromosome in the *i*-th cell, denoted as follows:

$$ \begin{aligned} G_{i} &= \left\{\left(p_{i}^{1},s_{i}^{1}\right),\left(p_{i}^{2},s_{i}^{2}\right),\ldots,\left(p_{i}^{c},s_{i}^{c}\right),\ldots\left(p_{i}^{n},s_{i}^{n}\right)\right\}; \\& \quad i=1,2,\ldots,m \end{aligned} $$

(2)

where \(p_{i}^{c}\) and \(s_{i}^{c}\) represent the position and methylation state of the *c*-th CpG site in the *i*-th cell, respectively.

To maximize the features for the same sites in different cells, we use *G*_{i} and *G*_{j} to establish the following feature extraction method. If one CpG site existed in the *j*-th cell satisfying \(p_{i}^{c}=p_{j}^{l}\), we denoted \(\mathcal {F}_{i,j}^{c}\) as the distance and the methylation states of the nearest CpG sites on both sides of the *l*-th CpG site in the *j*-th cell, as shown below:

$$ \mathcal{F}_{i,j}^{c} = \left\{P_{i,j}^{l-1},S_{i,j}^{l-1},P_{i,j}^{l+1},S_{i,j}^{l+1}\right\} $$

(3)

where \(P_{i,j}^{l-1}=p_{j}^{l-1}-p_{i}^{c}, P_{i,j}^{l+1}=p_{j}^{l+1}-p_{i}^{c}, S_{i,j}^{l-1}=s_{j}^{l-1}\) and \(S_{i,j}^{l+1}=s_{j}^{l+1}\).

If one CpG site is unknown methylation status in the *j*-th cell, we selected two neighboring CpG sites in the *j*-th cell satisfying \(p_{j}^{l} < p_{i}^{c} < p_{j}^{l+1}\), following which \(\mathcal {F}_{i,j}^{c}\) was denoted as shown below:

$$ \mathcal{F}_{i,j}^{c} = \left\{P_{i,j}^{l},S_{i,j}^{l},P_{i,j}^{l+1},S_{i,j}^{l+1}\right\} $$

(4)

where \(P_{i,j}^{l}=p_{j}^{l}-p_{i}^{c}, P_{i,j}^{l+1}=p_{j}^{l+1}-p_{i}^{c}, S_{i,j}^{l}=s_{j}^{l}\) and \(S_{i,j}^{l+1}=s_{j}^{l+1}\).

Furthermore, we represented the features for the same CpG sites in different cells and established the following feature extraction method. If one CpG site existed in the *j*-th cell satisfying \(p_{i}^{c}=p_{j}^{l}\) and *i*≠*j*, we denoted \(\mathcal {D}_{i,j}^{c}\) as the distance and the methylation states of the nearest CpG sites on both sides of the (*l*−1)-th and (*l*+1)-th CpG sites in the *j*-th cell, as shown below:

$$ \begin{aligned} \mathcal{D}_{i,j}^{c} = &\left\{P_{j,j}^{(l-2)(l-1)},S_{j,j}^{l-2},P_{j,j}^{(l-1)(l)},S_{j,j}^{l}\right\}\\ \cup&\left\{P_{j,j}^{(l)(l+1)},S_{j,j}^{l},P_{j,j}^{(l+1)(l+2)},S_{j,j}^{l+2}\right\} \end{aligned} $$

(5)

where \(P_{j,j}^{(l-2)(l-1)}=p_{j}^{l-1}-p_{j}^{l-2}, S_{j,j}^{l-2}=s_{j}^{l-2}, P_{j,j}^{(l-1)(l)}=p_{j}^{l}-p_{j}^{l-1}, S_{j,j}^{l}=s_{j}^{l}, P_{j,j}^{(l)(l+1)}=p_{j}^{l+1}-p_{j}^{l}\) and \(P_{j,j}^{(l+1)(l+2)}=p_{j}^{l+2}-p_{j}^{l+1}, S_{j,j}^{l+2}=s_{j}^{l+2}\).

These features included the methylation states of the same sites in different cells. If one CpG site is unknown methylation status in the *j*-th cell, we selected two neighboring CpG sites in the *j*-th cell satisfying \(p_{j}^{l} < p_{i}^{c} < p_{j}^{l+1}\), and then \(\mathcal {D}_{i,j}^{c}\) was denoted as shown below:

$$ \begin{aligned} \mathcal{D}_{i,j}^{c} = &\left\{P_{j,j}^{(l-1)(l)},S_{j,j}^{l-1},P_{j,j}^{(l)(l+1)},S_{j,j}^{l+1}\right\}\\ \cup&\left\{P_{j,j}^{(l)(l+1)},S_{j,j}^{l},P_{j,j}^{(l+1)(l+2)},S_{j,j}^{l+2}\right\} \end{aligned} $$

(6)

where \(P_{j,j}^{(l-1)(l)}=p_{j}^{l}-p_{j}^{l-1}, S_{j,j}^{l-1}=s_{j}^{l-1}, P_{j,j}^{(l)(l+1)}=p_{j}^{l+1}-p_{j}^{l}, S_{j,j}^{l+1}=s_{j}^{l+1}, S_{j,j}^{l}=s_{j}^{l}, P_{j,j}^{(l+1)(l+2)}=p_{j}^{l+2}-p_{j}^{l+1}\) and \(S_{j,j}^{l+2}=s_{j}^{l+2}\).

If \(i=j, \mathcal {F}_{i,i}^{c}\) are positional features in the same cell. \(\mathcal {D}_{i,i}^{c} = \left \{P_{i,i}^{l-2},S_{i,i}^{l-2},P_{i,i}^{l},S_{i,i}^{l},P_{i,i}^{l},S_{i,i}^{l},P_{i,i}^{l+2},S_{i,i}^{l+2}\right \}\), which contains the methylation state of the \(p_{i}^{c}\) site. Therefore, \(\mathcal {D}_{i,i}^{c}\) must avoid the condition of *i*=*j*.

Finally, we extracted 4*m*+8(*m*−1) features to solve the problem of multi-cell methylation identification. The sketch maps of feature \(\mathcal {F}\) and feature \(\mathcal {D}\) are shown in Fig. 3. All features are described in detail in Additional file 3.

### LightGBM model

In this paper, we used the LightGBM method [36] to distinguish the states of various CpG sites. The model uses a novel gradient boosting decision tree (GBDT) algorithm, including gradient-based one-side sampling (GOSS) to extract relatively small number of samples according to gradient values and exclusive feature bundling (EFB) to reduce the number of features. The GOSS and EFB approaches are described in Fig. 4.

#### Sample selection

For the GOSS approach, we first sorted all the data samples according to the gradient values. Then, the top *a**%* samples were drawn and *b**%* of the remaining samples were randomly selected. Finally, the two parts were combined for further analysis. Following the GOSS process, the dataset samples were reduced.

#### Feature merging

The EFB method was mainly used to handle sparse feature sets. First, we constructed a graph based on the number of non-zero values in each feature. The feature *F*_{i} was marked as a node in the graph and the connection weight of two features *F*_{i} and *F*_{j} was calculated as shown below:

$$ {w}_{i,j}= \frac{N_{i,j}}{L} $$

(7)

where *L* represents the number of samples, and *N*_{i,j} represents the total number of *F*_{i} and *F*_{j} not equal to 0 at the same time.

Next, we sorted features according to the degree value in the graph, and feature with the largest degree value was used as an initial set. We defined *d* as the maximum conflict value for a feature set. Each feature was included in the existing set when the total connection weight was less than *d*. Otherwise, this feature was included as a new set when the total connection weight in all sets was greater than *d*.

Finally, we obtained a composite feature for each set using the histogram method, including multiple feature information. As the number of features decreased, the time complexity of the training process also decreased.

#### Training model

The input training set was recorded as *D* = {(*x*_{1},*y*_{1}),(*x*_{2},*y*_{2}),…,(*x*_{m},*y*_{m})}, where *m* is the number of samples, *x*_{m} is the features of *m*-th sample and *y*_{m} is the real output of *m*-th sample. We defined the number of trees as *T*. The loss function was denoted as *L*(*y,c*), where *y* represents the expected output and *c* is the real output.

First, we used a weak classifier *f*_{0}(*x*). Before training the *t*-th tree, the gradient of each sample was calculated separately as shown below:

$$ g_{t}(x_{i}) = -\left[\frac{\partial{L(y_{i},f(x_{i}))}}{\partial{f(x_{i})}}\right]_{f(x)=F_{t-1}(x)} ~~~i=1,2\ldots{m} $$

(8)

where *g*_{t}(*x*_{i}) represents the gradient value of the *i*-th sample inputted into the *t*-th tree and *F*_{t−1}(*x*) represents the strong classifier of the linear combination of *t*-1 weak classifiers. Gradient values were used to train the *t*-th tree, and the learning model equation is denoted as follows:

$$ w^{\ast}_{t} = \arg \min \limits_{w}\sum\limits_{i=1}^{m}L(g_{t}(x_{i}),h_{t}(x_{i},w_{t})) $$

(9)

where *w*_{t} represents the parameter in the training tree process, and \(w^{\ast }_{t}\) represents the value of the *w*_{t} when the loss function takes the minimum.

Next, we solve the coefficient as shown below:

$$ {\rho_{t}}^{\ast} = \arg \min \limits_{\rho_{t}} \sum\limits_{i=1}^{m}L\left(y_{i},F_{t-1}(x_{i})+{\rho_{t}}h_{t}\left(x_{i},w^{\ast}_{t}\right)\right)^{2} $$

(10)

If *f*_{t}(*x*)=*ρ*_{t}^{∗}*h*_{t}(*x*_{i},*w*^{∗}), the model becomes updated as shown below:

$$ F_{t}(x)=F_{t-1}(x)+f_{t}(x) $$

(11)

Finally, the output model was summarized as follows:

$$ F(x)=f_{0}(x)+\sum\limits_{t=1}^{T} f_{t}(x) $$

(12)

In this work, we examined whether a CpG site was covered by the reads which were mostly methylated in the raw files. If this was true, then the CpG site was considered to be a methylation site and the state of site was denoted as 1. If a CpG site was covered by the reads which are mostly unmethylated, then the site was considered as unmethylation site and the state of CpG site was denoted as 0. Therefore, the methylation recognition is a dual classification problem. Our method utilized (259+4*m*+8(*m*−1))-*d* features to train the LightGBM classifier and predict the states of new CpG sites.