Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features

Background DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because of the lack of effective sequence features and the ad hoc choice of learning algorithms to cope with this problem. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. Results The feature importance score distributions in datasets of six species are first reported and analyzed. Then the impact of feature selection on model performance is evaluated by independent testing on benchmark datasets, where ACC and MCC after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using independent tests and 10-fold cross-validations, and our method outperforms them on all datasets, in particular improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. Two detailed case studies confirm the excellent overall performance of the proposed method, which correctly identified 24 of 26 4mC sites from a C.elegans gene and 126 of 137 4mC sites from a D.melanogaster gene. Conclusions The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies prove the effectiveness of our method in practical situations.


Background
As an essential epigenetic modification, DNA base methylation expands the information content of DNA and plays crucial roles in regulating various cellular processes [1][2][3]. Depending on where the methyl group occurs in the DNA sequence, there are several kinds of DNA base methylation, for example 5-methylcytosine (5mC), N6-methyladenine (6mA) and N4-methylcytosine (4mC). 4mC participates in the restriction-modification system that provides a bacterial immune response against invading DNA, as well as in DNA repair, expression and replication [15][16][17]. Compared with the studies of 5mC and 6mA, the biological functions of 4mC are much less investigated due to the lack of sufficient detection methods.
Precisely locating DNA base methylation was a hard problem for a long time. Large-scale detection of DNA 5mC became affordable only after whole-genome bisulfite sequencing and next-generation sequencing techniques were developed [18,19]. Genome-wide detection of 6mA and 4mC became available after single-molecule real-time (SMRT) sequencing technology was introduced [4,20]. Later, a next-generation sequencing method called 4mC-Tet-assisted bisulphite sequencing and another method based on engineered transcription activator-like effectors were developed for 4mC identification [21]. However, these experimental methods are costly and time-consuming. Recently, the rapid development of machine learning algorithms has provided a promising computational approach to classification problems in bioinformatics, and researchers have explored computational methods to identify 4mC sites from DNA sequences.
Collecting data from public SMRT sequencing experiments, Ye et al built the first DNA 6mA and 4mC database, named MethSMRT, covering 156 species [22]. Chen et al [23] proposed an SVM-based prediction model, iDNA4mC, using nucleotide chemical property and sequential nucleotide frequency features. Later, 4mCPred and 4mcPred-SVM were developed to improve the site prediction performance [24,25]. In 4mCPred, He et al used two new features, PSTNP and EIIP, with a simple feature selection. Wei et al built 4mcPred-SVM with four kinds of sequence features and a two-step feature optimization. Recently, other predictors have been developed to identify 4mC sites in DNA sequences for mouse [26,27], Escherichia coli [28], Rosaceae [29] and others [30,31].
The core idea of the previous research is to transform 4mC-containing DNA sequences into various kinds of features as the input of machine learning algorithms. However, these features are not adequate for the prediction methods to achieve excellent performance. Through analysis of the sequence logos, we observe that the characteristics of adjacent nucleotides are potentially essential. Thus we extract contiguous-nucleotide sequence characteristics such as k-nucleotide frequency, k-spectrum nucleotide pair frequency and PseDNC as features to describe the sequences. Besides, two global sequence features, one-hot binary and sequential nucleotide frequency, are also merged into our feature space. As global features carry the complete information of the DNA sequence and local features can highlight specific sequence patterns, the combined feature space is expected to improve the prediction performance.
Since feature selection can reduce the feature space dimension and the modelling complexity [32], the existing 4mC prediction methods, including 4mCPred and 4mCPred_SVM, employed a feature selection scheme based on the F-score and a sequential forward search (SFS) strategy [24,25]. Although the F-score can evaluate feature importance according to the relevance between a feature and the label, the performance of the selected feature subset was still suboptimal. In this paper, we propose an embedded feature selection scheme, in which features are ranked by the feature importance scores derived from the XGBoost classifier training process. Supported by information entropy theory, this feature importance is more meaningful than the F-score. Lower-ranked features are then removed one by one, each round with a cross-validation assessment of the performance of the selected feature subset.
The flowchart of our approach is shown in Fig. 1, where the new sequence feature space and feature selection scheme are depicted for DNA 4mC site prediction. First, the DNA sequence is encoded into five kinds of features, a total of 292 dimensions. Second, an XGBoost machine is trained and the feature importance scores from the training are used to rank all the features. Last, an SVM-based prediction model is built, and the parameters are optimized with 10-fold cross-validation.
In the results section, we analyze the feature importance in our feature space and show that feature selection improves the model performance significantly in the independent test. Besides, we compare the proposed method with three state-of-the-art methods, iDNA4mC, 4mCPred and 4mCPred_SVM, in the independent test and 10-fold cross-validation on benchmark datasets, and the proposed method achieves much better performance. Two detailed case studies of 4mC site prediction on the dlk-1 and DSCAM genes further demonstrate the effectiveness of our approach in practical situations.

Results
This section reports the feature importance scores obtained from the XGBoost machine and analyzes the influence of the feature selection on prediction performance. Then three state-of-the-art predictors are compared with the proposed method in the independent test and 10-fold cross-validation on benchmark datasets. At last, we present results from two case studies which were conducted to identify the 4mC sites in the C.elegans and D.melanogaster genes.

Feature importance analysis
As stated, five types of sequence features are created to constitute a 292-dimensional feature space. Among the 292 dimensions, OHB spans D1 to D164; SNF spans D165 to D205; KNF spans D206 to D225; KSNPF spans D226 to D273; and PseDNC spans D274 to D292. The feature importance scores are obtained from the training process of the XGBoost machine. The importance score distributions for all the datasets are illustrated in Fig. 2. The top 30 feature dimensions are reported in Table S2 of Additional File 1, and the importance scores of all feature dimensions are in Additional File 2.
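The dimension layout above can be summarized and checked with a small sketch (the range bounds are taken directly from the text):

```python
# Dimension layout of the 292-dimensional feature space described above.
layout = {
    "OHB": (1, 164),      # one-hot binary
    "SNF": (165, 205),    # sequential nucleotide frequency
    "KNF": (206, 225),    # k-nucleotide frequency
    "KSNPF": (226, 273),  # k-spectrum nucleotide pair frequency
    "PseDNC": (274, 292), # pseudo dinucleotide composition
}
sizes = {name: hi - lo + 1 for name, (lo, hi) in layout.items()}
# sizes: OHB 164, SNF 41, KNF 20, KSNPF 48, PseDNC 19; total 292
```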
It is understood that each feature dimension has distinct importance scores in different species. OHB and PseDNC features have relatively high average scores in all species. In particular, OHB features have the highest average score in C.elegans, D.melanogaster and A.thaliana. The KSNPF feature not only gets a high importance score in A.thaliana, E.coli and G.subterraneus, like the KNF features, but also has the highest average score in G.pickeringii. The SNF feature stands out only in E.coli. The feature importance scores range from 0 to 50, and some dimensions' scores are so low that they contribute little to the classification and may add noise to the model. Thus, feature selection before training is potentially useful for improving model accuracy.

Impact of feature selection on classification
We first evaluate the model performance via independent test without feature selection before model training. Then the independent test is carried out with feature selection, where the benchmark datasets divisions and SVM parameters are kept the same. Table 1 and Fig. 3 show the independent test performance before and after feature selection.
The independent test after feature selection shows improved model performance in all the species. In C.elegans, feature selection improves Sn, Sp, ACC and MCC by 7.54%, 3.85%, 7.74% and 0.16. In D.melanogaster, the model performance has the most considerable improvement, by 10.17%, 9.32%, 9.74% and 0.19 for Sn, Sp, ACC and MCC, respectively. For A.thaliana, Sp increases by 6.82% while ACC and MCC slightly increase by 2.27% and 0.05. Besides, Sp, ACC and MCC improve by 9.23%, 7.7% and 0.14 in the E.coli dataset. In G.subterraneus, the metrics improve by 8.34% (Sn), 6.67% (Sp), 7.5% (ACC) and 0.15 (MCC). As for G.pickeringii, the performance improves by 5.17%, 10.73% and 7.89% for Sn, Sp and ACC, respectively.

Comparison with state-of-the-art predictors
Three state-of-the-art DNA 4mC prediction methods, iDNA4mC, 4mCPred and 4mCPred_SVM, are compared with the proposed method. The comparison was conducted using the independent test and cross-validation test on the benchmark datasets. The independent test results of iDNA4mC and 4mCPred were reported in [24]; we could not find independent test results for the 4mCPred_SVM method. Since 4mCPred_SVM provides only the final prediction model, its independent test cannot be reproduced. Thus, here we compare our method with iDNA4mC and 4mCPred in the independent test under the same division of training and testing data. The results of the independent test are presented in Table 2. Our method outperforms the other methods in all species. Overall, the proposed method improves ACC by 3.02% to 7.89% and increases MCC by 0.06 to 0.15. In particular, a significant improvement of our approach can be observed in G.pickeringii (improving Sn by 5.26%, Sp by 10.52%, ACC by 7.89%, and MCC by 0.15).
We performed a 10-fold cross-validation with the same process as the existing methods. The cross-validation results of the three state-of-the-art predictors were reported in the publication of 4mCPred_SVM [25], where the reported performance of 4mCPred had been corrected to resolve an over-estimation problem. The cross-validation results are summarized in Table 3. In addition to the four evaluation metrics, we also list the sample counts of TP (True Positive), FN (False Negative), FP (False Positive) and TN (True Negative). As shown in the table, in D.melanogaster, A.thaliana and G.pickeringii, our method has the most TP and TN counts, increasing ACC by 0.7% to 1.7% and MCC by 0.015 to 0.033. In G.subterraneus, our method has the highest TN, improving ACC and MCC by 1% and 0.02 over 4mCPred_SVM, which has the second-best performance. Additionally, although the TP and TN of our method are not the highest in C.elegans and E.coli, our method slightly improves ACC and MCC by 1% and 0.02 in E.coli and has performance comparable to 4mCPred, better than the other two methods, in C.elegans.
It is clear that our method achieves better overall performance than the existing predictors in the independent and cross-validation tests. The improvement in ACC indicates that our method accurately identifies more 4mC sites, and the increase in MCC means that our method has more balanced performance in classifying positive and negative samples. Therefore, our method is more effective at identifying DNA 4mC sites than the existing predictors.

Case studies
To confirm the effectiveness of our method in solving practical problems, two detailed case studies are conducted on two widely used model organisms, C.elegans and D.melanogaster [33][34][35][36]. As 4mC plays critical roles in DNA expression and replication in these models, we describe how our method can help identify 4mC sites more accurately in the related genes. We focus on the dlk-1 gene, which can promote mRNA stability and local translation in C.elegans [37], and on the DSCAM gene, which can contribute to the specificity of neuronal connectivity in D.melanogaster [38]. The 26 and 137 validated 4mC sites in the dlk-1 and DSCAM genes are collected from the MethSMRT database. The collected 4mC-containing DNA sequences are all 41-bit and can be directly submitted to the web tools of the three state-of-the-art methods. The prediction results are depicted in Fig. 4 and Table 4. Figure 4 shows the label confidence predicted by the four predictors, where a positive confidence indicates that the corresponding site is predicted to be a 4mC site and a negative confidence means it is predicted to be a non-4mC site. As shown in the figure, iDNA4mC achieves the worst performance in both cases, and half of its predictions are incorrect for the DSCAM gene. 4mCPred, 4mCPred_SVM and the proposed method have similar performance in the DSCAM gene case, while the results of 4mCPred and our proposed method on the dlk-1 gene are better than those of 4mCPred_SVM.
More details of the predictions are presented in Table 4. Since the testing data in the case study contains only positive samples, there are only TP and FN counts in the results. For the dlk-1 case, 4mCPred has only one wrong prediction and the proposed method makes two false predictions out of 26 samples, while iDNA4mC and 4mCPred_SVM have 7 and 6 incorrect predictions, respectively. For the DSCAM case, 137 4mC sites are tested, and our proposed method makes 126 correct predictions (i.e., only 11 incorrect predictions). 4mCPred and 4mCPred_SVM have 16 and 15 false predictions, while iDNA4mC makes 67 false predictions. More detailed results can be found in Additional File 3.

Discussion
To improve the performance, we have focused on choosing more efficient features for 4mC site prediction, including extracting better sequence features and performing feature selection before model learning. However, there are also some limitations in this study: first, the features are mostly derived from sequence content rather than biological characteristics; second, the size of the training data is limited.
In the future, we will continue to optimize our feature space with novel sequence features carrying important biological characteristics. Furthermore, we will expand the size of the benchmark datasets to enhance the model's accuracy and generalization ability. Also, since the number of 4mC sites is much smaller than that of non-4mC sites in practical situations, data imbalance will be considered in future research. Finally, we will apply our method to other sequence site prediction problems.

Conclusions
The 4mC site prediction is a typical sequence site classification problem. State-of-the-art research works have made some explorations, but their performance still needs improvement. For this purpose, we propose to construct a more effective feature space, integrating five types of sequence features, and to use a novel learning algorithm with an XGBoost-based feature selection scheme. The results show that feature selection improves the performance, and the prediction model outperforms the three existing predictors in the independent tests and the cross-validations.

Methods
Based on the benchmark datasets, this paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme. In the sequence encoding, five types of sequence features are integrated to form a 292-dimension feature space, representing both global and local sequence characteristics. Then a feature selection scheme is applied, in which feature importance scores are used to rank the features and the best-performing subset is selected. Finally, an SVM-based prediction model is trained on the selected features.

Benchmark datasets
The benchmark datasets were first constructed by Chen et al [23] and are obtained from that work. According to the reference, the 41-bit 4mC-centred DNA sequences were obtained from MethSMRT with a Modification QV threshold of 30. The CD-HIT software was used to remove redundant positive samples. The same number of negative samples was selected randomly to construct a balanced dataset. The negative samples were also 41-bit cytosine-centred DNA sequences, in which no 4mC was detected by SMRT. To compare with the existing predictors, we use the same division of the datasets for independent tests. The benchmark datasets are summarized in Table 5.

Feature space construction
To visualize the difference between the positive and negative sequences, the sequence logos of all six species are plotted using the web tool 'two sample logos' [39]; see Fig. 5. It is clear that the sequence characteristics are distinct among the six species; in particular, positions near the 4mC sites exhibit different patterns in positive and negative samples. In addition, the adjacent nucleotides and spectrum nucleotides across the entire sequence have specific patterns in the different label groups. Thus an expanded feature space combining global and local patterns is expected to help construct accurate models for all the species. Among the existing methods, iDNA4mC uses only the nucleotide chemical property and frequency features, which cannot extract the local adjacent nucleotide patterns; in 4mCPred and 4mCPred_SVM, the features mainly focus on trinucleotide or dinucleotide patterns, ignoring the spectrum nucleotide patterns. In this study, the feature space covers five types of features: one-hot 4-bit binary feature (OHB), sequential nucleotide frequency (SNF), k-nucleotide frequency (KNF), k-spectrum nucleotide pair frequency (KSNPF) and PseDNC. The OHB and SNF features possess the information of the whole sequence and represent global sequential properties, while the KNF, KSNPF and PseDNC features capture local sequence patterns.

One-hot binary feature
The one-hot binary feature is the most widely used sequence representation feature. It converts each nucleotide in the DNA sequence into a 4-bit vector containing exactly one '1'. The length of the OHB feature depends on the number of nucleotide types and the length of the sequence. Since the DNA sample sequence here is 41-bit and has four types of nucleotide, the one-hot binary feature is 164 bits. The encoding rules in this study are as follows: 'A'-(1,0,0,0), 'G'-(0,1,0,0), 'T'-(0,0,1,0), 'C'-(0,0,0,1). From the rule, it is obvious that the OHB feature is sparser than 2-bit or 3-bit binary features. However, the one-hot binary feature makes it more reasonable to calculate the importance score for each dimension of the feature space and to discover local motifs.
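The encoding rule above can be sketched as follows (the function name is ours, for illustration):

```python
# One-hot binary (OHB) encoding, following the stated rule:
# 'A'-(1,0,0,0), 'G'-(0,1,0,0), 'T'-(0,0,1,0), 'C'-(0,0,0,1).
OHB_CODE = {"A": (1, 0, 0, 0), "G": (0, 1, 0, 0), "T": (0, 0, 1, 0), "C": (0, 0, 0, 1)}

def ohb_encode(seq):
    """Flatten a DNA sequence into its 4-bit-per-nucleotide one-hot vector."""
    return [bit for nt in seq for bit in OHB_CODE[nt]]

vec = ohb_encode("ACGT")  # 4 nucleotides x 4 bits = 16 bits
# A 41-nucleotide sample sequence therefore yields 41 x 4 = 164 bits.
```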

Sequential nucleotide frequency
The sequential nucleotide frequency, also known as nucleotide density, is the frequency with which the nucleotide at the current position occurs in the prefix ending at that position. SNF is commonly used together with the binary encoding feature as a global density feature. For an n-bit sequence, SNF calculates one value per position and produces an n-dimensional feature vector that starts with '1'. The SNF feature $d_i$ is defined as:

$$d_i = \frac{1}{|S_i|} \sum_{j=1}^{i} f(s_j), \qquad f(s_j) = \begin{cases} 1, & s_j = s_i \\ 0, & \text{otherwise} \end{cases}$$

where $|S_i| = i$ denotes the length of the prefix ending at the current position $i$ and $s_i$ is the nucleotide at position $i$. For example, the sequence 'AACGTACT' is converted into the SNF feature vector (1, 1, 1/3, 1/4, 1/5, 1/2, 2/7, 1/4).
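The definition can be sketched in a few lines (the function name is ours):

```python
def snf_encode(seq):
    """Sequential nucleotide frequency (nucleotide density): at each position i,
    the frequency of the nucleotide seq[i] within the prefix seq[0..i]."""
    return [seq[: i + 1].count(nt) / (i + 1) for i, nt in enumerate(seq)]

snf = snf_encode("AACGTACT")
# Rounded: [1.0, 1.0, 0.333, 0.25, 0.2, 0.5, 0.286, 0.25]
```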

k-nucleotide frequency
The k-nucleotide (k-mer) frequency is a classic concept in DNA sequence encoding. The KNF feature is the frequency with which each adjacent k-nucleotide occurs in the whole sequence. The length of the KNF feature vector is 4^k, determined by the parameter k. The KNF is calculated as:

$$F(n_1 n_2 \ldots n_k) = \frac{C(n_1 n_2 \ldots n_k)}{S - k + 1}$$

where $n_1 n_2 \ldots n_k$ denotes the adjacent k nucleotides with $n_i \in \{A, C, G, T\}$; $F$ and $C$ are the feature value and the total count of the adjacent nucleotides, and $S$ is the length of the sequence. In this study, k = 1 and k = 2 are used, producing vectors of dimension 4^1 = 4 and 4^2 = 16, i.e. the 20 KNF dimensions in the feature space.
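A minimal sketch of the KNF computation (the function name is ours; frequencies are normalized by the number of k-length windows, S − k + 1):

```python
from itertools import product

def knf_encode(seq, k):
    """k-nucleotide (k-mer) frequency: occurrences of each adjacent k-mer
    divided by the number of k-length windows in the sequence."""
    windows = len(seq) - k + 1
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(windows):
        counts[seq[i : i + k]] += 1
    return [counts[mer] / windows for mer in sorted(counts)]

f1 = knf_encode("AACGTACT", 1)  # 4-dimensional (k = 1)
f2 = knf_encode("AACGTACT", 2)  # 16-dimensional (k = 2)
```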

k-spectrum nucleotide pair frequency
The KSNPF feature depicts the sequence context by calculating the frequency of k-spaced nucleotide pairs (e.g., AXXT is a two-spaced nucleotide pair, and CXXXG is a three-spaced nucleotide pair). Like the adjacent nucleotide pairs above, the feature dimension of the KSNPF is 16 for each k. This feature is calculated as:

$$F(n_1 \underbrace{X \ldots X}_{k} n_2) = \frac{C(n_1 X \ldots X n_2)}{S - k - 1}$$

where $n_1 X \ldots X n_2$ denotes the k-spaced nucleotide pair with $n_i \in \{A, C, G, T\}$ and $S$ is the length of the sequence. In this study, k = 1, 2 and 3 are used, producing the 48 KSNPF dimensions in the feature space.
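A minimal sketch of the KSNPF computation (the function name is ours; for spacing k, the pair at offset i spans positions i and i + k + 1):

```python
def ksnpf_encode(seq, k):
    """k-spectrum nucleotide pair frequency: frequency of nucleotide pairs
    separated by exactly k positions (e.g. AxxT for k = 2)."""
    windows = len(seq) - k - 1
    pairs = [a + b for a in "ACGT" for b in "ACGT"]
    counts = dict.fromkeys(pairs, 0)
    for i in range(windows):
        counts[seq[i] + seq[i + k + 1]] += 1
    return [counts[p] / windows for p in pairs]

f = ksnpf_encode("AACGTACT", 2)  # 16-dimensional vector for k = 2
```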

PseDNC
As an essential sequence feature, PseDNC combines global and local structural properties and has been widely used in sequence site prediction problems [40]. For a DNA sequence of length $L$, the PseDNC feature is a $(16 + \lambda)$-dimensional vector $(d_1, d_2, \ldots, d_{16+\lambda})$ with

$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le k \le 16 \\ \dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 16 < k \le 16 + \lambda \end{cases}$$

where $f_k$ denotes the normalized frequency of the k-th adjacent nucleotide pair; $w$ is the weight factor, and $\theta_j$ is the j-tier correlation factor, representing the correlation of all j-tier nucleotide pairs along the sequence:

$$\theta_j = \frac{1}{L - j - 1} \sum_{i=1}^{L-j-1} \Theta(R_i R_{i+1}, R_{i+j} R_{i+j+1})$$

where the correlation function $\Theta$ is given by:

$$\Theta(R_i R_{i+1}, R_{i+j} R_{i+j+1}) = \frac{1}{\mu} \sum_{u=1}^{\mu} \left[ P_u(R_i R_{i+1}) - P_u(R_{i+j} R_{i+j+1}) \right]^2$$

Here $\mu$ is the number of DNA local properties and $P_u(R_i R_{i+1})$ is the numerical value of the u-th DNA local property for the adjacent nucleotide pair $R_i R_{i+1}$ at position $i$. In this study, the PseDNC feature is computed with the python package 'repDNA' [41], the λ value defaults to 3, and μ = 38. The names of the 38 DNA local properties utilized here are detailed in the supplementary Table S1 of Additional File 1.
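The PseDNC definition above can be sketched end to end. Note that the property table below is a made-up toy (two hypothetical normalized properties per dinucleotide, for illustration only); the real feature uses the 38 properties supplied by 'repDNA':

```python
DINUCS = [a + b for a in "ACGT" for b in "ACGT"]

# Hypothetical toy property values; the actual feature uses 38 normalized
# physicochemical properties from the repDNA package.
PROPS = {d: [(i % 5) / 4.0, (i % 3) / 2.0] for i, d in enumerate(DINUCS)}

def theta(seq, j):
    """j-tier correlation factor theta_j."""
    L = len(seq)
    total = 0.0
    for i in range(L - j - 1):
        p1, p2 = PROPS[seq[i : i + 2]], PROPS[seq[i + j : i + j + 2]]
        # Correlation function Theta: mean squared property difference.
        total += sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)
    return total / (L - j - 1)

def psednc(seq, lam=3, w=0.05):
    """PseDNC vector: 16 normalized dinucleotide frequencies + lam tiers."""
    L = len(seq)
    f = [sum(seq[i : i + 2] == d for i in range(L - 1)) / (L - 1) for d in DINUCS]
    thetas = [theta(seq, j) for j in range(1, lam + 1)]
    denom = sum(f) + w * sum(thetas)
    return [fk / denom for fk in f] + [w * t / denom for t in thetas]

vec = psednc("AACGTACTGGTCAATCG")  # 16 + 3 = 19 dimensions
```

By construction the 19 components sum to 1, which makes the normalization easy to sanity-check.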

Feature selection scheme
Feature selection can reduce the dimension of the feature space and speed up model training. Many feature selection strategies have been employed in machine learning [42]. In particular, a filter feature selection scheme has previously been used to improve prediction performance. The filter scheme has two steps: first, an F-score is calculated for each dimension of the feature space according to the relevance between feature and label; second, a selection strategy called SFS is adopted to ascertain the feature subset. In this study, we propose an embedded feature selection method, also with two steps. However, we rank features with importance scores produced by the XGBoost training process [43] and select the top features with cross-validations. In our method, XGBoost is the predefined classifier used to analyze the feature importance. XGBoost has been proved to be an efficient tool in data science. In the training process, the XGBoost classifier calculates a feature importance score for each dimension based on the dimension's location and split efficiency in the boosting trees. In this study, XGBoost is implemented with the python package 'xgboost' (version 0.90). The feature importance scores are obtained through the function 'get_score'. According to the calculation method, the feature importance score has five types: 'weight', 'gain', 'cover', 'total_gain' and 'total_cover'; here we use the default 'weight' importance score.
With the importance scores derived by the XGBoost classifier, feature dimensions are ranked from the highest to the lowest. Then the lower-ranked features are removed from the feature space one by one, and the feature subset performance is evaluated by 10-fold cross-validation with a support vector machine. The feature subset with the best performance is taken as the final feature space for 4mC prediction.

Support vector machine
Support vector machine (SVM) is a popular machine learning classifier and has been shown to be more efficient than other algorithms for DNA 4mC prediction in state-of-the-art research [25]. In this study, SVM is implemented using the python package 'scikit-learn' (version 0.22) [44]. The kernel function of the SVM prediction model is set to the radial basis function (RBF) kernel. The hyperparameters C and γ are optimized by a grid search with cross-validations. With the output of the probability scores, the ROC curve can be plotted. The probability score threshold is set to 0.5 to obtain the predicted label. Here, we also compare SVM with three other traditional machine learning methods, Random Forest, Naive Bayes and Neural Network, and the results are reported in Table S3 of Additional File 1.
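The RBF-SVM training step can be sketched with scikit-learn as below. The power-of-two search ranges for C and γ are illustrative assumptions only, since the original ranges are not reproduced here:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumed search ranges (illustrative, not the paper's exact values).
param_grid = {
    "C": [2 ** e for e in range(-5, 6)],
    "gamma": [2 ** e for e in range(-7, 2)],
}

# probability=True exposes predict_proba, whose scores feed the ROC curve;
# cv=10 matches the 10-fold cross-validation described in the text.
grid = GridSearchCV(SVC(kernel="rbf", probability=True),
                    param_grid, cv=10, scoring="accuracy")
# grid.fit(X_train, y_train)
# probs = grid.predict_proba(X_test)[:, 1]   # probability scores
# labels = (probs >= 0.5).astype(int)        # 0.5 threshold as in the text
```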

Performance evaluation metrics
To compare with the existing predictors, the evaluation metrics in this study are consistent with the state-of-the-art methods: Sensitivity (Sn), Specificity (Sp), Accuracy (ACC) and Matthews correlation coefficient (MCC). These four metrics are defined as:

$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Sn shows the model's capability of identifying positive samples, while Sp shows its capacity for classifying negative samples; ACC is the prediction accuracy over all samples; MCC evaluates the overall performance of a predictor. In this study, the receiver operating characteristic (ROC) curve is also used to analyze model performance. The ROC curve is plotted with the false positive rate (1 − Sp) on the x-axis and the true positive rate (Sn) on the y-axis. The area under the curve (AUC) evaluates the classification performance, and a larger AUC means better performance.
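The four metrics follow directly from the confusion-matrix counts, as in this short sketch (the function name is ours; the sample counts are illustrative):

```python
import math

def metrics(tp, fn, fp, tn):
    """Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

sn, sp, acc, mcc = metrics(tp=80, fn=20, fp=10, tn=90)
# sn = 0.8, sp = 0.9, acc = 0.85, mcc ≈ 0.704
```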