DeepHistone: a deep learning approach to predicting histone modifications

Motivation: Quantitative detection of histone modifications has emerged in recent years as a major means for understanding such biological processes as chromosome packaging, transcriptional activation, and DNA damage. However, high-throughput experimental techniques such as ChIP-seq are usually expensive and time-consuming, prohibiting the establishment of a histone modification landscape for hundreds of cell types across dozens of histone markers. These limitations call for computational methods that complement experimental approaches towards large-scale analysis of histone modifications.
Results: We proposed a deep learning framework to integrate sequence information and chromatin accessibility data for the accurate prediction of modification sites specific to different histone markers. Our method, named DeepHistone, outperformed several baseline methods in a series of comprehensive validation experiments, not only within an epigenome but also across epigenomes. In addition, sequence signatures automatically extracted by our method were consistent with known transcription factor binding sites, thereby giving insights into regulatory signatures of histone modifications. As an application, our method was shown to be able to distinguish functional single nucleotide polymorphisms from their nearby genetic variants, and thus has the potential to be used for exploring functional implications of putative disease-associated genetic variants.
Conclusions: DeepHistone demonstrated the possibility of using a deep learning framework to integrate DNA sequence and experimental data for predicting epigenomic signals. With its state-of-the-art performance, DeepHistone is expected to shed light on a variety of epigenomic studies. DeepHistone is freely available at https://github.com/QijinYin/DeepHistone.


Background
Histone modifications, as covalent post-translational modifications (PTMs) to histone proteins, have been recognized since the early 1960s as one of the major driving forces that alter chromatin structure [1]. Enabled by such innovative techniques as X-ray crystallography, it has gradually become clear that the modification of histone amino (N)-terminal tails affects inter-nucleosomal interactions, alters the overall chromatin structure or recruits histone modifiers, and eventually impacts gene expression [2]. It is also known that histone modifications, including methylation, acetylation, phosphorylation, ubiquitylation and sumoylation, act in a variety of biological processes such as chromosome packaging [3,4], transcriptional activation and inactivation [5][6][7], as well as DNA damage and repair [8]. Therefore, quantitative detection of histone modifications would provide useful information not only for a better understanding of the epigenetic regulation of cellular processes but also for the development of drugs targeting histone-modifying enzymes [9].
Histone modifications are mainly profiled by such high-throughput experimental techniques as chromatin immunoprecipitation followed by sequencing (ChIP-seq) [10]. For example, Barski et al. generated high-resolution maps for the genome-wide distribution of 20 histone lysine and arginine methylations and identified typical patterns of histone methylations exhibited at promoters, insulators, enhancers, and transcribed regions [11]. Whole-genome profiling of DNA regulatory elements, their relationship to target genes, their histone modification properties, and their chromatin accessibility features was conducted by the Encyclopaedia of DNA Elements (ENCODE) project [12]. Even larger-scale global maps of regulatory elements in 111 reference human epigenomes, together with chromatin accessibility and gene expression information, were established by the Roadmap Epigenomics Consortium [13]. These abundant resources provided new insights into the function of histone modifications and chromatin organization in the genome, demonstrated the central role of epigenomic information for understanding gene regulation and cellular differentiation, and opened a door towards deciphering mechanisms of human disease.
Nevertheless, it is still too expensive and time-consuming to establish a landscape of histone modifications purely relying on biological experiments, due to the large number of cell types and known histone markers. It is, therefore, reasonable to take advantage of computational methods to predict histone modifications, complementing experimental approaches and facilitating the understanding of DNA signatures and modifications that contribute to gene expression. Towards this objective, Benveniste et al. designed a logistic regression model to predict histone modifications from transcription factor-binding profiles and recapitulated the importance of interactions between transcription factors and chromatin-modifying enzymes to gene expression [14]. Karlic et al. elucidated the correlation between histone modification levels and gene expression and designed a linear regression model to predict gene expression relying on a small number of histone modifications [15].
In recent years, deep learning has been successfully incorporated into a variety of bioinformatics studies. For example, Alipanahi et al. proposed a convolutional neural network (CNN) named DeepBind to predict binding proteins and showed higher prediction power than traditional classifiers [16]. Zhou and Troyanskaya designed a model called DeepSEA to learn DNA regulatory signatures via a CNN from epigenomic data [17]. Quang and Xie combined a CNN and a bi-directional long short-term memory network to predict functions of DNA sequences and named their method DanQ [18]. Min [20]. Min et al. further developed a representation learning formulation to embed k-mers into a low-dimensional space and then used the resulting vectors to predict chromatin accessibility via a deep neural network [21]. The success of these methods suggests that deep learning is a powerful technique in genomic studies. However, all these methods rely purely on DNA sequence information and therefore lack the power to make cell line-specific predictions, because DNA sequences are identical across cell lines. To overcome this limitation, hybrid deep learning methods that combine sequence information with biological experimental data have been proposed and have shown clear improvements in specific tasks. For instance, a recently proposed method named DeepTACT combined DNA sequences and chromatin accessibility to predict high-resolution chromatin contacts from promoter capture Hi-C data and achieved state-of-the-art performance [22].
Motivated by the above understanding, we proposed a deep learning approach named DeepHistone to predict histone modifications by integrating DNA sequence information and chromatin accessibility data. The rationale for our method is to capture regulatory signatures from DNA sequences, while taking advantage of the close relationship between histone modifications and chromatin accessibility to further improve the prediction performance. Through a series of comprehensive validation experiments, we demonstrated that DeepHistone is superior to several baseline methods in the prediction of modification sites specific to different histone markers, not only within an epigenome but also across epigenomes. In addition, we illustrated that sequence signatures automatically extracted by our deep learning model were consistent with known transcription factor binding sites. As a potential application, we finally showed that our method can distinguish functional single nucleotide polymorphisms (SNPs) from their nearby genetic variants.

Data sources
We downloaded peak files of 7 histone modification markers for 21 human epigenomes from the Roadmap Epigenomics Project [13]. As shown in Table 1, the 7 markers, including H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3, H3K27ac, and H3K9ac, are regarded as the most important markers that have been verified to be associated with such specific functional regions as enhancers and promoters in the genome [23]. The criterion for selecting an epigenome is that ChIP-seq assays should be performed for all the 7 markers for the tissue or cell line corresponding to the epigenome.
Given a marker, an epigenome, and the peak file of the corresponding ChIP-seq experiment, we used a window of 200 bp to scan the whole human genome (hg19) with step 200 bp and regarded a window that had at least 100 bp overlap with a peak as a histone modification site.
Applying this procedure to every marker and every epigenome and discarding epigenomes (total 6 epigenomes) that had only a small number of modification sites (< 50,000) for some histone markers, we identified a total of 7,626,807 sites in the human genome from 15 epigenomes, as detailed in Table 1.
For an epigenome, we further downloaded corresponding DNase-seq peak files from Roadmap. For a genomic position in a peak, we assigned the fold enrichment score of the peak, calculated by the standard pipeline of Roadmap [13], to the position, as its openness score to quantify the status of chromatin accessibility. For other genomic positions, we regarded their openness scores as zeros. By doing this for every epigenome, we obtained an openness score that was specific to the epigenome for every genomic position.
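As an illustration of the openness score defined above, the following sketch builds a per-base score vector for one region. The function name `openness_scores` and the tuple layout of the peaks are hypothetical, and taking the maximum where two peaks overlap is an assumption, since the text does not say how overlapping peaks are resolved.

```python
def openness_scores(region_start, region_end, dnase_peaks):
    """Per-base openness for the half-open region [region_start, region_end).

    dnase_peaks: list of (start, end, fold_enrichment) tuples.
    Positions outside every peak keep a score of zero.
    """
    scores = [0.0] * (region_end - region_start)
    for peak_start, peak_end, fold in dnase_peaks:
        lo = max(region_start, peak_start)
        hi = min(region_end, peak_end)
        for pos in range(lo, hi):
            idx = pos - region_start
            # Assumption: where peaks overlap, keep the larger enrichment.
            scores[idx] = max(scores[idx], fold)
    return scores


example = openness_scores(0, 10, [(2, 5, 3.5), (4, 7, 1.0)])
print(example)
```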

Design of DeepHistone
We designed a deep neural network model, named DeepHistone, to predict whether a DNA fragment in an epigenome is a site for the 7 histone markers. To achieve this aim, we first extended the input fragment upstream and downstream to obtain a region of 1000 bp centred at the fragment and then fed the resulting region to our model, which consists of three modules: a DNA module, a DNase module, and a Joint module, as illustrated in Fig. 1.
The DNA module, designed as a customized densely connected convolutional neural network [24], extracts sequence information for the input region. For this purpose, a one-hot encoding strategy is used to convert the sequence of the input region into a binary matrix. An initial convolutional layer is then adopted to scan the matrix for sequence patterns, i.e., motifs. The resulting patterns are further fed to two densely connected convolutional blocks, connected in tandem by a convolutional layer and a pooling layer, for extracting high-level features. These features, after passing through a convolutional layer and a pooling layer, are eventually fed to the joint module for the classification task. A densely connected convolutional block consists of three convolutional layers. Mediated by a batch normalization operation and a ReLU activation function, the first two layers connect to not only the subsequent layer but also all later layers. The densely connected architecture is adopted here because recent advances in deep learning have shown that such an architecture can effectively overcome the vanishing gradient problem, strengthen feature propagation, utilize parameters more efficiently, and avoid the overfitting problem [24]. These problems are common in a classical convolutional neural network, especially on tasks with small datasets. Detailed parameter settings of the DNA module are shown in Fig. 1.
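The input preparation described above (extending a 200 bp fragment to a 1000 bp region and one-hot encoding it) can be sketched as follows. The helper names `centred_region` and `one_hot`, the A/C/G/T row order, and the all-zero encoding of ambiguous bases are assumptions made for illustration; boundary handling at chromosome ends is omitted.

```python
BASES = "ACGT"  # assumed row order of the binary matrix


def centred_region(fragment_start, fragment_length=200, region_length=1000):
    """Return (start, end) of the region centred at the fragment."""
    centre = fragment_start + fragment_length // 2
    start = centre - region_length // 2
    return start, start + region_length


def one_hot(sequence):
    """Encode a DNA string as a 4 x L binary matrix (list of rows).

    Bases other than A, C, G, T (e.g. N) become all-zero columns (assumption).
    """
    upper = sequence.upper()
    return [[1 if base == row_base else 0 for base in upper] for row_base in BASES]


region = centred_region(1000)
matrix = one_hot("ACGTN")
print(region)
for row in matrix:
    print(row)
```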
The DNase module extracts chromatin accessibility information for an input region. This module has the same architecture as the DNA module, except that an initial one-dimensional convolutional layer is used at the beginning to process the openness scores of positions in the region. The joint module integrates features extracted by the DNA and DNase modules to produce classification results. To achieve this objective, features extracted by these two modules are concatenated and fed to a feedforward neural network, which uses 7 sigmoid functions to predict in parallel the probabilities that a region is a site for each of the 7 histone modification markers. Note that multiple sigmoid functions instead of a softmax function are adopted because the events that a site belongs to the different markers are not mutually exclusive. In other words, a site can belong to multiple markers simultaneously.
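The difference between independent sigmoid outputs and a softmax output can be seen in a small numerical sketch. The logit values below are invented for illustration and do not come from the trained model; the 0.5 decision threshold is likewise an assumption.

```python
import math

MARKERS = ["H3K4me3", "H3K4me1", "H3K36me3", "H3K27me3",
           "H3K9me3", "H3K27ac", "H3K9ac"]


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Softmax over all logits; the outputs are forced to sum to one."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical final-layer logits for one region.
logits = [2.0, 1.5, -3.0, -2.0, -1.0, 0.5, -0.5]

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

# With sigmoids, several markers can exceed 0.5 at the same time.
predicted = [m for m, p in zip(MARKERS, sigmoid_probs) if p > 0.5]
print(predicted)
print(sum(softmax_probs))
```

Because the softmax outputs share a total mass of one, at most one marker could exceed 0.5, which would be inappropriate for sites carrying several modifications.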
We implemented DeepHistone in Python using PyTorch [25]. An NVIDIA GeForce GTX 1080Ti GPU was used to accelerate the computation. The cross entropy loss was used as the objective function in model training, measuring the discrepancy between a true distribution p and the predicted probability distribution q, as $$H(p, q) = -\sum_{x} p(x) \log q(x).$$ Adam [26] was used as the optimizer with default parameters, except that the initial learning rate was set to 0.001. An early stopping strategy was used to reduce the training time.
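For a multi-label output, the cross entropy between the true and predicted distributions is commonly evaluated per marker as a binary cross entropy. The sketch below follows that reading; averaging over markers, the clipping constant `eps`, and the function name are assumptions, and the code is written in plain Python rather than with the PyTorch loss used in training.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy over the marker outputs of one region.

    y_true: list of 0/1 labels, one per marker.
    y_prob: list of predicted probabilities, one per marker.
    """
    total = 0.0
    for y, q in zip(y_true, y_prob):
        q = min(max(q, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(q) + (1 - y) * math.log(1.0 - q))
    return total / len(y_true)


uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
print(uninformative, confident)
```

An uninformative prediction of 0.5 for every marker gives a loss of log 2, and the loss decreases as the predicted probabilities move towards the true labels.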

Baseline methods
We compared the performance of DeepHistone with three baseline methods, namely DeepSEA [17], DanQ [18], and gkm-SVM [27], with parameters proposed by the respective authors. Briefly, DeepSEA used three convolutional layers, a fully connected layer, and a sigmoid output layer to distinguish epigenomic sites. DanQ used a convolutional layer, a bidirectional long short-term memory layer, a fully connected layer, and a sigmoid output layer to classify DNA sequences. Gkm-SVM represented a DNA sequence as a gapped k-mer vector and then resorted to the widely used support vector machine (SVM) to do binary classification. We also proposed two variations of our model, named "DeepHistone (DNA-only)" and "DeepHistone (DNase-only)". The former discards the DNase module and predicts histone modification markers using only DNA sequence information, and the latter discards the DNA module and makes predictions using only chromatin accessibility data.

Validation method and evaluation criteria
We adopted 5-fold cross-validation experiments to validate the performance of a method in predicting histone modification sites. Briefly, from ChIP-seq peak files regarding the 15 epigenomes and 7 histone markers, we identified a total of 7,626,807 modification sites. Given one of the 15 epigenomes, we partitioned all the known sites into five parts of nearly equal size. Then, in each fold of the validation, we used four parts to train a model and tested its performance on the remaining part. This procedure was repeated five times to guarantee that each site had been tested once and only once. Note that gkm-SVM is very time-consuming when compared with a deep learning method that can be accelerated by hardware (e.g., GPU). Consequently, we had to sample at random only a small number (50,000) of modification sites for gkm-SVM.
Although our method can simultaneously predict whether a DNA fragment in an epigenome is a site for the 7 histone markers, a fragment has only two possible states for a given marker: being a histone modification site or not. This allows us to evaluate the performance of our method using the traditional formulation of binary classification. Specifically, given a histone marker, at a certain threshold of the prediction probability, we calculated the sensitivity as the fraction of its modification sites assigned a probability higher than the threshold, and the specificity as the fraction of sites not relevant to the marker that were assigned a probability lower than the threshold. Varying the threshold value from 0 to 1, we were able to draw a receiver operating characteristic (ROC) curve. The area under this curve was then calculated as a criterion called auROC.
Considering that the number of non-relevant modification sites for a marker is typically much larger than that of true sites, we further calculated the recall and precision at a threshold, drew a precision-recall curve by varying the threshold value, and obtained the area under this curve as another criterion called auPRC.
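The two criteria can be computed by sweeping the threshold over the sorted prediction scores, as sketched below in plain Python. This is an illustrative sketch: the function names are hypothetical, the ROC area uses the trapezoidal rule, and the precision-recall area uses a step-wise sum (often called average precision), which is one of several common conventions and may differ from the one used to produce the reported numbers.

```python
def sweep_thresholds(labels, scores):
    """Cumulative (tp, fp) counts after each distinct threshold, high to low."""
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    counts = []
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Handle tied scores together so they share one curve point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        counts.append((tp, fp))
    return counts


def au_roc(labels, scores):
    """Area under the ROC curve (trapezoidal rule)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for tp, fp in sweep_thresholds(labels, scores):
        tpr, fpr = tp / n_pos, fp / n_neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area


def au_prc(labels, scores):
    """Area under the precision-recall curve (step-wise sum)."""
    n_pos = sum(labels)
    area, prev_recall = 0.0, 0.0
    for tp, fp in sweep_thresholds(labels, scores):
        recall, precision = tp / n_pos, tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
roc_area = au_roc(labels, scores)
prc_area = au_prc(labels, scores)
print(roc_area, prc_area)
```

In the toy input, 8 of the 9 positive-negative pairs are ranked correctly, so the auROC is 8/9, and the step-wise auPRC is 11/12.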
The rationale for our method and validation design is conceptually equivalent to using modification sites specific to a histone marker as the positive set and those not relevant to the marker as the negative set to train a binary classification model for the marker. However, our design has two advantages. First, instead of training 7 models for the 7 markers separately, our method trains a single model for all 7 markers simultaneously, thereby saving computational time. Second, the selection of the negative set in our design is much more stringent than such strategies as selecting DNA fragments at random from the whole genome, because modification sites for different markers may share similar properties, e.g., GC content and the distance to a gene.

Motif visualization
To interpret how DeepHistone captures DNA sequence patterns, we used the following strategy to relate known DNA binding motifs to the sequence patterns extracted by the first convolutional layer of the DNA module. Following the literature [18,20], we first generated a position weight matrix (PWM) for each kernel in the first convolutional layer of the DNA module by scanning along all the input sequences to find activated regions and then averaging over all the activated regions. Formally, a region $x_i$ in an input sequence $s$ was regarded as activated if $$w_k \cdot x_i > \alpha \cdot \mathrm{EAV},$$ where $w_k$ is the weight matrix of the k-th kernel, $\alpha \in (0, 1)$ a control coefficient, and EAV the extreme activation value of $s$, defined as $$\mathrm{EAV} = \max_{i} \, w_k \cdot x_i.$$ We set the length of a kernel to 9 and α to 0.9. We then compared the extracted PWMs with the JASPAR database [28] using the tool TomTom [29] with a q-value threshold of 0.05.
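The PWM extraction can be sketched as below. The sketch assumes that a region is activated when its kernel response exceeds α times the largest response in the same sequence, which is one reading of the activation criterion described above; the function names, the skipping of sequences whose largest response is not positive, and the restriction to the letters A, C, G and T are further assumptions. A kernel of width 3 is used instead of 9 to keep the example short.

```python
BASES = "ACGT"  # assumed row order of kernels and PWMs


def activation(kernel, region):
    """Inner product of a 4 x W kernel with the one-hot encoding of region."""
    return sum(kernel[BASES.index(base)][pos] for pos, base in enumerate(region))


def kernel_pwm(kernel, sequences, alpha=0.9):
    """Average the one-hot encodings of all activated regions of a kernel."""
    width = len(kernel[0])
    counts = [[0] * width for _ in BASES]
    n_regions = 0
    for seq in sequences:
        regions = [seq[i:i + width] for i in range(len(seq) - width + 1)]
        if not regions:
            continue  # sequence shorter than the kernel
        scores = [activation(kernel, region) for region in regions]
        eav = max(scores)  # extreme activation value of this sequence
        if eav <= 0:
            continue  # assumption: ignore sequences the kernel never fires on
        for region, score in zip(regions, scores):
            if score > alpha * eav:
                n_regions += 1
                for pos, base in enumerate(region):
                    counts[BASES.index(base)][pos] += 1
    if n_regions == 0:
        return None
    return [[count / n_regions for count in row] for row in counts]


# Toy kernel that responds most strongly to "TAT".
toy_kernel = [
    [-1, 1, -1],   # A
    [-1, -1, -1],  # C
    [-1, -1, -1],  # G
    [1, -1, 1],    # T
]
pwm = kernel_pwm(toy_kernel, ["GGTATCC", "ACTATAG"])
for row in pwm:
    print(row)
```

The resulting matrix holds base frequencies per position and could then be written in a motif format accepted by a comparison tool; that conversion step is not shown.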

Analysis of functional implications of haQTLs
We applied DeepHistone to explore functional implications of single nucleotide polymorphisms (SNPs) related to histone acetylation quantitative trait loci (haQTLs) identified in a lymphoblastoid epigenome by the histone H3 acetylated on lysine 27 (H3K27ac) marker [30]. Given a SNP, we identified the 1000 bp DNA sequence centred at the SNP position and predicted two probabilities, p_ref and p_alt, that the reference and alternative sequences, respectively, are a histone modification site for the H3K27ac marker. Following the literature [20], the absolute value of the difference between the two predictions was then defined as the functional implication score, Δp = |p_alt − p_ref|, for the SNP.
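The score Δp can be computed as in the following sketch. The predictor passed in as `predict` stands for any function returning the H3K27ac probability of a sequence; the toy predictor used here is purely hypothetical, as is the convention that the SNP sits at index `len(region) // 2`.

```python
def functional_implication_score(predict, ref_region, alt_allele):
    """Return |p_alt - p_ref| for a SNP at the centre of ref_region."""
    centre = len(ref_region) // 2  # assumed position of the SNP
    alt_region = ref_region[:centre] + alt_allele + ref_region[centre + 1:]
    p_ref = predict(ref_region)
    p_alt = predict(alt_region)
    return abs(p_alt - p_ref)


def toy_predict(sequence):
    """Hypothetical stand-in for a trained model: fraction of G bases."""
    return sequence.count("G") / len(sequence)


delta_p = functional_implication_score(toy_predict, "AAAAA", "G")
print(delta_p)
```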

DeepHistone accurately predicts histone modification sites
We first conducted 5-fold cross-validation experiments to assess the performance of our method (see Materials and methods). As shown in Table 2, for a histone marker, the auROC score averaging over the 15 epigenomes is close to 0.9, indicating the effectiveness of our method in predicting modification sites specific to a histone marker. From Fig. 2 (a), we observe that for a histone marker, the auROC score for an epigenome is typically above 0.87, though different epigenomes show fluctuations, also supporting this conclusion. Moreover, the effectiveness of our method is further supported by auPRC scores shown in Table 3 and Fig. 2 (b).
We then compared the performance of our method with that of the baseline approaches. Considering that our method uses both sequence and chromatin accessibility information, while the other approaches rely only on DNA sequence, we discarded the DNase module and implemented a variation of our method called DeepHistone (DNA-only). From Table 2, we observe that the mean auROC score over the 15 epigenomes for a histone marker yielded by this model, though in general slightly lower than that generated by the original model, i.e., DeepHistone (Standard), is clearly higher than those of all three baseline methods (DeepSEA, DanQ, and gkm-SVM). For example, for H3K4me1, the mean auROC of the 15 epigenomes for DeepHistone (Standard) is 0.9065 ± 0.0290, while those for DeepHistone (DNA-only), DeepSEA, DanQ and gkm-SVM are 0.8685 ± 0.0550, 0.7828 ± 0.0280, 0.7649 ± 0.0260, and 0.6361 ± 0.0400, respectively. This observation suggests that our method, even when using sequence information alone, is still superior to the three baseline methods in predicting modification sites specific to a histone marker.
Fig. 2 further confirms this observation. Again taking H3K4me1 as an example, the median auROC of the 15 epigenomes for DeepHistone (Standard) is 0.9152 in the box plot, while those for DeepHistone (DNA-only), DeepSEA, DanQ and gkm-SVM are 0.8922, 0.8200, 0.8058, and 0.6804, respectively. We then conducted a one-sided paired-sample binomial exact test to assess whether the auROC scores of the 15 epigenomes yielded by a method for a histone marker are higher than those generated by another. Results show that DeepHistone (Standard) is superior to DeepHistone (DNA-only) with significant p-values for H3K4me1, H3K4me3, H3K9ac, and H3K27ac (all p-values are equal to 3.052E-05) and marginally significant p-values for H3K9me3 (p-value = 3.693E-03) and H3K27me3 (p-value = 5.924E-02). For H3K36me3, these two methods show no apparent difference (p-value = 0.500). Furthermore, DeepHistone (DNA-only) is superior to all three baseline methods for all 7 histone markers (all p-values are equal to 3.052E-05). These results further support the conclusion that our method outperforms existing baseline approaches, even when using sequence information alone.
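A one-sided paired-sample binomial exact test can be read as a sign test: count the epigenomes in which one method scores higher and compute the upper tail of a Binomial(n, 0.5) distribution. The sketch below follows that reading; the function names and the dropping of tied pairs are assumptions. Under this reading, 15 wins out of 15 epigenomes gives 0.5^15 ≈ 3.052E-05 and 13 wins out of 15 gives approximately 3.693E-03, which agree with the p-values reported above.

```python
from math import comb


def binomial_upper_tail(wins, n):
    """P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def one_sided_sign_test(scores_a, scores_b):
    """p-value for the alternative that method A scores higher than B.

    scores_a, scores_b: paired scores, e.g. auROC per epigenome.
    Tied pairs are dropped (assumption).
    """
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    n = sum(a != b for a, b in zip(scores_a, scores_b))
    return binomial_upper_tail(wins, n)


p_all_wins = binomial_upper_tail(15, 15)
p_13_wins = binomial_upper_tail(13, 15)
print(p_all_wins, p_13_wins)
```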

Contributions of the DNA and DNase modules
To evaluate contributions of the sequence information and chromatin accessibility data, we discarded the DNase module and the DNA module in turn, yielding the variants DeepHistone (DNA-only) and DeepHistone (DNase-only) described above. We notice that the sequence information contributes more to the final performance than the chromatin accessibility data, because the removal of the DNA module, i.e., DeepHistone (DNase-only), in general results in a larger drop in performance. To further confirm this observation, we again conducted the aforementioned one-sided paired-sample binomial exact test to assess whether the auROC scores of the 15 epigenomes yielded by DeepHistone (DNA-only) for a histone marker are higher than those generated by DeepHistone (DNase-only). Results show that the former is superior to the latter with significant p-values for H3K4me1, H3K4me3, H3K27me3, H3K36me3, and H3K9me3 (all p-values are equal to 3.052E-05). For H3K9ac, the p-value of 4.883E-04 is also significant. The only exception is H3K27ac, where the p-value (0.304) is not significant.
On the one hand, it is not surprising that the sequence information contributes to the prediction of histone modification sites. This conclusion is supported by abundant studies that demonstrate the effectiveness of such sequence patterns as transcription factor binding sites in the prediction of histone modification sites [14]. On the other hand, the effectiveness of chromatin accessibility information can be explained by not only the relationship between histone methylation and DNA accessibility [31] but also the correlation between histone acetylation and chromatin status [32]. Moreover, we conjecture that the smaller contribution of chromatin accessibility information relative to sequence information may be due to the fact that the DNase-seq data are identical for all 7 markers within an epigenome, and that signatures of chromatin accessibility may not be as strong as those of sequence.

DeepHistone predicts histone modification sites across epigenomes
Although the above cross-validation experiments demonstrated the success of our method in the prediction of modification sites specific to a histone marker, in practice it would be more meaningful to predict histone modification sites for an epigenome on which no biological experiment has been conducted. We therefore proposed the following collective scoring strategy to predict the status (i.e., the histone markers to which a site belongs) of modification sites for a novel epigenome.
Given a novel epigenome and a genomic region, we would like to predict whether this region was a modification site for a histone marker, with respect to the given epigenome. To achieve this objective, we resorted to a model trained on a known epigenome to predict a probability that indicated whether this region is a modification site for the same histone marker, and we averaged all such probabilities over all known epigenomes to obtain a final prediction probability. In this procedure, the input includes the DNA sequence of the region and the chromatin accessibility data specific to the novel epigenome.
We conducted a leave-one-out experiment to evaluate the performance of our method with this strategy. Specifically, in each validation run, we selected one of the 15 epigenomes and assumed that status of modification sites in this target epigenome is unknown. Then, we applied the collective scoring strategy to recover the status of these sites by making use of the remaining 14 epigenomes. Finally, we evaluated the performance of our method in terms of the auROC and auPRC scores by using the known status of the sites in the target epigenome as the gold standard. In implementation, we took advantage of the models trained in the aforementioned 5-fold cross-validation experiments and averaged the probabilities calculated by these 5 models to obtain the prediction probability with respect to a known epigenome.
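The collective scoring strategy amounts to two nested averages: first over the cross-validation models of each known epigenome, then over the known epigenomes. A minimal sketch is given below; the dictionary layout, the function name, and the probability values are hypothetical.

```python
def collective_score(fold_probabilities):
    """Final probability for one region and one marker in a novel epigenome.

    fold_probabilities: {known_epigenome_id: [p_fold_1, ..., p_fold_k]},
    where each p is predicted by a model trained on that known epigenome,
    using the DNA sequence and the openness scores of the novel epigenome.
    """
    per_epigenome = [sum(ps) / len(ps) for ps in fold_probabilities.values()]
    return sum(per_epigenome) / len(per_epigenome)


# Hypothetical predictions from three known epigenomes with two models each.
example_probabilities = {
    "E003": [0.8, 0.6],
    "E114": [0.4, 0.4],
    "E116": [0.9, 0.5],
}
final_probability = collective_score(example_probabilities)
print(final_probability)
```

In the leave-one-out setting described above, the dictionary would hold 14 known epigenomes with 5 cross-validation models each.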
We presented the results in Fig. 4, in which both auROC and auPRC scores were averaged over the 7 histone markers to make the presentation concise. From the figure, we can clearly see that the cross-epigenome prediction by DeepHistone is effective, in that for the 15 epigenomes, the auROCs are typically above 0.8, and the auPRCs are typically above 0.6. We also notice that the cross-epigenome prediction in general exhibits lower performance than self-prediction by DeepHistone (5-fold cross-validation). This is reasonable because an epigenome may have its specific sequence codes and chromatin accessibility patterns that might not be captured by the collective scoring strategy.
When compared with the three baseline methods, DeepHistone achieves higher performance for all 15 epigenomes. For example, the average auROCs for E003 (an embryonic stem cell line) are 0.8391, 0.7697, 0.7744, and 0.6711 for DeepHistone, DeepSEA, DanQ, and gkm-SVM, respectively. Indeed, DeepHistone achieves higher auROC scores than all three baseline methods for all 15 epigenomes. As a result, a one-sided binomial exact test against the null hypothesis that the performance of DeepHistone across the 15 epigenomes is not different from that of a baseline method gives significant p-values for all three methods (all p-values are equal to 3.052E-05). This conclusion is further supported when using auPRC as the evaluation criterion.

DeepHistone recovers TF binding motifs
To demonstrate sequence patterns automatically extracted by our method, we used the strategy described in Materials and methods to obtain sequence signatures (i.e., PWMs) learned from the first convolutional layer of the DNA module with respect to an epigenome. We further identified putative sequence motifs by using the tool TomTom to match these PWMs to the JASPAR database. For each epigenome, we displayed the sequence logo of one of the matched motifs in Fig. 5.
In different carcinoma cell lines, DeepHistone recovered motifs specific to each cell line, which demonstrates its sensitivity. In the lung carcinoma cell line (E114), DeepHistone recovered E2F3, TFAP2C and GRHL2. It has been verified that the overexpression of the E2F3 transcription factor promotes the development of lung cancer [33,34]. TFAP2C has previously been shown to promote lung tumorigenesis and aggressiveness by upregulating TGFBR1 [35,36]. Unlike E2F3 and TFAP2C, GRHL2 can suppress tumor metastasis by regulating the transcriptional activity of RhoG in lung cancer [37]. In HeLa-S3, the cervical carcinoma cell line (E117), PROX1 and NR2F6 were found by DeepHistone. The commitment of PROX1-positive cells is an early event in cervical neoplastic progression, and the expression of PROX1 is considered evidence of an early lymphangiogenic switch [38]. Abnormally high expression of NR2F6 in early-stage cervical cancer predicts pelvic lymph node metastasis, tumor recurrence and poor prognosis, and NR2F6 might be a potential therapeutic target for cervical cancer [39]. As for the hepatocellular carcinoma cell line (E118), E2F8, GABPA and SOX11 were recovered. It has been shown that E2F8 contributes to human hepatocellular carcinoma by regulating cell proliferation [40] and is considered a potential therapeutic target for hepatocellular cancer [41]. GABPA inhibits metastasis of hepatocellular carcinoma [42], and SOX11 is important in the regulation of hepatocellular carcinoma cell proliferation, migration and invasion [43]. In addition, DeepHistone recovers SREBF2, HOXA5 and ZNF24 in the human umbilical vein endothelial primary (HUVEC) cell line (E122) and NKX6-1 in the embryonic stem cell line (E008). These recovered transcription factors have been verified to play important roles in the corresponding cell lines [44][45][46][47][48]. To sum up, DeepHistone is able to recover potentially functional transcription factors corresponding to specific cell lines.

DeepHistone explains functional implications of SNPs
Although genome-wide association studies (GWAS) have successfully identified thousands of single nucleotide polymorphisms (SNPs) associated with complex traits [49], most of these SNPs are located outside coding regions. The explanation of the functional implications of these SNPs has thus long been a critical task in genetic studies [50]. Recently, a new technique that combines a deep and long-read ChIP-seq assay on H3K27ac with a powerful statistical test has enabled the identification of histone acetylation quantitative trait loci (haQTLs) related to a lymphoblastoid epigenome. The identified SNPs exhibit high predictive power in exploring mechanisms of autoimmune disease. We therefore applied DeepHistone to analyze these SNPs, demonstrating potential applications of our method.
From the literature [30], we identified a positive set that includes 7497 SNPs (haQTLs) specific to H3K27ac in the lymphoblastoid epigenome (E116) and appearing in the 1000 Genomes Project [51]. Meanwhile, we generated a negative control set that includes the same number of SNPs as the positive one by identifying for each haQTL a SNP located about 500 bp away, also from the 1000 Genomes Project. We then used the formulation detailed in Materials and methods to calculate functional implication scores for the identified SNPs and tested whether scores for positive SNPs are significantly different from those for negative ones. The results, as shown in Fig. 6, clearly show that the haQTLs tend to have higher functional implication scores than the control SNPs. A one-sided Wilcoxon rank sum test against the null hypothesis that the median scores of these two sets of SNPs are identical yields a very significant p-value of 1.369E-140, strongly supporting the conclusion that haQTLs have higher functional implication scores. In other words, these SNPs are more likely to change the function of the lymphoblastoid epigenome, and thus are more likely to be responsible for a phenotype. We further generated four other control sets in which a SNP is required to be 1000, 1500, 2000 and 2500 bp away from a haQTL. The results, as shown in Fig. 6, lead to a similar conclusion. All these results suggest that our method has the potential to discriminate SNPs responsible for a certain phenotype from their nearby genetic variants.
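The comparison between haQTL scores and control scores can be sketched with a one-sided rank sum test based on the normal approximation. This is an illustrative sketch and not the statistical software used for the reported p-value: the function name is hypothetical, tied values receive average ranks, and neither a tie correction of the variance nor a continuity correction is applied. The scores below are invented.

```python
import math


def rank_sum_one_sided(x, y):
    """Test the alternative that values in x tend to exceed those in y.

    Returns (U statistic of x, approximate one-sided p-value).
    """
    combined = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values at sorted positions i..j-1 share the mean of ranks i+1..j.
        average_rank = (i + 1 + j) / 2.0
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)
    return u, p_value


# Hypothetical functional implication scores.
haqtl_scores = [0.30, 0.25, 0.40, 0.35, 0.50]
control_scores = [0.05, 0.10, 0.02, 0.20, 0.15]
u_stat, p_value = rank_sum_one_sided(haqtl_scores, control_scores)
print(u_stat, p_value)
```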

Conclusions and discussion
We have proposed a deep learning framework named DeepHistone to integrate DNA sequence information and chromatin accessibility data for predicting histone modification sites. Through comprehensive validation experiments regarding 7 histone markers and 15 epigenomes, we have shown that our approach is superior to several baseline methods in discriminating among modification sites specific to different histone markers, capable of making predictions across epigenomes, interpretable in extracted sequence features, and applicable to exploring functional implications of SNPs.
The success of our method can be attributed to the combination of the following factors. First, we have designed a novel deep neural network model that incorporates state-of-the-art techniques from the deep learning community. In particular, the densely connected architecture effectively overcomes such problems as the vanishing gradient and overfitting, and greatly improves the prediction accuracy. Second, besides sequence information, we have also incorporated chromatin accessibility data into our model. These two types of information can then complement each other in our neural network model to capture subtle signals towards the accurate prediction of histone modification sites.
Certainly, our work can be further improved in several aspects. First, resorting to an embedding representation of DNA sequences instead of using the one-hot encoding may further improve the prediction accuracy [21]. Second, considering the sequential nature of DNA fragments, the incorporation of a recurrent neural network architecture, especially long short-term memory units, may further improve the performance of our method [18,21]. Third, instead of learning sequence motifs from scratch using convolutional kernels, it is also possible to incorporate known sequence patterns and design a hybrid network architecture [20]. Fourth, we used DNase-seq peaks from Roadmap to quantify chromatin accessibility. This treatment, though simple, may not be precise. Fifth, besides chromatin accessibility data, it is also worth considering the integration of the abundant gene expression data available. Finally, besides our current formulation of predicting for a certain epigenome putative modification sites specific to different histone markers, it will also be beneficial to formulate the problem from the perspective of predicting for a fixed histone marker putative sites for different epigenomes.