Selection and assessment of positive and negative datasets
We used the previously established definition of a regulatory region as determined by genome-wide histone modification signatures [8]. For example, H3K4me3 is known to be associated with promoters, while H3K4me1 with enhancers. Ernst et al. predicted different types of regulatory regions using a hidden Markov model. In this paper, we focus on two major regulatory regions: promoters and enhancers.
We also selected the same numbers of the mCpG sites in random genomic regions as negative datasets. To exclude the effect of differential methylation due to genomic location, we chose random genomic regions with the same relative distance to the nearest transcription start site (TSS) or exon-intron boundary for each type of regulatory regions (see Methods for details).
To determine the quality of the regulatory regions determined by histone marks, we calculated the correlation between the methylation level of a mCpG site and the expression of its associated target genes (Figure 1A). The adjacent gene of a promoter was selected as its target gene. The predicted target genes for enhancers were obtained from Thurman et al. [9], which is based on the correlation of DNase I hypersensitivity activity between distal elements and promoters across cell lines. For each methylation site, Pearson correlation coefficient was used to compute the correlation between methylation level at a given site and the expression of its target gene across cell lines [22]. As comparison, the same correlation was calculated for the negative datasets.
Compared to the mCpG sites in random genomic regions, the methylation level of the mCpG sites in the regulatory regions was significantly more negatively correlated with the expression of the target genes (p < 1.0E-15; Kolmogorov-Smirnov test) (Figure 1B and Figure 1C). This is consistent with the notion that DNA methylation represses gene expression. We would like to point out that the conclusion still holds if we used Spearman's rank correlation, rather than Pearson correlation coefficient (Additional file 1: Figure. S1A, B).
Overall, our result suggests that the regulatory regions predicted by histone marks and the random genomic regions with similar relative genomic locations are of reasonable quality and are suitable to serve as the positive and negative sets of this study.
mCpGs in regulatory regions are hypomethylated
We then extract a series of DNA methylation-dependent features by comparing the methylation sites in the regulatory regions and random mCpG sites in the negative sets. We found that the methylation level of regulatory mCpG sites was significantly lower than the other mCpG sites (Figure 2). The average methylation levels of mCpGs in active promoters and the corresponding negative set are 0.34 and 0.72, respectively, showing statistically significant difference (p < 1.0E-15 based on Kolmogorov-Smirnov test). Similarly, the methylation levels of mCpGs in strong enhancers and the corresponding negative sets are 0.69 and 0.84, respectively (p < 1.0E-15 based on Kolmogorov-Smirnov test). These results persisted when CpG islands in the regulatory regions and the corresponding random regions were excluded, indicating that the low methylation level in regulatory regions was not due to the overlap with CpG islands, which are generally hypomethylated (Figure 2B, 2C). Interestingly, regulatory regions overlapping with CpG islands demonstrated lower methylation levels than regulatory regions outside of the CpG islands (p < 1.0E-15 based on Kolmogorov-Smirnov test; Figure 2B, 2C), suggesting that those CpG islands playing a regulatory role have lower methylation levels. Overall, our observations suggested that hypomethylation is a key feature of regulatory regions, independent of the relative location to a CpG island.
Hypomethylation in regulatory regions extends across a long range
Within a methylated regulatory region, the lowest methylation level is found centrally within the region, and this hypomethylation extends across a long range (Figure 3). For example, the average methylation level of mCpGs at the center of active promoters (0.34) gradually increases across the promoter region and reaches a plateau (0.82) at ~1500 bp away, suggesting that hypomethylation is not localized to a narrow genomic region (Figure 3A). The extended hypomethylated regions were also apparent within strong enhancers (Figure 3B). Strikingly, the extended hypomethylation was also observed within transcription factor (TF) binding sites, which are generally more localized to 10-20 bp regions. Using ChIP-seq and predicted data for several TFs [23], we found that the methylation levels were lowest in the center of TF binding sites and reached background level by 1500 bp (Figure 3C), despite the fact that the majority of binding sites are less than 20 bp in length.
Methylation levels of neighboring regulatory mCpG sites are highly correlated
One explanation of the extended hypomethylation is correlation between neighboring regulatory mCpG sites. To test this hypothesis, we calculated the autocorrelation of methylation profiles within regulatory regions. Specifically, we computed the correlation of methylation levels at two methylation sites as a function of their genomic distances (Figure 3D, see Methods for detail). As expected, mCpG sites within close proximity showed higher correlation than those located distally. Based on the autocorrelation of methylation profile, we observed that the correlation of methylation between mCpG sites was significantly stronger in regulatory regions than in non-regulatory regions (p < 1.0E-15 based on Kolmogorov-Smirnov test; Figure 3D). In regulatory regions, the correlation of methylation levels extended across distances of up to 326 and 231bp for promoters and enhancers, respectively, whilst in random genomic regions the correlation disappeared by 41bp (see also Additional file 1: Figure S1C).
We speculated that a high density of CpG sites could be the underlying mechanism for correlation between methylation levels, as sites in close proximity might be co-regulated by DNA methyltransferases. We examined the CpG densities in the regulatory and random regions and found that regulatory CpG sites were much denser than those in random regions, especially within active promoters (Figure 3E). This observation was true even if the CpG islands in these regions were excluded (Additional file 1: Figs S1D, E). Thus, the hypomethylation in regulatory regions and/or TF binding sites can be attributed to both the high correlation between methylation levels and the high CpG density within these regions.
Inactive regulatory regions are hypermethylated
Epigenetic marks in regulatory regions differ between different cell types. For example, in a given cell line a genomic region can be an active promoter enriched for active histone marks (e.g. H3K4me3, H3K9ac, and H3K27ac), while the same region may be inactive in another cell line due to the absence of such histone marks. To address the role of histone modifications within regulatory regions, we compared changes in the methylation level of the regulatory mCpG sites to changes in histone mark status. The methylation levels between two cell lines (H1 and GM12878) were examined, as genome-wide histone modification and methylation data were available for both cell lines. We defined three genomic regions for each cell line: transcriptionally active, transcriptionally inactive, and random (background) regions. For example, the genomic regions predicted as regulatory regions in H1 cell were defined as 'active' in H1 cell. The genomic regions that were predicted as regulatory regions in GM12878 cell, but not in H1 cell were categorized as 'inactive' in H1 cell. These "inactive" regions represent the genomic regions that are not enriched by transcriptionally active histone marks in H1, but have the potential to become transcriptional active in other cell lines. As comparison, we also selected a set of genomic regions that do not overlap with any regulatory region in both H1 and GM12878 (see details in Methods and Figure 4A). The mCpGs methylation levels inside active, inactive, and random regions in H1 cell were compared.
Using our comparative analysis between H1 and GM12878 cells, we found that the methylation levels at mCpG sites within inactive promoters and enhancers resembled neither the active nor the background (random region) levels (Figure 4B). Interestingly, the methylation levels in inactive regulatory regions were even higher than background levels obtained from random regions (p < 1.0E-15; Kolmogorov-Smirnov test) (Figure 4B). This phenomenon was robust as it was present when we compared H1 cells with other different cell lines (additional file: 2: Figure S2 A-E). Furthermore, the finding still held if we grouped the regulatory regions based on whether they overlap with CpG islands (Figure 4B, additional file 2: Figs S2 A-E, 3: Figure. S3). Finally, rather than using predicted active and inactive regulatory regions, if we used the raw histone marks (e.g. H3k9ac), we made the same observation in multiple cell lines (Additional file 4: Figure S4). Our findings demonstrate that inactive regulatory regions show hypermethylation relative to the genomic background, distinguishing inactive regulatory regions from background methylation.
Methylation level in regulatory regions can vary considerably between cells
The difference in methylation levels between active and inactive regions suggested that regulatory mCpG sites might have a greater range of possible methylation levels than other mCpGs. To test this hypothesis, we calculated the variability of each mCpG site across the cell lines whose genome-wide methylation profiles are publically available [22]. The variance of the methylation levels for the regulatory regions predicted in H1 cells were then compared to those from random genomic regions (Figure 4C, 4D). About 40% of the regulatory mCpG sites have a variance larger than 0.2, whereas only 15% of mCpG sites in other genomic regions showed the same variance, suggesting that the regulatory mCpG sites have a larger range of potential methylation levels.
We then directly related the methylation level of mCpGs to the regulatory state (active versus inactive). For those mCpG sites that were present in an active regulatory region in both cell lines, the difference in methylation level between the two cell lines was small (Additional file 2: Fig. S2F). In fact, the vast majority (~75%) of the regulatory mCpG sites showed very small methylation level differences in the range of -0.1 to 0.1 between the two different cell lines. This difference was similar to the difference in methylation levels of mCpG sites in random genomic regions. In contrast, the regulatory mCpG sites present in both active and inactive regulatory regions in two cell lines showed a greater methylation difference; less than a half (42%) of the active-inactive regulatory regions had methylation level differences between -0.1 and 0.1. Taken together, the large range of potential methylation levels in the regulatory regions can be associated with the status changes of the regulatory regions.
Distinct sequence motifs are associated with regulatory mCpG sites
Since DNA motifs were used to predict enhancers, we also explore to identify the DNA motifs associated with mCpG sites in regulatory regions. The rationale is that DNA methylation is believed to play a crucial role in the transcription process by directly interfering with TF binding to the regulatory regions [24]. Therefore, we hypothesize that sequence motifs around the mCpGs in regulatory regions were more likely to be TF binding sites and would serve as another informative feature distinguishing regulatory mCpG sites. We examined all 8-mer sequences with a CpG in the center and compared the occurrence of each motif in the regulatory regions and in the random regions. The statistical significance of each motif was evaluated, and both overrepresented and underrepresented motifs were identified. In total, 86 and 194 8-mers were significantly overrepresented in the active promoters and the strong enhancers, respectively, compared to background (p < 1.0E-5). Additionally, we found 104 and 86 significantly underrepresented 8-mers in the active promoter and the strong enhancers (p < 1.0E-5). Figure 5 lists the top 5 most significantly overrepresented and underrepresented 8-mers in the regulatory regions (see additional file 5: Fig. S5 for additional motifs). These motifs were also compared with known transcription factor consensus sequences (5: Figure. S5).
Interestingly, the nucleotide composition of the overrepresented 8-mers was different from that of the underrepresented 8-mers. For example, the GC content of overrepresented 8-mers obtained from the promoters and enhancers are 0.68 and 0.60, respectively. In contrast, the GC contents of the underrepresented 8-mers from the promoters and enhancers are only 0.47 and 0.31, respectively. This distinct nucleotide composition difference became more significant when considering only the 2 bases directly adjacent to the CpG sites. Over 77% and 68% of the overrepresented 8-mers in the promoters and enhancers, respectively, had either cytosines or guanines as a direct neighbor of the CpG site. In contrast, only 31% and 28% of the underrepresented 8-mers in these two types of regulatory regions had either cytosines or guanines at these positions. Cytosines or guanines were simultaneously observed in both direct neighbors of the overrepresented motifs (56% and 45% in the promoters and enhancers, respectively), which occur much less frequently in the underrepresented motifs (13% and 11% in the promoters and enhancers, respectively).
Regulatory regions are predictable
Distinct genomic features of the regulatory mCpGs, which distinguish them from other mCpGs in negative sets, were used to predict regulatory regions. We predicted the regulatory regions using these features, including methylation level, CpG density, autocorrelation of methylation levels, variance of methylation levels among different cell lines and sequence motifs (significance (-log(P)) of the occurrence of 8-mer sequence motifs surrounding the mCpGs). The computation of these features is described in detail in Methods section. As comparison, we predicted the regulatory regions solely based on methylation level.
Support vector machine (SVM) was used to classify regulatory and non-regulatory regions. SVM finds a set of hyper planes that separates data points into a set of classes in high dimensional feature space. To distinguish between regulatory and random genomic regions, 10-fold cross validation was used to measure the classification performance, in which 90% of regions were used to train the model, and the remaining 10% of regions were used as test dataset (see details in Methods). Figure 6A illustrated the performance of the prediction for active promoters and strong enhancers using a Receiver Operating Characteristic (ROC) curve. The area under the ROC curve (AUC) was used to depict the performance of a binary classification. Active promoters and strong enhancers were predicted quite well, showing an AUC of 0.992 and 0.817, respectively (Figure 6A). Perhaps not surprisingly, if we only used the methylation levels as the feature for prediction, the performance is lower, with AUC of 0.985 and 0.692 for promoters and enhancers, respectively. The prediction for enhancers is significantly improved with the additional features.
Some features showed more significant contribution to the prediction than other features. Therefore, the ability of each feature to discriminate regulatory and non-regulatory regions was analyzed by the information gain of each feature (Figure 6B). Information gain (IG) of a given feature F with respect to the classes (e.g., regulatory or random regions) is the entropy reduction of the sample set when we know the feature F (see Methods in detail). Interestingly, the most informative features are different when predicting promoters and enhancers. For the promoters, the features showing the largest information gain are methylation level and methylation variance, while for enhancers, the most useful features are CpG density and methylation autocorrelation (Figure 6B). Some features show strikingly different contribution in predicting promoters and enhances. For example, the feature of methylation variance is quite informative in promoter prediction, with information gain of 0.67, while it has limited contribution when predicting enhancers, with information gain of 0.06. This result suggests that a set of methylation-associated features are needed for predicting regulatory regions and these features have different predictive power for promoters and enhancers.