Evidence for sequence biases associated with patterns of histone methylation
© Wang and Willard; licensee BioMed Central Ltd. 2012
Received: 9 June 2011
Accepted: 18 July 2012
Published: 2 August 2012
Skip to main content
We’re sorry, something doesn't seem to be working properly. Please try refreshing the page. If that doesn't work, please contact us so we can address the problem.
© Wang and Willard; licensee BioMed Central Ltd. 2012
Received: 9 June 2011
Accepted: 18 July 2012
Published: 2 August 2012
Combinations of histone variants and modifications, conceptually representing a histone code, have been proposed to play a significant role in gene regulation and developmental processes in complex organisms. While various mechanisms have been implicated in establishing and maintaining epigenetic patterns at specific locations in the genome, they are generally believed to be independent of primary DNA sequence on a more global scale.
To address this systematically in the case of the human genome, we have analyzed primary DNA sequences underlying patterns of 19 different methylated histones in human primary T-cells and patterns of three methylated histones across additional human cell lines. We report strong sequence biases associated with most of these histone marks genome-wide in each cell type. Furthermore, the sequence characteristics for such association are distinct for different groups of histone marks.
These findings provide evidence of an influence of genomic sequence on patterns of histone modification associated with gene expression and chromatin programming, and they suggest that the mechanisms responsible for global histone modifications may interpret genomic sequence in various ways.
The basic unit of eukaryotic chromosomes is the nucleosome, comprised of DNA wrapped around a histone octamer complex . Nucleosomes can adopt distinct chromatin structures, associated with specific post-translational modifications of histone proteins at their N-terminal tails . Such histone modifications can be stably maintained through cell divisions and are strong candidates to serve as marks for epigenetic regulation. Epigenetic modifications either influence the accessibility of cis-regulatory elements in genomic DNA or recruit chromatin-binding proteins to regulate gene expression.
The histone code or epigenetic code theory proposes that the combinatorial nature of histone modifications and histone variants represents information that greatly extends the content and display of genetic information alone [3–8]. As one approach to testing this model, a number of studies have begun to define genome-scale maps of various histone modifications and other chromatin constituents and to relate such maps to cellular phenotypes in organisms from yeast to human [9–11]. Significantly, the ENCODE project reported a high-resolution epigenetic landscape of 1% of the human genome that could be used to accurately predict gene expression in a variety of cell types  and increasing amounts of ENCODE whole-genome data are becoming available on genome browsers .
Alternatively, however, it is possible that the genomic sequence could influence different epigenetic domains and that primary DNA sequence itself could thus bias their formation and/or maintenance on a genome-wide basis (Figure 1, right). This hypothesis can account for chromatin state persistence across cell cycles and even through generations . Consistent with the predictions of this model, several recent studies have demonstrated that some epigenetic marks, such as the position of methylated CpG islands [21–23] or the presence of trimethylated forms of histone H3 at lysine 4 (H3K4me3) or at lysine 27 (H3K27me3) , are correlated with particular features of complex genomes. Furthermore, binding of CTCF to cis elements has been shown to form well-positioned nucleosome arrays around them, providing a potential mechanistic link between primary genome sequence and chromatin state . Recently, it has been shown that mammalian chromosomes are organized into megabase-size domains stable across cell types and conserved across species, with specific genomic features marking their boundaries . While these findings support the genomic influence model in specific local instances, it remains a question whether the histone code, consisting of many different types of histone modifications and variants, is associated with primary DNA sequence genome-wide.
In this study, we have tested this hypothesis for the human genome by investigating correlations of genomic regions associated with a wide range of methylated histones with the underlying DNA sequence. To achieve this, we used high-resolution, genome-wide epigenetic maps and applied a machine learning approach called Support Vector Machine (SVM, reviewed in ), which can be used successfully to computationally predict other epigenetic states [22, 28]. Like other machine learning algorithms, SVM has the ability to recognize patterns in a given dataset (used for training), and the resulting models can then be tested with previously unseen examples and new predictions can be made accordingly. Thus, an association between genome sequence and epigenetics can be tested by investigating whether or not primary sequence alone is sufficient to predict the genomic location of the histone code.
To investigate whether histone marks in general are associated with underlying genomic sequence, we analyzed a dataset containing the profiles of 19 different methylated histones from genome-wide ChIP followed by deep sequencing experiments (ChIP-Seq) in human CD4+ T-cells . (Of the 20 marks reported by Barski et al., H3K29me2 was excluded because of low sequence coverage in the original dataset.) We asked whether genomic sequence could distinguish between regions that are enriched or depleted for these histone marks in this cell type.
To avoid potential over-fitting problems associated with many machine learning approaches when the number of variables greatly exceeds the number of samples , we used a large sample size (up to 1,000 depleted and 1,000 enriched regions randomly selected from the genome, Additional file 2: Table S2) for each methylated histone and performed SVM classification experiments. Each classification experiment involved training a SVM model using 1,000 regions randomly sampled from the above 2,000 regions, and testing its prediction performance on the rest. This classification procedure was reiterated ten times to minimize sampling bias. If genome sequence has no influence on histone modification placement, SVM models should randomly assign sequences to be “enriched” or “depleted” (thus, prediction accuracy =50%). However, for all 19 histone marks in TSS regions and for 14 histone marks in non-genic regions, the prediction accuracy of the SVM models was significantly greater than 50% (p < 0.01), with over 75% accuracy in most cases (Additional file 3: Table S3). Three specific examples, the classifications for a euchromatin mark H3K4me2 and two heterochromatin marks, H3K9me3 and H3K27me3, are shown in Figure 2C. Classification performance of these examples was further visualized using their Receiver Operating Characteristic (ROC) curves (Figure 2D). Specifically, the Area Under Curve (AUC) values from these ROC curves are around 0.85, suggesting the models have very high discrimination power (AUC = 1.00 means perfect classification while 0.50 means none). In contrast, AUC values drop to ~0.50 and classification ability is completely lost when models were trained without the correct enrichment/depletion information among the training samples (“permuted” in Figure 2D). Thus, these results demonstrate that underlying sequence alone can discriminate, to a substantial degree in most cases, regions enriched or depleted for most methylated histone marks.
It has been shown that different modifications of histones can promote or interfere with one another [32, 33], resulting in co-occupancy of some marks in the same genomic region, while others are mutually exclusive. Alternatively, histone-modifying enzymes may recognize similar or different sequence signals among chromosomal domains. Sequence-based models provide a powerful tool to explore these possible inter-relationships even without complete experimental coverage. Based on the similarity among their occupied sequences, or in other words, the ability to use models trained with one histone type to predict others, we performed cluster analysis on the histone marks for which we had obtained accurate SVM models (cross-validation rate > =75%). By cluster analysis, the 16 methylated histones in TSS regions fall into three groups (Additional file 1: Figure S3A). Many mono-methylated histones, as well as H3K4me2-3 and H3K79me3, are associated with a similar set of TSS regions (Additional file 4: Figure S3A and Additional file 5: Table S4), consistent with the fact that many such histone modifications are associated with active transcription . Notably, these sequences are very different from the two distinct sequence clusters occupied by repressive marks, H3K9me2-3 and H3K27me2-3. The two repressive marks can also be readily distinguished from each other based on local sequence, which again is consistent with existing experimental data on a broad scale . Interestingly, for non-genic regions, H3K9me3 still occupies different sequences from those associated with other histone marks, but H3K27me3-occupied non-genic sequences cannot be distinguished from those occupied by marks of active chromatin (Additional file 4: Figure S3B). These results suggest that whatever mechanisms are responsible for placing these marks in the genome are either directly or indirectly dependent on genomic sequence. Further, the data indicate that some of these mechanisms must be different for genic and non-genic regions.
While the above analyses all support the existence of predictive genomic characteristics underlying methylated histone patterns in the human genome, they do not by themselves identify the particular sequences involved. We therefore next explored the specific sequence features supporting each of the above SVM classifications to address the nature of the sequence bias.
To overcome the dataset dependency nature of F-score calculations, we selected sequence features that consistently have the highest F-scores across multiple, independent subsets of sequences. The Pearson linear correlation coefficients between each feature and whether or not a region is enriched for a particular histone mark were also computed to reveal their degrees of association. Finally, these consistent features were assessed by SVM to evaluate their discrimination power as single features (Supplemental Table S5). On the basis of this analysis, highly discriminative sequences in TSS regions display three notable features: regions marked by methylated histones that are associated with active transcription (e.g. H3K4me2) are enriched for CpG-containing sequence features but are depleted for AT-rich sequences; H3K27me3 regions are enriched in poly-purines and poly-pyrimidines, but are depleted in AT-rich features; H3K9me2 and H3K9me3 regions are enriched for AT-rich features but depleted in GC-rich ones (Figure 4B and Additional file 6: Table S5). In contrast, the patterns are much simpler among non-genic regions: GC-richness is associated with all the methylated histones except regions enriched in H3K9me3, in which the association is the opposite (Figure 4B). These features collectively reveal, at least to a first-order, the trends of the genomic bias for methylated histone patterns, based on which epigenetic information can be accurately inferred from primary sequence information.
For H3K27me3, the majority of short sequences that have high F-scores are parts of human Alu repetitive elements, SINE non-LTR retrotransposons that comprise at least 10.8% of the human genome (calculated from UCSC genome browser). To systematically investigate the potential contribution to the genomic code of all known repetitive elements, each of which can be viewed as a unique combination of shorter sequences, we performed F-score and SVM analysis among a training set including 2,000 TSS regions. The Alu family is the only repetitive element whose content has an unusually high F-score and by itself it predicts ~70% of training samples (Figure 4C). In addition, Alu content is negatively correlated with H3K27me3 enrichment (corr = −0.42). This is consistent with a previous report that H3K27me3-enriched regions in mouse ES cells are relatively depleted for transposons . This observation is clearly not just a reflection of the absence of repeat-associated sequences from ChIP experiments, because other repetitive sequences do not have discriminative power, nor does Alu distinguish H3K4me2 and H3K9me3 in TSS regions (not shown). As H3K27me3 represents a repressive mechanism for controlling the expression of genes [24, 36, 37], it is likely that Alu insertions into vicinities of these genes are selectively eliminated during evolution to ensure gene function in development. Consistently, Alu elements have been found to be excluded from tissue-specific genes but enriched in housekeeping genes, a genomic trend apparent both at the level of individual genes and at the level of megabase-sized chromosome bands [38–41].
It is possible that SVM models simply classifies GC/AT content, as there is a significant GC bias in the features listed in Table S5. To test this hypothesis, we first randomly permuted the base order of selected regions to preserve the base composition but to randomize all higher k-mer (k = 2,3,4,5) content (“singlet permutation”; see Methods for a description of the parameters altered in these permutation experiments). For H3K9me3 in TSS regions, base composition seems to be the only key factor for accurate prediction, because none of the permutations tested significantly affected the prediction accuracy (Additional file 4: Figure S4). In contrast, prediction on these randomized sequences for H3K4me2 and H3K27me3 in TSS regions was completely lost, with the models classifying all regions to be enriched (H3K4me2) or depleted (H3K27me3) regardless of the original labels (Additional file 4: Figure S4).
To explore further the apparent dependence of H3K4me2 and H3K27me3 predictions on sequence content, we permuted the TSS sequences to preserve both base composition and dinucleotide frequencies, but altering all higher k-mer content (“doublet permutation” in Additional file 4: Figure S4). We observed a significant restoration in prediction rate for H3K4me2, likely due to the restoration of correct CpG content, since prediction was lost in a control experiment when we only permuted the content of CpG-containing sequence features while keep everything else the same (Additional file 4: Figure S4). In contrast, while addition of the correct dinucleotide content partially restored H3K27me3 prediction, this appears to be independent of CpG content, since CpG permutation itself has little effect. These findings suggest that models for H3K27me3 recognize CpG-independent, higher-order (k > 2) sequence features while models for H3K4me2 recognize CpG-dependent features and models for H3K9me3 depend only on base composition, not appreciably on content. Combined, these results are consistent with the highly predictive features identified based on F-scores and suggest that the observed sequence bias for histone modification exists at multiple levels.
Our results establish a strong association between the primary DNA sequence and an extensive set of histone methylation marks in multiple human cell lines, supporting the hypothesis that at least part of the underlying basis for the histone code is encoded in the genome. This study also provides an example of how genome organization and sequence might directly impact biological function(s). Furthermore, the ability of sequence models to make such predictions provides insights into the establishment and maintenance of epigenetic modifications in complex genomes.
It is worth noting that, for several reasons, the prediction accuracy of sequence-based models for the histone code as outlined here is likely an underestimate. First of all, we have considered only the linear combination of short k-mers in local regions. Future analyses using additional information in the sequence (e.g. non-linear combinations of k-mers, spatial relationships of k-mers within the genome, etc.) will be required to further explore the predictive potential of the genomic sequence and to thus further define the genomic code . Second, SVM models tend to give higher probability to genomic regions that have high levels of a particular epigenetic mark, even without such information being provided in the training sets. Supplying quantitative information to the training sets will likely enable the models to make more accurate predictions. Finally, some “false” predictions may arise from the noise of the experimental approaches, and inclusion of additional datasets may help identify and thus reduce that noise.
A nucleosome code has been described for nucleosome organization and positioning in yeast and C. elegans[44–46]. Such a nucleosome code could influence or contribute to our selection of regions in the human genome enriched/depleted for many histone marks in the same way as in yeast . However, nucleosome density appears to be largely invariable except for limited changes in response to signaling or development [14, 48]. Therefore, although nucleosome density data that might serve as a control for our analysis are unavailable, our conclusions appear unlikely to be biased by nucleosome placement. Furthermore, most of our analyses are based on several kilobases of sequence containing many nucleosomes, which should effectively reduce the bias even if there is a local nucleosome positioning code in complex genomes. In addition, by separately analyzing TSS and non-genic regions we avoided potential bias from features specifically associated with transcription.
How might primary genomic sequence influence epigenetic modification(s)? Certain types or combinations of sequences could form higher-order signals with specific conformation, which might be used to recruit histone-modifying enzymes and/or to facilitate their spreading in cis to form the initial epigenetic framework. Different sequence composition may target, or have different affinity for, different chromatin remodeling complexes [49, 50]. Other sequence-specific events, including transcription and chromatin remodeling factors, could further tailor this framework to create specific and dynamic epigenetic patterns downstream during development, in ways not necessarily dependent on sequence in that later context. The dataset evaluated here thus provides only a single “snapshot” of epigenetic modification in time, reflecting the combined results of genomic sequence influence and subsequent cell type-specific chromatin remodeling. Notwithstanding the proposed general nature of the genomic bias reported here, the situation is no doubt further complicated at specific loci or in specific regions by various genomic features such as insulators and barriers, replication origins, or centromeres and telomeres.
A prerequisite for the genomic influence hypothesis is that the enzymes responsible for establishing epigenetic marks recognize subtle structural difference of underlying genomic sequence. Consistent with this concept, it has been recently shown that Dnmt3a, an enzyme that catalyzes de novo CpG methylation, recognizes CpG periodicity signals encoded in the genome . Histone-modifying enzymes could similarly utilize structural information, but not necessarily the same information, because our preliminary results indicate that CpG periodicity is not important for prediction methylated histone marks (data not shown). GC- and AT-rich sequences have been shown to differ in chromatin conformation as well as histone modifications in yeast , which further strengthens the link between sequence and histone modifications. Further explorations of the information encoded in the genome, especially sequence features discovered in this study, and a deeper understanding of the structural hierarchies encoded in the genome, should help to better inform how the genome sequence is interpreted or “read” by epigenetic factors.
Conceptually, sequence could be the major driving force influencing the epigenetic framework, which, if true, could help explain the persistence of genome-scale epigenetic modifications and the observed clustering of tissue-specific or generalized functions in complex genomes. For maintaining “housekeeping” functions across different cell types, for example, it would seem to be much less expensive to use the genomic sequence instead of complicated and error-prone gene regulatory networks. This hypothesis predicts that the epigenetic framework should be largely invariant in cells from different developmental contexts. Indeed, this prediction is supported by recent genome-scale profiling studies of several histone marks in embryonic stem cells and lineage-committed cells [53–55]. The specific and dynamic epigenetic patterns that have been demonstrated during development and the cell cycle  appear to be at a much smaller and more local scale than the overall framework laid down by sequence features across the genome.
This study demonstrates a strong association between the primary DNA sequence of the human genome and an extensive set of histone methylation marks described in multiple human cell lines and thus supports the hypothesis that at least part of the underlying basis for the histone code is encoded in the genome. These findings illustrate how genome organization and sequence might establish and maintain epigenetic modifications in complex genomes and thereby directly impact biological function(s).
In order to select enriched regions, we first simulated a null distribution based on the assumption that the number of sequence tags observed in a region with defined size follow a Poisson distribution . Based on this null distribution, two cutoff counts were made at 1% and 99%, thus only allowing 2% False Discovery Rates (FDR).
We considered ±2 kb sequence flanking the TSS of all known RefSeq genes (UCSC hg18). Duplications were subsequently eliminated to yield 19,812 unique regions. For each histone modification, enriched regions were defined to have tag counts greater than the 99% line (based on the null distribution, above), while depleted regions have tag counts less or equal to the 1% cutoffs. The other regions were considered “neutral” and were not analyzed further (see Figure 2A).
To identify a set of non-genic regions for comparison with TSS regions defined above, we first selected genomic regions that are (1) not within 100 kb of any RefSeq genes; (2) free of known repetitive sequences and (3) >1.5 kb. This resulted in 43,039 regions from the current UCSC human genome assembly (hg18). These regions were further divided into enriched, depleted, or neutral regions in the same way as the TSS regions, except that the 1% and 99% cutoff counts were normalized to the size of each region.
Unless specified, DNA sequences were extracted from UCSC assembly (hg18), and k-mer (k = 1,2,3,4,5) content (count/size) was calculated for each region, yielding a total of 1,364 sequence features to represent the underlying sequence information of a particular genomic region. For models with repeat features, the content of repeat families as well as individual repeats were extracted from UCSC annotations.
LibSVM version 2.84 software (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) was used for SVM classification. With the assumption that the data used in this paper are linearly separable, we used linear SVM models throughout the analysis. In a typical classification experiment, a proportion of the regions were selected randomly from each dataset for training and cross validation purposes, while an independent set was used for testing. Such training/testing sessions were performed 10 times and p-values were based on paired t-tests.
Briefly, we selected 100 features with highest F-scores  for each 100 randomly sampled regions that were either enriched or depleted for a specific epigenetic mark. The above sampling/feature selection process was repeated 100 times and features that were selected more than 30 times were defined as highly discriminative features. These features were tested for their correlation with whether or not the region is enriched and for their SVM classification performance as single features.
For whole-genome prediction, we used sliding windows of 2.5 kb in size with 2 kb overlaps to cover the entire human genome. To predict the probability for each window to be bound by an epigenetic mark of interest, we used SVM models based on 2,000 regions enriched for a histone mark (1,000 non-genic regions and 1,000 TSS regions, for both enriched and depleted) with 1,364 sequence features. The resulting predictions were made into local UCSC genome browser tracks and visualized along with other genomic information.
Using datasets from Barski et al., SVM models were trained with 2,000 regions for each histone mark, and those with a 10-fold cross-validation rates greater than 75% were used to predict other histone marks, excluding H3K79me2 (too few regions). The dissimilarity in their occupied sequence between any two histone marks, or distance, was defined as the average misclassification rates of their mutual predications. Such distances were subsequently used to perform hierarchical cluster analysis. Dendrograms, as in Additional file 4: Figure S2, were drawn with default color threshold (0.7) for coloring clusters. Clustering analysis was performed using MATLAB™ software version 7.3.0.
Singlet- and doublet-sequence permutations were performed by shuffling the positions of either single nucleotides or dinucleotides while maintaining constant the base composition or the dinucleotide frequency of the original sequences, respectively. Feature extraction was then performed on the permuted sequences. Since it is not possible to permute CpG only without affecting other sequence motifs, we permuted the frequency of all CpG-containing motifs among the sequences to simulate the effects of CpG permutation. A necessary side-effect of CpG permutation is that the sum of all k-mer frequencies within individual sequences may not be precisely 1. However, since these values are usually very small and the direction of change (increase or decrease) for each k-mer will be random, the sum is likely to be very close to 1 after the permutation.
We thank J Zhu for computational resources and discussions, and U Ohler, T Furey, G Crawford and members of the Willard Lab for discussions and comments on the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.