Extrapolating histone marks across developmental stages, tissues, and species: an enhancer prediction case study

Background Dynamic activation and inactivation of gene regulatory DNA produce the expression changes that drive the differentiation of cellular lineages. Identifying regulatory regions active during developmental transitions is necessary to understand how the genome specifies complex developmental programs and how these processes are disrupted in disease. Gene regulatory dynamics are mediated by many factors, including the binding of transcription factors (TFs) and the methylation and acetylation of DNA and histones. Genome-wide maps of TF binding and DNA and histone modifications have been generated for many cellular contexts; however, given the diversity and complexity of animal development, these data cover only a small fraction of the cellular and developmental contexts of interest. Thus, there is a need for methods that use existing epigenetic and functional genomics data to analyze the thousands of contexts that remain uncharacterized. Results To investigate the utility of histone modification data in the analysis of cellular contexts without such data, I evaluated how well genome-wide H3K27ac and H3K4me1 data collected in different developmental stages, tissues, and species were able to predict experimentally validated heart enhancers active at embryonic day 11.5 (E11.5) in mouse. Using a machine-learning approach to integrate the data from different contexts, I found that E11.5 heart enhancers can often be predicted accurately from data from other contexts, and I quantified the contribution of each data source to the predictions. The utility of each dataset correlated with nearness in developmental time and tissue to the target context: data from late developmental stages and adult heart tissues were most informative for predicting E11.5 enhancers, while marks from stem cells and early developmental stages were less informative. Predictions based on data collected in non-heart tissues and in human hearts were better than random, but worse than using data from mouse hearts. Conclusions The ability of these algorithms to accurately predict developmental enhancers based on data from related, but distinct, cellular contexts suggests that combining computational models with epigenetic data sampled from relevant contexts may be sufficient to enable functional characterization of many cellular contexts of interest. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1264-3) contains supplementary material, which is available to authorized users.


Background
Tissue-specific gene regulatory regions are essential for specifying the proper gene expression patterns that drive cellular differentiation and development in animals [1]. However, the recognition of these regulatory regions is challenging. Recently, the ability to assay a range of epigenetic modifications to DNA and histones so-called "functional genomics" assays on hundreds of cell lines and tissue samples. Despite their Herculean efforts, these projects have comprehensively analyzed only a small fraction the modifications in cellular contexts of interest to the scientific community. We are still far from a complete picture of the dynamics of DNA modification and binding across different cells.
Moreover, the initial selection of cellular contexts to characterize by ENCODE was based mainly on practical considerations, such as sample availability, ease of growth, and high yield. The relevance of functional genomic data from the best characterized cell lines-e.g., the three "Tier 1" ENCODE cell lines: B-lymphocytes (GM12878), embryonic stem cells (H1-hESC), and a leukemia cell line (K562)-to other cellular contexts is unclear due to changes associated with immortalization and the transition to a cancerous state. In addition, these cell lines' progenitors are developmentally distant from many cells of interest. Indeed, most primary tissues and developmental stages have few data sets available, and these are insufficient to produce a full picture of the functional state of the genome in these cellular contexts.
In this environment, researchers with interests outside of the few well characterized cells are presented with a difficult choice between mapping existing data from other contexts to their own or performing functional genomics analyses in their systems of interest. Furthermore, functional genomics analysis of certain cells may never be possible for technical or ethical reasons, e.g., lack of material or the use of protected tissues. As a result, the mapping of functional genomics data from one context to another is common practice, but the situations in which it is appropriate and the potential pitfalls are not clear. A deeper understanding of the relationships between functional genomics data across contexts is needed to identify the conditions in which mapping across contexts is justified.
Recent work comparing chromatin accessibility and epigenetic modification profiles between pluripotent cells and lineage committed cells has revealed the dynamic nature of these modifications [8][9][10][11]. Embryonic stem cells display more accessible chromatin and potentially active regulatory sequence than differentiated cells [12], and lineage commitment is accompanied by activation of lineage-specific regulatory regions and an overall repression of regions active in embryonic stem cells [13]. The relationships encoded in DNA methylation and chromatin state profiles of different cell types are often sufficient to accurately reconstruct hierarchical relationships based on the anatomical and developmental similarity of cellular lineages [13][14][15]. These results suggest that, given the proper models, these relationships could be exploited to "impute" epigenetic information across related cellular contexts to enable functional annotation. Indeed, integrating diverse functional genomics data sets (without regard to their developmental relationships) improves gene regulatory enhancer prediction [16].
In this paper, I evaluate the ability of existing genomewide H3K4me1 and H3K27ac data sets to identify known gene regulatory enhancers active in a related, but distinct, context. I focus on heart development in mouse, because this process has both a multi-stage characterization of these two prominent enhancer-associated histone modifications [9] and several hundred experimentally validated enhancers [17,18]. I introduce a supervised machine learning prediction framework in which I analyze the ability of existing functional genomics data to predict enhancer activity across three dimensions: developmental time within an organism, different tissues within an organism, and corresponding tissues between species. I find that developmental heart enhancers can be predicted very accurately using data from related contexts. Data from all contexts considered, including across species, provide useful information and perform better than random; however, the developmental proximity of a cellular context to the target is correlated with its utility.

Preliminaries
My goal was to evaluate the ability of two enhancerassociated histone modifications, H3K4me1 and H3K 27ac, collected from different cellular, developmental, and organismal contexts to identify known mouse developmental enhancers ( Figure 1). I used H3K4me1 and H3K27ac sites identified via ChIP-Seq on four stages of a directed differentiation of ES cells (E0) to mesoderm (E4) to cardiac precursors (E5.8) to cardiomyocytes (E10) [9]. All other histone mark data I used, including marks from embryonic day 14.5 (E14.5) and eight week old (adult) hearts, were collected by the ENCODE project [6]. Note that the heart data from the first four contexts were collected from a single cell type, while the last two are from full hearts (see Discussion). Other histone modifications are likely informative about enhancer activity [16]; however, we only consider H3Kme1 and H3K27ac, because they have been associated with enhancer activity in many studies and have both been collected in a consistent manner across a range of cellular and developmental contexts.
I took mouse enhancers from the VISTA enhancer browser, a database of experimentally validated DNA sequences (and their genomic locations) that drive gene regulatory enhancer activity at E11.5 in transgenic mice [17]. The mouse genomic regions tested for enhancer activity in VISTA were primarily selected due to evolutionary conservation or association with the P300 transcriptional co-activator protein [17][18][19]. The data set consisted of 217 enhancers, 90 of which drove gene  Figure 1 Overview of the data and analyses. (A) I collected existing genome-wide maps of two histone marks, H3K4me1 and H3K27ac, from stages of a directed differentiation of mouse embryonic stem (ES) cells into cardiomyocytes, from heart tissues collected from several life stages, and from several other tissues. I evaluated how well these marks, which are associated with enhancer activity, could predict experimentally validated heart enhancers in E11.5 mice ("Target"). (B) I took a supervised machine learning approach to this problem by constructing feature vectors for validated enhancers and control regions based on the presence or absence of these histone modifications at their genomic locations. I created classifiers based on different subsets of the data from the cellular contexts given in (A) and evaluated them using cross validation.
expression in the heart. In some analyses, I also considered 88 human DNA sequences shown by VISTA to have heart enhancer activity in transgenic mice. I evaluated the ability of the different histone modification data to identify known heart enhancers in a supervised machine learning framework ( Figure 1). This type of approach has had great success at identifying enhancers in previous studies [16,[20][21][22]. In short, I aimed to classify genomic regions as either positive (having heart enhancer activity at E11.5) or negative (no heart enhancer activity at E11.5) based on the overlap of H3K4me1 and H3K27ac datasets from different cellular, developmental, and species contexts. I refer to these data as "features". In most analyses, validated heart enhancers were the positives. I considered two different sets of negatives that are expected to have different properties: random genomic regions without known enhancer activity matched to the chromosome and length distribution of the heart enhancers ("genomic background") and validated enhancers of gene expression in other E11.5 tissues ("other enhancers"). I anticipated that the histone modifications would be better able to distinguish heart enhancers from the genomic background regions than enhancers active in other tissues. When the random regions were used as negatives, I included 10 matched negatives for each positive. I used six common machine learning algorithms to explore how well these features can predict heart enhancers. I performed five-fold cross validation, in which 20% of the known examples were held out for evaluation of classifiers trained on the remaining data, and evaluated the results by calculating receiver operating characteristic (ROC) curves averaged over the five folds. Unless otherwise stated, I report the results obtained with Random Forests [23], because they performed well and, as described in the Results, the conclusions are robust to the the algorithm used.

Heart enhancers can be identified accurately using data from different developmental stages
In this section, I evaluate the ability of mouse H3K4me1 and H3K27ac data from different stages of heart development to predict validated mouse heart enhancers at E11.5. I first trained a random forest classifier using all the mouse heart histone mark datasets as features ( Figure 1A) to distinguish the heart enhancers from matched regions taken from the genomic background. This classifier was able to accurately identify E11.5 heart enhancers; the area under the resulting ROC curve (ROC AUC) was 0.96 ( Figure 2A).
Next, I evaluated the ability of classifiers trained using the same features to distinguish the heart enhancers from enhancers active in a diverse set of other tissues (listed in Additional file 1: Table S1) at the same developmental stage. As expected from recent results [16], this task

A B C
Between Tissues Between Develop. Stages Figure 2 Heart enhancers can be identified accurately using data from different cellular contexts. A random forest classifier was trained to distinguish E11.5 heart enhancers from the genomic background and from enhancers active in other tissues. Each classifier used H3K4me1 and H3K27ac patterns from different sets of cellular contexts as features ( Figure 1). (A) In five-fold cross validation, the classifiers based on data from other stages of heart development accurately identified E11.5 heart enhancers; these classifiers achieved ROC AUCs of 0.96 vs. the genomic background and 0.85 vs. other enhancers. (B) The classifiers that used data from non-heart tissues as features performed well (AUCs of 0.91 and 0.72), but were worse than the developmental-stage-based classifiers. (C) When mapped between species, the histone marks from mouse heart development were also able to identify human developmental enhancers better than random (AUCs of 0.87 and 0.82). Note that the results in (C) should not be directly compared to the Stage and Tissue results, because they are based on different sets of enhancers. As expected, distinguishing heart enhancers from the genomic background was easier than from non-heart enhancers in each scenario.
proved more challenging, but the random forest classifier still performed well (Figure 2A; ROC AUC = 0.85).

Histone marks from developmental stages nearby the target stage are the most informative
To investigate the contribution of histone marks from different developmental stages to the predictions, I computed the normalized "feature importance" for each feature dataset in the random forest using the Gini impurity metric [23,24]. For both prediction tasks-heart enhancers vs. background and vs. other enhancershistone marks from the two stages adjacent to the target stage (E10 and E14.5) had the majority of the importance (Table 1). In both cases, the importance of features decreased monotonically with distance from the target stage. Not surprisingly, given their pluripotent state, marks from embryonic stem cells were the least important.
Looking at these data in the simpler terms of the overlap of known enhancers with the different histone marks reveals a similar pattern. I computed the enrichment of each heart histone mark dataset within the E11.5 heart enhancers compared to the background regions. All histone mark datasets showed significant enrichment (p < 0.0003 for each; Fisher's exact test) for overlap with the enhancers (Figure 3; odds ratios >5 for all), but as expected, marks collected from the closest developmental stages to the target (E10 and E14.5) are the most enriched for overlap with the enhancers (odds ratios from 60.5 to 213.9). The embryonic stem cell's marks showed the least significant enrichment.

Histone marks from different tissues can accurately identify heart enhancers
Next, I evaluated the ability of data from non-cardiac mouse tissues to predict E11.5 heart enhancers. I used H3K27ac and H3K4me1 from 29 different mouse noncardiac tissues and cell lines collected by the ENCODE project (listed in Additional file 1: Table S1) to train random forest classifiers. This classification strategy also performed well at distinguishing heart enhancers from the genomic background ( Figure 2B; ROC AUC = 0.91); however, it did not achieve the level of accuracy attained with heart data. The non-heart features performed relatively poorly at distinguishing heart enhancers from other types of enhancers ( Figure 2B; ROC AUC = 0.72). Consistent with these patterns, classifying the other (i.e., non-heart) enhancer regions against the genomic background using the diverse set of non-heart histone marks achieved strong performance (ROC AUC of 0.91) and was better than using heart histone marks alone for this task (ROC AUC Figure 3 Odds ratios for E11.5 heart enhancer overlap by H3K4me1 and H3K27ac from different cellular contexts. Mouse E11.5 heart enhancers are significantly enriched, compared to matched regions from the genomic background, for H3K4me1 and H3K27ac from all developmental stages listed (p < 0.01 for all; Fisher's exact test). As expected, the modifications from contexts flanking E11.5-E10 and E14.5-show the strongest enrichment, while the early stages and adult heart tissue have lower enrichment. Only three E11.5 heart enhancers are not marked by at least one of H3K4me1 or H3K27ac at E14.5.
of 0.83). This suggests that these feature data can be used across tissues to find active regulatory regions, but without data from the relevant tissues, predicting specific tissues of activity is more challenging. An H3K4me1 dataset from adult placental tissue received the highest normalized feature importance (10.2%) in the non-heart tissue based classifier of heart enhancers vs. the genomic background. However, I found few clear patterns in the feature importances for these classifiers (Additional file 2: Table S2). No tissue dominated the importances, and many datasets from a variety of tissues including liver, limb, embryonic fibroblasts, and brain, had at least 5% of the importance. A largely different set of tissues and marks were found among the most important in the classifier of heart vs. other enhancers.

Histone marks from mouse heart tissues can accurately identify human heart enhancers
It may never be possible to collect sufficient material for histone mark mapping from some cellular contexts due to ethical or technical reasons. In such cases, using histone marks from model organisms to analyze the target species is appealing. To model this situation, I evaluated the efficacy of using H3K4me1 and H3K27ac collected in mouse hearts to predict known human heart enhancers. Studies of the similarity of transcription factor binding [25], methylation [26], and gene expression [27] suggest that this may be feasible due to the considerable similarity in these events in corresponding tissues across distant species.
I mapped validated human heart enhancer sequences to their corresponding locations in the mouse genome, and then used the same random forest strategy to distinguish these regions from the genomic background and other known non-heart human enhancers mapped to mouse. Only 7% of the human heart enhancers overlapped a validated mouse heart enhancer, so they represent a largely independent set of genomic regions to classify. (This should not be taken as an estimate of the number of enhancers conserved between these species, because testing of both the human and mouse sequences for a region was relatively rare).
The mouse H3K4me1 and H3K27ac heart marks were able to accurately identify the human heart enhancers in both prediction tasks ( Figure 2C; ROC AUCs of 0.87 and 0.82); however, they did not perform as well as they did for mouse heart enhancers ( Figure 2A). Interestingly, the cross species data were better able to distinguish the human heart enhancers from other types of human enhancer than the marks from mouse non-heart tissues were for mouse heart enhancers ( Figure 2B, ROC AUC = 0.72). This is consistent with the greater similarity in gene expression in corresponding tissues across species than between different tissues within the same species. However, direct comparison is complicated by the fact that different enhancer regions are being classified in these two analyses.
The human heart enhancers considered here were tested in transgenic mice at E11.5; as a result, the set of human heart enhancers is biased in (at least) two ways. First, they are sufficiently evolutionarily conserved to be mappable between species; 23% of human enhancers did not reliably map to the mouse genome. It is possible that non-conserved human enhancers could be more difficult to predict using mouse data. Second, these regions are active when placed in a mouse. This second bias is unlikely to have a dramatic effect, since the basic transcriptional machinery evolves far more slowly than the regulatory sequences that it acts upon. Indeed, non-conserved enhancer sequences have been shown to maintain activity over much greater distances [28].

The heart enhancer prediction results are robust to the machine learning algorithm used
Finally, to ensure that the patterns I found in the ability of different histone mark datasets to predict heart enhancers were not specific to a particular classification framework, I repeated all the predictions and evaluations using six common machine learning algorithms: random forests, boosting (AdaBoost), linear support vector machines (SVMs), decision trees, naive Bayes, and knearest neighbor (KNN) classification. Random forests, AdaBoost, and SVMs all performed similarly well and outperformed the three other approaches ( Table 2). No matter the algorithm or overall performance, my general conclusions held: histone marks from diverse contexts can predict heart enhancers better than random, and heart enhancers can be better identified using heart histone mark data than data from other tissues.

Histone modifications can be accurately predicted using modifications from other contexts
Thus far, I have focused on predicting enhancers using histone modification data. However, histone modifications are informative about functions beyond enhancer activity, and thus, predicting histone modifications themselves across cellular contexts could have broader utility.
To explore this possibility, I evaluated the ability of mouse histone marks to predict each other. In other words, rather than predicting known enhancers, I held out the histone marks collected from each heart developmental stage in turn and trained a classifier to distinguish them from the genomic background using the other marks as features. The locations of both H3K4me1 and H3K27ac at each heart developmental stage were able to be determined more accurately than random guessing (Table 3). However, there was considerable variation in performance. Not surprisingly given their pluripotent state and the broadly active regulatory landscape that accompanies it [12,13], modifications in E0 cells were by far the most difficult to identify from the available data. Stages in the middle of the differentiation attained the highest scores, likely due to the presence of more temporally flanking modification data.
I also evaluated the ability of the heart H3K27ac and H3K4me1 marks to identify a set of H3K27ac sites with differential activity in cardiomyocytes from adult mouse This table gives ROC AUCs (averaged over five cross-validation folds) for six common algorithms at distinguishing E11.5 heart enhancers from other enhancers based on marks from heart or non-heart tissues. hearts exposed to transverse aortic constriction, a surgical procedure that leads to cardiac hypertrophy [29]. The histone marks from healthy contexts were able to accurately predict these hypertrophy-sensitive sites from the genomic background (ROC AUC of 0.93). However, predicting whether a hypertrophy sensitive H3K27ac site would increase or decrease in strength was difficult (ROC AUC of 0.61), and the addition of modifications from non-heart contexts did not improve performance. A list of H3K27ac sites that did not change in activity was not available, so I could not test the ability to distinguish hypertrophy sensitive from insensitive sites.

Discussion
The results of my analyses suggest that there is considerable promise in using functional genomics data across contexts for the common task of identifying putative gene regulatory regions. However, the poor results based on data from early developmental stages indicate that, as expected, there is considerable variability in the utility of different data sets depending on the target context. This underscores the need for methods to highlight the most informative contexts in which to collect new functional genomics data and to "interpolate" across contexts using existing data. The analyses presented here required the integration of several datasets from different sources, and attributes of these data must be kept in mind while interpreting the results. First, though the heart enhancers validated by VISTA are commonly used to explore attributes of enhancers, they are likely not representative of all developmental heart enhancers. Most heart enhancers in VISTA were selected for analysis based on mammalian evolutionary sequence conservation or the presence of P300 in heart tissue [17]. Thus, it is possible that enhancers that do not have these attributes, e.g., species-specific enhancers, might be harder to identify using histone modifications [30]. It is likely that enhancer prediction performance could be improved further by incorporating additional features, such as other histone modification profiles, DNA sequence patterns, and evolutionary conservation as in previous studies [16,22], but achieving the most accurate enhancer prediction was not the main goal of this study. Nonetheless, my approach performs very well on this high confidence set of functionally validated enhancers. The differences in the ability of the histone modifications to distinguish heart enhancers from the genomic background and from other enhancers underscores the fact that identifying regions that have enhancer activity is easier than predicting their contexts of activity [16], and it suggests that different sets of features may be most informative for each task.
To obtain H3K4me1 and H3K27ac from as wide a range of heart developmental contexts as possible, I combined data collected from two different sources. The data from stages prior to the target stage (E0-E10) come from a directed differentiation of cardiomyoctyes, while the data from after the target (E14.5 and adult) were collected from whole hearts. Given that both data types show a clear trend of increasing performance with proximity to the target, I am confident in this trend. However, given this difference in origin, the feature importances assigned to marks from before and after the target stage may not be directly comparable representations of the relevance of these developmental stages to the target.
As more functional genomics data collected over successive developmental stages become available, it will be possible to model the gene regulatory landscapes of different developing tissues. For example, Nord et al. [31] recently collected genome-wide H3K27ac modification profiles from seven developmental stages (starting with E11.5 through adulthood) for mouse heart, liver, and forebrain tissues and found rapid and pervasive turnover in the H3K27ac modification landscape across the development of these tissues. As illustrated in their description of the two phases of liver development, integrating physiological knowledge about the development of different tissues with genome-wide modification profiles will likely be necessary to identify the most informative contexts to assay and to fully characterize developmental enhancer usage. Also, given the diversity of histone modifications, it is likely that the most informative contexts to sample will vary across modifications.
The analyses presented here take an initial step towards the ultimate goal of understanding how and when functional genomics data should be mapped across contexts. Understanding the dynamics of genome modifications across cell types will be relevant beyond this enhancer prediction case study; genome-wide profiles of TF binding and DNA and histone modification are informative about many possible functions of DNA.
Given the success of these relatively simple models at predicting histone modifications themselves, I anticipate that more explicit models of the dynamics of these features over developmental time in different tissues will enable interpolation of histone marks between different contexts and help identify the most informative contexts to assay. For example, a statistical method for modeling DNA methylation dynamics across development that can accurately impute missing methylation data across blood cell differentiation was recently published [32]. Similar approaches are likely to enable these analyses for other dynamic modifications and cellular lineages.

Conclusion
I show that integrating enhancer-associated histone marks from different cellular contexts achieves accurate prediction of heart enhancer activity in a context from which no data were collected. Thus, extrapolating existing functional genomics datasets across developmental, cellular, and species contexts has the potential to enable accurate gene regulatory enhancer prediction in many contexts, especially when paired with complementary data about evolutionary conservation and DNA sequence. More broadly, this suggests the promise of "interpolating" existing functional genomics data to related contexts to complement ongoing experimental efforts to characterize the diversity of mammalian cells.

Data
All analyses were performed using the February 2009 assembly of the human genome (GRCh37/hg19) and the July 2007 assembly of the mouse genome (MGSCv37/ mm9). Any data that were not in reference to these builds were mapped over using the liftOver tool from the UCSC Genome Browser's Kent tools (http://hgdownload. soe.ucsc.edu/admin/exe/). This tool was also used to map genomic locations between the human and mouse genomes.
Human and mouse enhancer sequences, genomic locations, and expression contexts were downloaded from the VISTA Enhancer Browser [17] on January 6, 2014. This consisted of 217 mouse enhancers, 90 of which had heart activity in E11.5 transgenic mice, and 848 human enhancers, 88 of which had heart activity.
The H3K4me1 and H3K27ac genome-wide profiles were taken from two sources: a recent directed differentiation of mouse embryonic stem cells into cardiomyocytes [9] and mouse heart and non-heart tissues analyzed by the ENCODE project [6]. Each study performed chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) to identify DNA associated with modified histones. I used the peak calls released by each study. Supplementary Table 1 lists all cellular contexts considered.