We present a systematic approach to assess discriminative features for enhancer identification in mouse ES cells. We initially established the importance of feature selection using a Naive Bayes classifier and subsequently used LASSO regularized multinomial logistic regression to systematically rank the feature weights. In our model we identified 10 key signatures for distinguishing enhancer regions from promoter-like regions and the rest of the genome. The top signatures predictive of enhancers are p300, H3K4me1, MED12, and NIPBL whereas we identified CpG islands, RNAPII-ser5, CTCF, GC percent and H3K4me3 as the top signatures for promoter identification. Our model predicts previously validated enhancers as well as novel enhancers around Oct4, Sox2, Nanog, Phc1, Lefty1, Lefty2, miR290, Tet1, and Zic3, all of which have important regulatory roles in regulating ES cell pluripotency. In summary, our predicted enhancers appear to regulate the expression of ES cell-expressed genes, coordinately with predicted promoters, are significantly associated with MTL compared to both PrL and unknown regions, enriched in motifs for pluripotency associated TFs and marked by tissue-specific chromatin modifications ( Additional file 10: Table S2).
The features used in our model for enhancer identification are p300, H3K4me1, MED12 and NIPBL, ranked in that order. We identified p300 as the top ranked enhancer signature; in addition p300 is a strong negative predictor for the unknown regions of the genome in our model. In ES cells p300 has been identified as a major H3K27 acetyltransferase ; therefore, it is not surprising that we found a high degree of overlap between our Enh candidate regions and H3K27ac. Although p300, when used as the sole Enh feature, ranked highly with respect to precision of enhancer identification the ability to recall enhancers in the training data was poor, in agreement with p300 being an incomplete representation of enhancers (Figure 1, rank 20). The incorporation of CBP, also linked to H3K27ac in ES cells, into the model may allow for improved enhancer identification . H3K4me1, ranked second for enhancer identification, has been used in several other studies to identify enhancer regions, though we find it performs quite poorly when used as the only enhancer feature (Figure 1, rank 27). This is perhaps due to the broader regions associated with H3K4me1 compared to the more discrete peaks of p300 or MED12.
MED12 and NIPBL are the third and fourth most predictive features for enhancer identification. MED12 is part of the mediator complex and NIPBL is associated with cohesin complex loading; many components of these complexes have recently been identified to co-occupy enhancers in mouse ES cells . It is somewhat surprising that MED12 and NIPBL stand out among the other mediator and cohesin components included in the model (MED1, SMC1A, SMC3) as peaks of all five are apparent at key enhancers . The mediator complex is recruited by many TFs and acts as a bridge to the RNAPII preinitiation complex [22, 73]. MED1 is part of the core mediator module while MED12 is part of the mediator kinase module [74, 75]. The mediator core, when associated with the kinase module has been implicated as a transcriptional repressor; however, MED12 has been shown to be required for transcriptional activation by specific transcription factors, including NANOG . The Drosophila homologue of NIPBL (Nipped-B) has been shown to support enhancer-promoter communication between distant enhancers [77, 78]. Members of the cohesin complex, SMC1A and SMC3, in addition to being associated with enhancer regions, are also found at CTCF occupied regions while the cohesin loading factor NIPBL is less associated with CTCF . CTCF is predictive of PrL in our model, which would account for SMC1A and SMC3 being less discriminatory of Enh and PrL than NIPBL.
As individual features are associated with enhancers and promoters to various degrees, the modeling approach is important to discriminate their relative contribution to enhancers and promoters. The importance of feature extraction when using chromatin signatures has been demonstrated previously using an artificial neural network for enhancer prediction in human ES cells . In addition, models integrating multiple data sources, including histone modification ChIP-Seq data, have been shown to successfully improve cell type-specific TF binding site prediction [36, 80]. These studies used all histone modification ChIP-Seq data available for either human or mouse ES cells to predict TF binding sites without assessing the predictive value of each feature. Narlikar et al. 2010 employed LASSO regularization to identify heart-specific enhancers by using TF binding specificities from PSSMs as features , our approach in contrast used non-TF chromatin-associated features from ChIP-Seq data to identify enhancers. Other unsupervised approaches have been used to systematically annotate various functional elements in the human genome using chromatin features [82, 83] while our approach focused specifically on enhancer identification.
In using minimal features (8 sequencing data sets and 2 genomic features) and avoiding the use of cell-type specific TF signatures in the modeling we have retained the potential to apply this model to other cell types. The features used (p300, H3K4me1, MED12 and NIPBL for Enh; CpG islands, RNAPII-ser5, CTCF, GC percent and H3K4me3 for PrL) are genomic sequence features, ubiquitously expressed proteins, and histone modifications which we expect mark enhancers and promoters in all cell types. In fact, p300 has been used to identify enhancers in several different cells types. In contrast to methods identifying enhancers using p300, alone or in combination with H3K4me1, in ES cells we found that our enhancer candidates overlapped the active H3K27ac mark but not the repressive H3K27me3 mark [68, 69, 71]. This significant association with H3K27ac, and the absence of the H3K27me3 mark in our enhancer candidates indicates that the enhancers we identified are mainly of the active and intermediate, but not poised type. This conclusion is further supported by the finding that genes associated with Enh in addition to PrL showed the highest expression levels in mouse ES cells. These findings also suggest that the TFs identified in motif analysis which are associated with differentiated cell types may have been the result of similarity in PSSMs between members of the same protein family such as SOX2, SOX4 and SOX11.
Despite the reported association of H3K27ac with active enhancers, introducing H3K27ac into our model showed positive weighting of H3K27ac for PrL rather than Enh, indicating that it is a predictor of promoters rather than enhancers. This is in agreement with higher enrichment overall of H3K27ac regions in PrL compared to Enh; whereas, the previously reported H3K27ac regions [68, 69] contain only the H3K27ac regions distal to genes and are therefore more enriched in the Enh category. This dichotomy of H3K27ac marks is similar to RNAPII-ser5, H3K4me3 and H3K4me2 which have been observed at distal enhancer regions (Figure 6, S 6) [20, 84, 85]; however, these features are more enriched around gene TSSs compared to enhancers. As our model detects the overall trend, features enriched in both PrL and Enh either become a discriminating feature of the signature in which they are most enriched or their weights are reduced to zero with LASSO regularization. It is possible that some of this overlap in marks is due to chromatin looping events which bring distal enhancers into close proximity with gene promoters [22, 86]. This close juxtaposition could then allow for cross linking events between proteins at the promoter and enhancer regions capturing the marks in both locations. In addition chromatin modifying proteins recruited to either the promoter or a distal enhancer could act at both locations in the genome when they are juxtaposed in a chromatin loop.
We have shown that Enh candidates are significantly further from gene TSSs than PrL candidates, and the absolute expression distribution of genes associated with Enh and PrL candidates is significantly higher than that of genes associated with PrL alone, indicating additional activation of gene expression by Enh candidates. These results are consistent with the enhancer/promoter DNA looping model which promotes cell-type specific gene activation [3, 4, 6, 10, 22, 87, 88]. Furthermore, we have also found that putative enhancer regions identified in our work are significantly enriched with MTL and that higher probability Enh are associated with an increased number of bound TFs. This finding reveals the utility of our model in identifying high confidence regions bound by multiple TFs which are more likely associated with gene regulation [5, 24]. Moreover, as only 41% of the top candidates are co-bound by OCT4, SOX2 and NANOG, enhancers identified from our approach are not limited to the training set of co-OSN regions. To further identify novel enhancer-bound TFs that may play major roles in mouse ES cells, we performed motif enrichment analysis on enhancer candidates. Although motif enrichment analysis is limited by the PSSM available, we have successfully identified motif enrichment of several known ES cell-regulating TFs, including: KLF4, SOX2, OCT4, ESRRB, STAT3 and ZIC3. In addition, we identified SP1, previously reported to regulate Oct4 and Nanog gene expression through binding to their proximal promoters [89, 90]. Binding motifs for NRF2 and TEAD1 were also identified as enriched in the Enh regions. Both of these TFs are associated with regulatory roles during early development and are expressed in ES cells [60, 61]. Interestingly, we have also identified enhancer regions closest to genes of almost all of these regulatory TFs. In agreement with this we found that genes associated with Enh candidates are more exclusively enriched in GO terms related to DNA binding and transcriptional regulation, while the PrL genes are associated with DNA binding as well as more basal cellular functions. Together these findings suggest that enhancers tend to locate around genes involved in transcriptional regulation in ES cells, and work coordinately with PrL candidates. The combination of regulation by a distal enhancer and proximal basal promoter perhaps allows gene expression to be fine tuned in a cell-type specific manner.