Enhancer identification in mouse embryonic stem cells using integrative modeling of chromatin and genomic features
© Chen et al.; licensee BioMed Central Ltd. 2012
Received: 30 September 2011
Accepted: 26 April 2012
Published: 26 April 2012
Epigenetic modifications, transcription factor (TF) availability and differences in chromatin folding influence how the genome is interpreted by the transcriptional machinery responsible for gene expression. Enhancers buried in non-coding regions are found to be associated with significant differences in histone marks between different cell types. In contrast, gene promoters show more uniform modifications across cell types. Here we used histone modification and chromatin-associated protein ChIP-Seq data sets in mouse embryonic stem (ES) cells as well as genomic features to identify functional enhancer regions. Using co-bound sites of OCT4, SOX2 and NANOG (co-OSN, validated enhancers) and co-bound sites of MYC and MYCN (limited enhancer activity) as enhancer positive and negative training sets, we performed multinomial logistic regression with LASSO regularization to identify key features.
Cross validations reveal that a combination of p300, H3K4me1, MED12 and NIPBL features to be top signatures of co-OSN regions. Using a model from 10 signatures, 83% of top 1277 putative 1 kb enhancer regions (probability greater than or equal to 0.8) overlapped with at least one TF peak from 7 mouse ES cell ChIP-Seq data sets. These putative enhancers are associated with increased gene expression of neighbouring genes and significantly enriched in multiple TF bound loci in agreement with combinatorial models of TF binding. Furthermore, we identified several motifs of known TFs significantly enriched in putative enhancer regions compared to random promoter regions and background. Comparison with an active H3K27ac mark in various cell types confirmed cell type-specificity of these enhancers.
The top enhancer signatures we identified (p300, H3K4me1, MED12 and NIPBL) will allow for the identification of cell type-specific enhancer regions in diverse cell types.
KeywordsEnhancer Embryonic stem cells Transcription factor ChIP-Seq Histone methylation Regulation of gene expression
Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) has enabled genome-wide investigation of chromatin features and epigenetic modifications within the non-coding regions of mammalian genomes in high resolution . ChIP-Seq provides the opportunity to characterise and begin to understand on a genome-wide scale how genes are regulated in a cell-type specific manner by sequence-specific DNA-binding transcription factors (TFs). However, identifying regulatory regions within the genome and linking these regions to the regulation of specific genes remains a challenge.
Distal regulatory elements have been identified which regulate gene transcription from several kilobases (kb) away and have even been found to regulate genes located on separate chromosomes [2–4]. Functional characterisation of these regulatory elements can be done by identifying bound TFs and investigating whether or not they act as enhancers, increasing transcription of a gene in a position and orientation independent manner. ChIP-Seq analysis for several TFs has revealed a significant fraction (40–60%) of the binding sites for most TFs are located in intergenic regions >10 kb from transcription start sites (TSSs) of annotated genes [5–7]. In addition, enhancer regions are associated with significant epigenetic differences between cell types, while gene promoters show more uniform modifications across different cell types [8, 9]. These findings suggest that enhancers, which can be located at great distances from the genes they regulate, play a larger role in regulating tissue-specific gene expression than the sequences proximal to gene promoters. Moreover, mutations in DNA sequences of distant-acting enhancers contribute to various diseases , further stressing their importance in regulating gene expression.
Prior to the availability of ChIP-Seq and ChIP-chip data, computational approaches based solely on genomic sequences were used to identify enhancer regions. Initially these approaches compared the genomic sequence with TF binding motifs represented by position specific scoring matrices (PSSM) from TRANSFAC  and JASPAR . TF motif clustering and comparative genomics improved the predictive power of these approaches [13–16]. In addition, intergenic regions with high sequence conservation between human and Fugu or ultra-conserved regions between human-mouse-rat (>200 bp of 100% identity) are predictive of regulatory regions involved in conserved processes such as embryonic development [17, 18]. As many enhancer regions regulate the expression of genes in a tissue-specific manner and can be located at great distances from the genes they regulate, the link between cell-type and active enhancers is lacking in purely sequenced based approaches. Using ChIP-Seq approaches several different methods of identifying enhancers have been applied including: enrichment of mono-methylated lysine 4 of histone H3 (H3K4me1) and depletion of tri-methylated lysine 4 of histone H3 (H3K4me3) , binding of the co-activator p300 (also known as EP300) , intergenic RNAPII (RNA polymerase II) phosphorylated at serine 5 on the C-terminal domain (RNAPII-ser5) , multiple transcription factor bound loci (MTL) , and a combination of these features  have been used to identify enhancers within a target cell type. Additional chromatin associated proteins have been identified at enhancers including members of the mediator (MED1, MED12) and cohesin (SMC1A, SMC3, NIPBL) complexes .
These different approaches show variable success at predicting enhancers with in vivo activity. For example, 47% (246/528) of human genomic regions predicted by sequence conservation were confirmed as enhancers in a transgenic mouse assay [17, 18]. The prediction was significantly enhanced to 87% (75/86) when using only p300 high throughput chromatin immunoprecipitation sequencing (ChIP-Seq) binding sites from mouse forebrain, midbrain and limb cells . Heintzman et al. developed a motif-independent model for identifying and distinguishing promoters and enhancers using histone modification profiles . They observed H3K4me1 enrichment and H3K4me3 depletion at p300 binding sites and then used this signature to identify putative enhancers in 5 human cell lines. Their enhancer predictions were supported by DNase I hypersensitivity, binding of p300, or binding of the mediator protein MED1 63.5% of the time. Taking the MTL approach, using overlapping regions of three major pluripotency TFs, OCT4, SOX2, NANOG, Chen et al. 2008 generated enhancer candidates in mouse embryonic stem (ES) cells and tested 25 of these regions for enhancer activity . All 25 regions displayed ES cell specific enhancer activity suggesting the MTL approach is highly predictive of functional enhancer regions. However, enhancer activities depend on the specific TFs occupying the MTL, as all 8 co-MYC associated MTL were found to have little or no enhancer activity . In addition, the MTL approach requires prior knowledge of relevant regulatory TFs and the generation of unique ChIP-Seq data sets for each cell type.
Although previous approaches showed promising performance, a potential issue with non-integrated approaches is that each marker may be an incomplete representation of the relevant enhancers in a particular cell type. For example, although sequence conservation is frequently used to identify regulatory elements, ultra-conservation has been reported to identify only a small subset of developmentally related enhancers, specifically, those involved in development of the nervous system . Furthermore, there is variation in the degree of conservation at enhancers; a large population of validated heart enhancers are less deeply conserved in vertebrate evolution . Although found at enhancer regions in different cell types, p300 is reported to mark only a subset of enhancers in heart . Histone modifications perhaps represent a more widely applicable enhancer signature, though by inspection they appear to mark the genome in a broad manner. The incomplete representation of enhancers by each feature and the broad signatures associated with histone modifications emphasize the need to integrate these features within the same cell type and evaluate their importance for enhancer prediction.
In this study we used 31 public high throughput mouse ES cell RNA-Seq and ChIP-Seq datasets, as well as 5 genomic features to identify the characteristics most predictive of enhancers. We applied multinomial logistic regression with LASSO regularization [25, 26] to identify key enhancer signatures, predict functional enhancers, and subsequently identify motif enrichment in predicted enhancers. Through an initial assessment of enhancer markers, we highlight the importance of feature selection. Furthermore, LASSO regularization ranked the predictability of signatures necessary for enhancer classification with p300 being the most predictive followed by H3K4me1 and MED12. Predicted enhancers showed significant association with MTL indicating functionality and identified previously validated enhancers. In addition, a supervised motif enrichment test on putative enhancer regions using Clover  confirmed our ability to identify known TFs centrally involved in ES cell transcriptional regulatory networks.
Feature selection improves enhancer prediction
Datasets used in the integrative modeling
Histone modifications (H3, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H4K20me3, H3K27me3, H3K9me3) and RNAPII
SMC1A, SMC3, MED12, MED1, NIPBL
CpG islands, GC content, SNP, repeat regions and PhastCons most conserved regions
OCT4, SOX2, NANOG, MYC, MYCN
KLF4, STAT3, SMAD1, E2F1, TFCP2L1, ZFX, ESRRB
In our model we classified 1 kb genomic bins into 3 categories: enhancer positive (Enh), promoter-like (PrL) and unknown. As multiple combinations of features have been used to predict enhancers, we performed Naive Bayes classifications using various combinations of features. Assessment of the model was carried out using four indices, area under curve (AUC), precision, modified precision, and recall (detailed in Methods), using 10 fold cross validation on the training set data (Figure 1B). The classifiers were ranked by the average ranking of the four indices. We used this analysis to make a number of observations about the predictive value of various markers of enhancers within the context of other possible features. First, we noted that the Naive Bayes classifier using all features (rank 12) does not generate the best classification. Such classification bias can be caused by feature redundancy and introduction of noise by non-informative features. Second, although p300 is extensively used to identify enhancers and has the highest modified precision in combination with CpG islands, the recall is low which is likely caused by the incomplete representation of enhancers by p300 (rank 20). Classifiers using previous markers, such as H3K4me1 and H3K4me3 as well as RNAPII-ser5 and RNA, also performed poorly (rank 27 and 32). Third, adding GC content, an informative genomic feature, improves precision and recall (comparing rank 5 to 1). Fourth, the recall of enhancers can be over 80 percent with informative features (rank 2 and 4). Finally, we observed that although sequence conservation has been previously used to predict enhancers, adding most conserved regions into feature combinations can worsen the model prediction (rank 5 to 7; 18 to 21), which is likely due to non-exclusive representation of enhancers in conserved regions.
Ranking enhancer signatures
Classified enhancer and promoter-like candidates
A total of 19200 1 kb regions were predicted to be Enh, 67672 were predicted to be PrL, and 2567872 regions were predicted as unknown using the LASSO regularized model ( Additional file 1: Table S1). Of the 1291 co-OSN training regions 922 were classified as Enh in the model, while 4054 of 4465 co-MYC training regions were classified as PrL. All luciferase-validated enhancer positive regions from Chen et al. 2008  are predicted positive whereas, all luciferase-validated enhancer negative regions are predicted PrL in our model. As Enh candidates are ranked by probability, more stringent thresholds can be applied to gain higher confidence enhancer candidates. Applying a stringent threshold of greater than 0.8 probability, 24 of 25 previously validated enhancer regions were predicted Enh and 7 of 8 previously identified negative regions were predicted PrL. In addition, the log probabilities of putative enhancer candidates are significantly correlated (ρ = 0.44, p = 0.0002) to an independent luciferase assay data set in which 67 regions bound by CHD7 were tested for enhancer activity . A heatmap of the top 50 Enh and PrL candidates demonstrates that the Enh and PrL signature features are separated with hierarchical clustering except for H3K4me2, the lowest ranked feature in the PrL category ( Additional file 2: Figure S1). The probability and modeling approach, in contrast to an overlapping peak approach, allows variations in features within the predicted category.
Enhancer and promoter-like candidates coordinately regulate gene expression
Short intergenic transcripts have been found associated with enhancers in neurons , while the human β-globin distal locus control region (LCR) is associated with long, cell cycle-regulated intergenic transcripts [39, 40]. We investigated the overlap between lincRNAs (large intergenic non-coding RNA) identified in ES cells and our Enh and PrL categories. We found that only 135 out of 2127 lincRNAs (0.7% total Enh) from Guttman et al. 2009  overlapped predicted Enh regions and only 22 (1.7% high probability Enh) overlapped high probability Enh candidates (prob ≥ 0.8). A larger proportion overlapped PrL regions, 501 (3.1% total PrL) and 233 (2.3% high probability PrL) of the 2127 lincRNAs overlapped the total and top (prob ≥ 0.8) PrL candidates, respectively. In addition, the proximity of Enh candidates to protein coding transcripts is significantly greater than the proximity to these lincRNAs (p < 2.2 × 10−16). The enhancer associated transcripts identified by Kim et al. 2010  are generally short, <2 kb in length, and lincRNAs were defined as at least 5 kb in length. These may therefore represent functionally different types of intergenic non-coding transcripts.
In order to probe the functional significance of the genes regulated by Enh or PrL, we investigated functional enrichment in these categories using DAVID [42, 43] followed by clustering of functions with significant numbers of shared genes using Enrichment map . The Enh set is significantly enriched with DNA binding, transcription regulating activities and stem cell development (FDR = 1.2 × 10−06, 0.0018, and 0.03 respectively), and the genes include several TFs involved in ES cell transcriptional regulation: Oct4, Myc, Mycn, Sox2, Esrrb, Phc1 and Zic3 ( Additional file 4: Figure S3). In contrast the PrL candidates are enriched with a wide variety of molecular functions in addition to DNA binding, such as RNA binding and processing, translation, and chromatin organization (all FDR < 1.0 × 10−8) indicating these genes are associated with more basal cellular functions. This indicates that, in addition to promoters, Enh candidates play a significant role in regulating the ES cell transcriptional program to maintain pluripotency.
Enhancer candidates are bound by multiple transcription factors in ES cells
ES cell enhancer candidates
We next looked at how well the model predicts previously identified mouse ES cell enhancer regions associated with mediator and cohesin proteins and shown to form chromatin loops with the nearby gene TSSs . These tissue-specific promoter-interacting enhancer regions upstream of Oct4 (Pou5f1) Nanog Phc1 and Lefty1 are all predicted Enh with probabilities greater than 0.8 ( Additional file 5: Figure S4). Moreover, several significant TF binding peaks overlap with these enhancers. Although the aligned TF peaks at the Oct4 upstream enhancer are located at the boundary of two 1 kb genome bins, our model predicted both sides with high probability (prob = 0.9487 and 0.8363). In addition to the Lefty1 promoter-interacting enhancer, we identified novel contiguous enhancer regions over 3–5 kb upstream of Lefty2 (prob = 0.9107 and 0.8999).
In addition to Enh candidates overlapping co-OSN sites, we have also identified several putative enhancers with limited OSN occupancy including those shown surrounding Tet1/U6 Zic3, and C80913 (Figure 7). We identified seven contiguous enhancers around the U6 small nuclear RNA, upstream of Tet1 which overlap TF bound regions as well as a CHD7 bound region (obtained from ; Figure 7, S 6B). TET1 modifies methylated cytosine (5mC) by hydroxylation, generating 5-hydroxymethylcytosine (5hmC) , potentially the first step in DNA demethylation , and is involved in regulating the lineage potential of ES cells . We identified two adjacent PrL candidates upstream of the annotated mouse gene which overlap the aligned human TET1 gene (Figure 7). The upstream enhancer (prob = 0.9138) of the Zic3 gene showed nine-fold up-regulation of luciferase activity compared to a minimal Oct4 promoter , and the higher probability enhancer downstream (prob = 0.9550), overlapping one E2F1 and one CHD7 peak, has not been investigated (Figure 7, S 6C). Furthermore, the putative enhancer 10 kb upstream of C80913 overlaps only with a CHD7 peak (Figure 7, S 6D).
TF motif enrichment at ES cell enhancer candidates
Although our modeling approach was trained using co-OSN regions, the prediction is not limited to sites bound by these TFs. In fact, of the 1277 regions with enhancer probability ≥0.8, 522 overlap with co-OSN regions, 394 overlap with at least 4 of the other 7 TFs, 136 of which are not co-OSN. Overall 281 regions are not associated with OCT4, SOX2 or NANOG, and 97 of our Enh candidates were not associated with any of the 12 TFs. As a number of our Enh candidates do not overlap a region bound by any of the 12 TFs for which ChIP-Seq data is available in ES cells, we next investigated which TFs may be binding these regions.
Motif enrichment in mouse ES cell putative enhancers
Max Enh prob
5.34 (Rxra) 17.55 (Rxrb) 30.28 (Rara)
To verify binding of transcription factors at identified motifs we used KLF4, ESRRB, and STAT3 ChIP-Seq data from Chen et al. 2008 . We compared the overlap of our high probability Enh regions to TF enriched regions (defined by the top 0.5 percentile bins for KLF4, ESRRB and STAT3). We found 519, 707 and 518 of the 1277 high probability enhancers (prob≥0.8) contained KLF4, ESRRB or STAT3 enriched regions, respectively (p values <2.2 × 10−16) confirming the motif enrichment results for these transcription factors.
Identified enhancers are mainly active and cell type specific
To further investigate the cell type-specific regulatory potential of our identified ES cell Enh candidates, we compared distal H3K27ac and H3K4me1 marks from various cell types to our Enh candidates ( Additional file 8: Figure S7B). We found significantly greater Enh overlap to distal H3K27ac and H3K4me1 in ES cells compared to the differentiated cells indicating a significant proportion of our Enh candidates are specific to ES cells (p < 3 × 10−5). We observed an increase in the overlap to H3K27ac and H3K4me1 in ES cells, and a decrease in the overlap in the differentiated cell types for the high probability Enh set compared to all Enh (Figure 8C compared to Additional file 8: Figure S7B).
We present a systematic approach to assess discriminative features for enhancer identification in mouse ES cells. We initially established the importance of feature selection using a Naive Bayes classifier and subsequently used LASSO regularized multinomial logistic regression to systematically rank the feature weights. In our model we identified 10 key signatures for distinguishing enhancer regions from promoter-like regions and the rest of the genome. The top signatures predictive of enhancers are p300, H3K4me1, MED12, and NIPBL whereas we identified CpG islands, RNAPII-ser5, CTCF, GC percent and H3K4me3 as the top signatures for promoter identification. Our model predicts previously validated enhancers as well as novel enhancers around Oct4, Sox2, Nanog, Phc1, Lefty1, Lefty2, miR290, Tet1, and Zic3, all of which have important regulatory roles in regulating ES cell pluripotency. In summary, our predicted enhancers appear to regulate the expression of ES cell-expressed genes, coordinately with predicted promoters, are significantly associated with MTL compared to both PrL and unknown regions, enriched in motifs for pluripotency associated TFs and marked by tissue-specific chromatin modifications ( Additional file 10: Table S2).
The features used in our model for enhancer identification are p300, H3K4me1, MED12 and NIPBL, ranked in that order. We identified p300 as the top ranked enhancer signature; in addition p300 is a strong negative predictor for the unknown regions of the genome in our model. In ES cells p300 has been identified as a major H3K27 acetyltransferase ; therefore, it is not surprising that we found a high degree of overlap between our Enh candidate regions and H3K27ac. Although p300, when used as the sole Enh feature, ranked highly with respect to precision of enhancer identification the ability to recall enhancers in the training data was poor, in agreement with p300 being an incomplete representation of enhancers (Figure 1, rank 20). The incorporation of CBP, also linked to H3K27ac in ES cells, into the model may allow for improved enhancer identification . H3K4me1, ranked second for enhancer identification, has been used in several other studies to identify enhancer regions, though we find it performs quite poorly when used as the only enhancer feature (Figure 1, rank 27). This is perhaps due to the broader regions associated with H3K4me1 compared to the more discrete peaks of p300 or MED12.
MED12 and NIPBL are the third and fourth most predictive features for enhancer identification. MED12 is part of the mediator complex and NIPBL is associated with cohesin complex loading; many components of these complexes have recently been identified to co-occupy enhancers in mouse ES cells . It is somewhat surprising that MED12 and NIPBL stand out among the other mediator and cohesin components included in the model (MED1, SMC1A, SMC3) as peaks of all five are apparent at key enhancers . The mediator complex is recruited by many TFs and acts as a bridge to the RNAPII preinitiation complex [22, 73]. MED1 is part of the core mediator module while MED12 is part of the mediator kinase module [74, 75]. The mediator core, when associated with the kinase module has been implicated as a transcriptional repressor; however, MED12 has been shown to be required for transcriptional activation by specific transcription factors, including NANOG . The Drosophila homologue of NIPBL (Nipped-B) has been shown to support enhancer-promoter communication between distant enhancers [77, 78]. Members of the cohesin complex, SMC1A and SMC3, in addition to being associated with enhancer regions, are also found at CTCF occupied regions while the cohesin loading factor NIPBL is less associated with CTCF . CTCF is predictive of PrL in our model, which would account for SMC1A and SMC3 being less discriminatory of Enh and PrL than NIPBL.
As individual features are associated with enhancers and promoters to various degrees, the modeling approach is important to discriminate their relative contribution to enhancers and promoters. The importance of feature extraction when using chromatin signatures has been demonstrated previously using an artificial neural network for enhancer prediction in human ES cells . In addition, models integrating multiple data sources, including histone modification ChIP-Seq data, have been shown to successfully improve cell type-specific TF binding site prediction [36, 80]. These studies used all histone modification ChIP-Seq data available for either human or mouse ES cells to predict TF binding sites without assessing the predictive value of each feature. Narlikar et al. 2010 employed LASSO regularization to identify heart-specific enhancers by using TF binding specificities from PSSMs as features , our approach in contrast used non-TF chromatin-associated features from ChIP-Seq data to identify enhancers. Other unsupervised approaches have been used to systematically annotate various functional elements in the human genome using chromatin features [82, 83] while our approach focused specifically on enhancer identification.
In using minimal features (8 sequencing data sets and 2 genomic features) and avoiding the use of cell-type specific TF signatures in the modeling we have retained the potential to apply this model to other cell types. The features used (p300, H3K4me1, MED12 and NIPBL for Enh; CpG islands, RNAPII-ser5, CTCF, GC percent and H3K4me3 for PrL) are genomic sequence features, ubiquitously expressed proteins, and histone modifications which we expect mark enhancers and promoters in all cell types. In fact, p300 has been used to identify enhancers in several different cells types. In contrast to methods identifying enhancers using p300, alone or in combination with H3K4me1, in ES cells we found that our enhancer candidates overlapped the active H3K27ac mark but not the repressive H3K27me3 mark [68, 69, 71]. This significant association with H3K27ac, and the absence of the H3K27me3 mark in our enhancer candidates indicates that the enhancers we identified are mainly of the active and intermediate, but not poised type. This conclusion is further supported by the finding that genes associated with Enh in addition to PrL showed the highest expression levels in mouse ES cells. These findings also suggest that the TFs identified in motif analysis which are associated with differentiated cell types may have been the result of similarity in PSSMs between members of the same protein family such as SOX2, SOX4 and SOX11.
Despite the reported association of H3K27ac with active enhancers, introducing H3K27ac into our model showed positive weighting of H3K27ac for PrL rather than Enh, indicating that it is a predictor of promoters rather than enhancers. This is in agreement with higher enrichment overall of H3K27ac regions in PrL compared to Enh; whereas, the previously reported H3K27ac regions [68, 69] contain only the H3K27ac regions distal to genes and are therefore more enriched in the Enh category. This dichotomy of H3K27ac marks is similar to RNAPII-ser5, H3K4me3 and H3K4me2 which have been observed at distal enhancer regions (Figure 6, S 6) [20, 84, 85]; however, these features are more enriched around gene TSSs compared to enhancers. As our model detects the overall trend, features enriched in both PrL and Enh either become a discriminating feature of the signature in which they are most enriched or their weights are reduced to zero with LASSO regularization. It is possible that some of this overlap in marks is due to chromatin looping events which bring distal enhancers into close proximity with gene promoters [22, 86]. This close juxtaposition could then allow for cross linking events between proteins at the promoter and enhancer regions capturing the marks in both locations. In addition chromatin modifying proteins recruited to either the promoter or a distal enhancer could act at both locations in the genome when they are juxtaposed in a chromatin loop.
We have shown that Enh candidates are significantly further from gene TSSs than PrL candidates, and the absolute expression distribution of genes associated with Enh and PrL candidates is significantly higher than that of genes associated with PrL alone, indicating additional activation of gene expression by Enh candidates. These results are consistent with the enhancer/promoter DNA looping model which promotes cell-type specific gene activation [3, 4, 6, 10, 22, 87, 88]. Furthermore, we have also found that putative enhancer regions identified in our work are significantly enriched with MTL and that higher probability Enh are associated with an increased number of bound TFs. This finding reveals the utility of our model in identifying high confidence regions bound by multiple TFs which are more likely associated with gene regulation [5, 24]. Moreover, as only 41% of the top candidates are co-bound by OCT4, SOX2 and NANOG, enhancers identified from our approach are not limited to the training set of co-OSN regions. To further identify novel enhancer-bound TFs that may play major roles in mouse ES cells, we performed motif enrichment analysis on enhancer candidates. Although motif enrichment analysis is limited by the PSSM available, we have successfully identified motif enrichment of several known ES cell-regulating TFs, including: KLF4, SOX2, OCT4, ESRRB, STAT3 and ZIC3. In addition, we identified SP1, previously reported to regulate Oct4 and Nanog gene expression through binding to their proximal promoters [89, 90]. Binding motifs for NRF2 and TEAD1 were also identified as enriched in the Enh regions. Both of these TFs are associated with regulatory roles during early development and are expressed in ES cells [60, 61]. Interestingly, we have also identified enhancer regions closest to genes of almost all of these regulatory TFs. In agreement with this we found that genes associated with Enh candidates are more exclusively enriched in GO terms related to DNA binding and transcriptional regulation, while the PrL genes are associated with DNA binding as well as more basal cellular functions. Together these findings suggest that enhancers tend to locate around genes involved in transcriptional regulation in ES cells, and work coordinately with PrL candidates. The combination of regulation by a distal enhancer and proximal basal promoter perhaps allows gene expression to be fine tuned in a cell-type specific manner.
We identified a widely applicable set of features to identify regulatory enhancers and promoter regions in a given cell type. Use of these features yields enhancer regions associated with increased gene expression of neighbouring genes and ES cell-specific histone modifications consistent with active enhancers.
Feature and TF datasets
Thirty public domain ChIP-Seq raw data sets in mouse ES cells were obtained from Gene Expression Omnibus (GEO) : 12 TFs , 8 histone modifications , 3 polymerase occupancy  and 7 chromatin associated proteins [5, 22, 28] (Table 1). Although not listed in the table, five ChIP-Seq controls corresponding to the above features were also downloaded. Processed RNA-Seq data from Guttman et al. 2010 was obtained for expression analysis . Other genomic features such as Phastcons most conserved regions [31, 33], CpG islands , GC contents, SNP and repeats were downloaded from UCSC genome annotation database in mouse mm9 build . We used Ensembl transcripts for gene annotation and UCSC Genes for figure illustration.
Reads from ChIP-Seq data were aligned to mouse mm9 assembly using Bowtie alignment  by suppressing alignments to only 1 best reportable alignment with a maximum number of 2 mismatches within 28 nucleotides in the high quality sequencing end. The mouse genome was segmented into 1 kb bins, and target tag count within each bin was normalized by dividing by the control tag count plus 3 (the rounded average median tag count within a bin for all ChIP-Seq data), to reduce the effect of low input count in generating extreme ratios. Format conversion was done using the Vancouver Short Read Analysis Package (http://vancouvershortr.sourceforge.net/). The same procedure was done for all ChIP-Seq data sets to obtain a vector of values for each protein or TF. While ChIP-Seq data files for each TF in the same experimental setting were combined, exact duplicate tags were removed to avoid PCR amplification bias generated in the sequencing library preparation. As the control ChIP-Seq data sets were not uniformly distributed throughout the genome, 180 bins with read counts in top 0.5 percentile for all 5 controls were first excluded from the analysis. The genomic features were subsequently quantified in each bin using counts without normalization, eg. number of CpG islands within each bin. In enhancer candidate plots, significant TF binding peaks (p < 0.001) predicted with SISSRs algorithm (v1.4)  are labeled in order to show binding peaks defined from another source.
Training data sets
Binding regions of each TF used as training data were defined to be regions with number of tags in the top 0.5 percentile. As ChIP-Seq co-bound regions of the pluripotency transcription factors OCT4, SOX2 and NANOG were confirmed to have enhancer activity in 25/25 cases and ChIP-Seq co-bound regions of MYC or MYCN were shown to have very weak or no ES-cell-specific enhancer activity in luciferase assays , these co-bound regions were used as Enh and PrL training sets respectively using TF independent features. More specifically, 1291 co-bound regions of OCT4/SOX2/NANOG (co-OSN) without either of MYC/MYCN binding were taken as the Enh training set; whereas 4465 co-bound regions of MYC and MYCN (co-MYC) without either of OSN were taken as the PrL training set. Due to the promoter-like nature of the co-MYC cluster (Figure 1a), 5000 random regions were drawn from non co-OSN and non co-MYC regions to be the third category called ‘unknown’.
Feature combination assessment using Naive Bayes
Naive Bayes classifiers with various feature combinations were used to classify 1 kb genome bins into the three categories. Due to limited amount of training data, 10-fold cross validation was performed to the training data set by randomly leaving 10% of the data as a validation set. Classifier assessment was carried out using the mean of area under curve (AUC), precision, modified precision, and recall values computed from the validation set. AUC is the area under ROC curve. Precision, normally given by true positive over sum of true positive and false positive, was modified to report the percentage of co-OSN regions out of putative enhancers that were originally Enh or PrL. This was done to avoid penalizing potential Enh candidates in the unknown validation set. Recall, given by true positive over sum of true positive and false negative, measures the percentage of all co-OSN regions predicted to be enhancers. These indices provide different aspects on model assessment. Ranking of the classifiers with different feature combinations were sorted by the average ranking of all indices.
Feature extraction and weighting with LASSO regularized multinomial logistic regression
All the features were first standardized by subtracting the mean and dividing by the standard deviation in order to prevent biased shrinkage of feature weights. Multinomial logistic regression was applied to model sequence features, chromatin features, and associated proteins to predict the genome-wide location of enhancers. Cross entropy was used as the error function for multi-group classification. To assess predictability of features, LASSO regularization was used to introduce extra penalization with a power raised on the weight vector [25, 26]. Using LASSO regularization, feature weights of less significance shrink to 0 as lambda increases and the lambda, exp(−4), is subsequently obtained through comparison of average multinomial deviances from 10-fold cross validation.
Correlations of Enh and PrL sets with absolute gene expression in mouse ES cells
RPKM, the number of reads per kilobase of exon region per million mapped reads derived from RNA-Seq data has been shown to be approximately proportional to the absolute abundance of mRNAs in cell . We obtained formerly computed RPKM values by Ouyang et al.  from a mouse ES cell RNA-Seq data set . The Enh and PrL candidates were assigned to Ensembl genes of the closest TSSs without taking the degree of transcription into account. The distributions of absolute expression (RPKM) were shown in box plots, and one-sided Kolmogorov-Smirnov tests were performed to assess the differences in empirical cumulative distributions of gene sets.
GO functional enrichment analysis and MTL association
Candidates were first assigned to genes through identifying the closest TSSs. 469 Enh candidates (Enh prob ≥ 0.9) and 2239 PrL candidates (PrL prob ≥ 0.99) were used to carry out the analysis. The discrepancy in probability cut-offs between sets was due to the large number of high confidence PrL candidates. DAVID functional annotation website was used to assess functional enrichment on both Enh and PrL candidate sets compared to all genes in the mouse genome [42, 43]. As genes can have more than one nearby candidate within each set, only the unique genes within each set were subjected to GO analysis. The molecular functions and biological processes significantly enriched in each set were subsequently reported (FDR < 0.1). For better visualization of the enrichment functions, Enrichment map plug-in of Cytoscape was used to cluster functions sharing the same genes [44, 98].
To determine if the putative enhancer candidates are enriched in MTL, we separate Enh, PrL and unknown candidates, and all genome bins into 3 groups: 0 TF peak enrichment, enrichment of 1 to 3 TFs, and enrichment of over 4 TFs (MTL) within the bin. The MTL enrichment of Enh is assessed using Chi-squared statistics of the counts of the last two bins in comparison to that of PrL, unknown and genomic bin sets.
Supervised motif analysis
Clover algorithm  was used to screen a set of given DNA sequences against transcription factor weight matrices, and assess whether any motifs are over- or under-represented in the given sequences by comparing to random sequences drawn from chr19 (42.7% C + G) and 5 kb upstream of TSSs (45.7% C + G). For a more stringent motif screening, we also tested the significance of the Enh regions compared to 2222 PrL regions (PrL prob > 0.99; 61% C + G). Motif and sequence shuffling were also used to account for G + C content biases. 467 putative enhancer regions with Enh probability greater than 0.9 (45% C + G) were used in the motif analysis to narrow down the search. We obtained the human and mouse PSSMs from the JASPAR mammalian database  and mouse protein-binding microarray data .
Binding region comparison with other datasets
Significantly enriched regions of H3K4me1, H3K27ac and p300 in ES, adult liver, progenitor B and neural progenitor cells were obtained from , and genomic coordinates were updated to mm9 using the UCSC genome browser liftOver tool . 10479 and 2916 CHD7 binding peaks from medium and high thresholds were obtained from , and the genome coordinates were also updated to mm9 built. Venn diagrams are drawn on the basis of putative enhancers allowing 500 bp gaps between regions.
- ES cell:
Embryonic stem cell
OCT4 SOX2, and NANOG ChIP-Seq co-bound
MYC and MYCN ChIP-Seq co-bound
Position-Specific Scoring Matrix
Multiple transcription factor-bound loci
Transcription start site
RNA polymerase II
This study was supported by the Natural Sciences and Engineering Research Council of Canada (Discovery Grant held by JAM), the Ontario Research Fund GL2 (Group Grant), and a Canada Foundation for Innovation/Ontario Research Fund Leaders Opportunity Fund (infrastructure grant held by JAM). CYC was supported in part by an Ontario Graduate Scholarship.
CYC thanks Dr. Paul C. Boutros for his critical review and helpful insights, and Dr. Yunchen Gong for the introduction to SISSRs algorithm.
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26: 1351-1359.PubMed CentralView ArticlePubMedGoogle Scholar
- Amano T, Sagai T, Tanabe H, Mizushina Y, Nakazawa H, Shiroishi T: Chromosomal dynamics at the Shh locus: limb bud-specific differential regulation of competence and active transcription. Dev Cell. 2009, 16: 47-57.View ArticlePubMedGoogle Scholar
- Lomvardas S, Barnea G, Pisapia DJ, Mendelsohn M, Kirkland J, Axel R: Interchromosomal interactions and olfactory receptor choice. Cell. 2006, 126: 403-413.View ArticlePubMedGoogle Scholar
- Tuan DY, Solomon WB, London IM, Lee DP: An erythroid-specific, developmental-stage-independent enhancer far upstream of the human “beta-like globin” genes. Proc Natl Acad Sci U S A. 1989, 86: 2554-2558.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133: 1106-1117.View ArticlePubMedGoogle Scholar
- Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, Orlov YL, Velkov S, Ho A, Mei PH: An oestrogen-receptor-alpha-bound human chromatin interactome. Nature. 2009, 462: 58-64.PubMed CentralView ArticlePubMedGoogle Scholar
- Yu M, Riva L, Xie H, Schindler Y, Moran TB, Cheng Y, Yu D, Hardison R, Weiss MJ, Orkin SH: Insights into GATA-1-mediated gene activation versus repression via genome-wide chromatin occupancy analysis. Mol Cell. 2009, 36: 682-695.PubMed CentralView ArticlePubMedGoogle Scholar
- Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW: Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009, 459: 108-112.PubMed CentralView ArticlePubMedGoogle Scholar
- Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F: ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009, 457: 854-858.PubMed CentralView ArticlePubMedGoogle Scholar
- Visel A, Rubin EM, Pennacchio LA: Genomic views of distant-acting enhancers. Nature. 2009, 461: 199-205.PubMed CentralView ArticlePubMedGoogle Scholar
- Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 1996, 24: 238-241.PubMed CentralView ArticlePubMedGoogle Scholar
- Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A: JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010, 38: D105-D110.PubMed CentralView ArticlePubMedGoogle Scholar
- Aerts S, van Helden J, Sand O, Hassan BA: Fine-tuning enhancer models to predict transcriptional targets across multiple genomes. PLoS One. 2007, 2: e1115-PubMed CentralView ArticlePubMedGoogle Scholar
- Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE: Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol. 2004, 5: R61-PubMed CentralView ArticlePubMedGoogle Scholar
- Hallikas O, Palin K, Sinjushina N, Rautiainen R, Partanen J, Ukkonen E, Taipale J: Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity. Cell. 2006, 124: 47-59.View ArticlePubMedGoogle Scholar
- Palin K, Taipale J, Ukkonen E: Locating potential enhancer elements by comparative genomics using the EEL software. Nat Protoc. 2006, 1: 368-374.View ArticlePubMedGoogle Scholar
- Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD: In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006, 444: 499-502.View ArticlePubMedGoogle Scholar
- Visel A, Prabhakar S, Akiyama JA, Shoukry M, Lewis KD, Holt A, Plajzer-Frick I, Afzal V, Rubin EM, Pennacchio LA: Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat Genet. 2008, 40: 158-160.PubMed CentralView ArticlePubMedGoogle Scholar
- Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007, 39: 311-318.View ArticlePubMedGoogle Scholar
- De Santa F, Barozzi I, Mietton F, Ghisletti S, Polletti S, Tusi BK, Muller H, Ragoussis J, Wei CL, Natoli G: A large fraction of extragenic RNA pol II transcription sites overlap enhancers. PLoS Biol. 2010, 8: e1000384-PubMed CentralView ArticlePubMedGoogle Scholar
- Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S: Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010, 465: 182-187.PubMed CentralView ArticlePubMedGoogle Scholar
- Kagey MH, Newman JJ, Bilodeau S, Zhan Y, Orlando DA, van Berkum NL, Ebmeier CC, Goossens J, Rahl PB, Levine SS: Mediator and cohesin connect gene expression and chromatin architecture. Nature. 2010, 467: 430-435.PubMed CentralView ArticlePubMedGoogle Scholar
- Blow MJ, McCulley DJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F: ChIP-Seq identification of weakly conserved heart enhancers. Nat Genet. 2010, 42: 806-810.PubMed CentralView ArticlePubMedGoogle Scholar
- He A, Kong SW, Ma Q, Pu WT: Co-occupancy by multiple cardiac transcription factors identifies transcriptional enhancers active in heart. Proc Natl Acad Sci U S A. 2011, 108: 5632-5637.PubMed CentralView ArticlePubMedGoogle Scholar
- Friedman J, Hastie T, Tibshirani R: Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010, 33: 1-22.PubMed CentralView ArticlePubMedGoogle Scholar
- Tibshirani R: Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met. 1996, 58: 267-288.Google Scholar
- Frith MC, Fu Y, Yu L, Chen JF, Hansen U, Weng Z: Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res. 2004, 32: 1372-1381.PubMed CentralView ArticlePubMedGoogle Scholar
- Goren A, Ozsolak F, Shoresh N, Ku M, Adli M, Hart C, Gymrek M, Zuk O, Regev A, Milos PM, Bernstein BE: Chromatin profiling by directly sequencing small quantities of immunoprecipitated DNA. Nat Methods. 2010, 7: 47-49.PubMed CentralView ArticlePubMedGoogle Scholar
- Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007, 448: 553-560.PubMed CentralView ArticlePubMedGoogle Scholar
- Rahl PB, Lin CY, Seila AC, Flynn RA, McCuine S, Burge CB, Sharp PA, Young RA: c-Myc regulates transcriptional pause release. Cell. 2010, 141: 432-445.PubMed CentralView ArticlePubMedGoogle Scholar
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005, 15: 1034-1050.PubMed CentralView ArticlePubMedGoogle Scholar
- Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010, 28: 503-510.PubMed CentralView ArticlePubMedGoogle Scholar
- Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D: 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 2007, 17: 1797-1808.PubMed CentralView ArticlePubMedGoogle Scholar
- Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J Mol Biol. 1987, 196: 261-282.View ArticlePubMedGoogle Scholar
- Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A: The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2010, 39: D876-882.PubMed CentralView ArticlePubMedGoogle Scholar
- Won KJ, Ren B, Wang W: Genome-wide prediction of transcription factor binding sites using an integrated model. Genome Biol. 2010, 11: R7-PubMed CentralView ArticlePubMedGoogle Scholar
- Schnetz MP, Handoko L, Akhtar-Zaidi B, Bartels CF, Pereira CF, Fisher AG, Adams DJ, Flicek P, Crawford GE, Laframboise T: CHD7 targets active gene enhancer elements to modulate ES cell-specific gene expression. PLoS Genet. 2010, 6: e1001023-PubMed CentralView ArticlePubMedGoogle Scholar
- Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S: Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010, 465: 182-187.PubMed CentralView ArticlePubMedGoogle Scholar
- Miles J, Mitchell JA, Chakalova L, Goyenechea B, Osborne CS, O’Neill L, Tanimoto K, Engel JD, Fraser P: Intergenic transcription, cell-cycle and the developmentally regulated epigenetic profile of the human beta-globin locus. PLoS One. 2007, 2: e630-PubMed CentralView ArticlePubMedGoogle Scholar
- Gribnau J, Diderich K, Pruzina S, Calzolari R, Fraser P: Intergenic transcription and developmental remodeling of chromatin subdomains in the human beta-globin locus. Mol Cell. 2000, 5: 377-386.View ArticlePubMedGoogle Scholar
- Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP: Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009, 458: 223-227.PubMed CentralView ArticlePubMedGoogle Scholar
- da Huang W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 1-13.View ArticlePubMedGoogle Scholar
- da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4: 44-57.View ArticlePubMedGoogle Scholar
- Merico D, Isserlin R, Stueker O, Emili A, Bader GD: Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS One. 2010, 5: e13984-PubMed CentralView ArticlePubMedGoogle Scholar
- Inoue M, Kamachi Y, Matsunami H, Imada K, Uchikawa M, Kondoh H: PAX6 and SOX2-dependent regulation of the Sox2 enhancer N-3 involved in embryonic visual system development. Genes Cells. 2007, 12: 1049-1061.View ArticlePubMedGoogle Scholar
- Takemoto T, Uchikawa M, Yoshida M, Bell DM, Lovell-Badge R, Papaioannou VE, Kondoh H: Tbx6-dependent Sox2 regulation determines neural or mesodermal fate in axial stem cells. Nature. 2011, 470: 394-398.PubMed CentralView ArticlePubMedGoogle Scholar
- Catena R, Tiveron C, Ronchi A, Porta S, Ferri A, Tatangelo L, Cavallaro M, Favaro R, Ottolenghi S, Reinbold R: Conserved POU binding DNA sites in the Sox2 upstream enhancer regulate gene expression in embryonic and neural stem cells. J Biol Chem. 2004, 279: 41846-41857.View ArticlePubMedGoogle Scholar
- Sikorska M, Sandhu JK, Deb-Rinker P, Jezierski A, Leblanc J, Charlebois C, Ribecco-Lutkiewicz M, Bani-Yaghoub M, Walker PR: Epigenetic modifications of SOX2 enhancers, SRR1 and SRR2, correlate with in vitro neural differentiation. J Neurosci Res. 2008, 86: 1680-1693.View ArticlePubMedGoogle Scholar
- Uchikawa M, Ishida Y, Takemoto T, Kamachi Y, Kondoh H: Functional analysis of chicken Sox2 enhancers highlights an array of diverse regulatory elements that are conserved in mammals. Dev Cell. 2003, 4: 509-519.View ArticlePubMedGoogle Scholar
- Tomioka M, Nishimoto M, Miyagi S, Katayanagi T, Fukui N, Niwa H, Muramatsu M, Okuda A: Identification of Sox-2 regulatory region which is under the control of Oct-3/4-Sox-2 complex. Nucleic Acids Res. 2002, 30: 3202-3213.PubMed CentralView ArticlePubMedGoogle Scholar
- Lichner Z, Pall E, Kerekes A, Pallinger E, Maraghechi P, Bosze Z, Gocza E: The miR-290-295 cluster promotes pluripotency maintenance by regulating cell cycle phase distribution in mouse embryonic stem cells. Differentiation. 2011, 81: 11-24.View ArticlePubMedGoogle Scholar
- Zovoilis A, Smorag L, Pantazi A, Engel W: Members of the miR-290 cluster modulate in vitro differentiation of mouse embryonic stem cells. Differentiation. 2009, 78: 69-78.View ArticlePubMedGoogle Scholar
- Tahiliani M, Koh KP, Shen Y, Pastor WA, Bandukwala H, Brudno Y, Agarwal S, Iyer LM, Liu DR, Aravind L, Rao A: Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science. 2009, 324: 930-935.PubMed CentralView ArticlePubMedGoogle Scholar
- Ficz G, Branco MR, Seisenberger S, Santos F, Krueger F, Hore TA, Marques CJ, Andrews S, Reik W: Dynamic regulation of 5-hydroxymethylcytosine in mouse ES cells and during differentiation. Nature. 2011, 473: 398-402.View ArticlePubMedGoogle Scholar
- Koh KP, Yabuuchi A, Rao S, Huang Y, Cunniff K, Nardone J, Laiho A, Tahiliani M, Sommer CA, Mostoslavsky G: Tet1 and Tet2 regulate 5-hydroxymethylcytosine production and cell lineage specification in mouse embryonic stem cells. Cell Stem Cell. 2011, 8: 200-213.PubMed CentralView ArticlePubMedGoogle Scholar
- Lim LS, Loh YH, Zhang W, Li Y, Chen X, Wang Y, Bakre M, Ng HH, Stanton LW: Zic3 is required for maintenance of pluripotency in embryonic stem cells. Mol Biol Cell. 2007, 18: 1348-1358.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu J, Watts JA, Pope SD, Gadue P, Kamps M, Plath K, Zaret KS, Smale ST: Transcriptional competence and the active marking of tissue-specific enhancers by defined transcription factors in embryonic and induced pluripotent stem cells. Genes Dev. 2009, 23: 2824-2838.PubMed CentralView ArticlePubMedGoogle Scholar
- Marin M, Karis A, Visser P, Grosveld F, Philipsen S: Transcription factor Sp1 is essential for early embryonic development but dispensable for cell growth and differentiation. Cell. 1997, 89: 619-628.View ArticlePubMedGoogle Scholar
- Lim LS, Hong FH, Kunarso G, Stanton LW: The pluripotency regulator Zic3 is a direct activator of the Nanog promoter in ESCs. Stem Cells. 2010, 28: 1961-1969.View ArticlePubMedGoogle Scholar
- Leung L, Kwong M, Hou S, Lee C, Chan JY: Deficiency of the Nrf1 and Nrf2 transcription factors results in early embryonic lethality and severe oxidative stress. J Biol Chem. 2003, 278: 48021-48029.View ArticlePubMedGoogle Scholar
- Melin F, Miranda M, Montreau N, DePamphilis ML, Blangy D: Transcription enhancer factor-1 (TEF-1) DNA binding sites can specifically enhance gene expression at the beginning of mouse development. EMBO J. 1993, 12: 4657-4666.PubMed CentralPubMedGoogle Scholar
- Delacroix L, Moutier E, Altobelli G, Legras S, Poch O, Choukrallah MA, Bertin I, Jost B, Davidson I: Cell-specific interaction of retinoic acid receptors with target genes in mouse embryonic fibroblasts and embryonic stem cells. Mol Cell Biol. 30: 231-244.
- Bergsland M, Werme M, Malewicz M, Perlmann T, Muhr J: The establishment of neuronal properties is controlled by Sox4 and Sox11. Genes Dev. 2006, 20: 3475-3486.PubMed CentralView ArticlePubMedGoogle Scholar
- Aruga J, Minowa O, Yaginuma H, Kuno J, Nagai T, Noda T, Mikoshiba K: Mouse Zic1 is involved in cerebellar development. J Neurosci. 1998, 18: 284-293.PubMedGoogle Scholar
- Aruga J: The role of Zic genes in neural development. Mol Cell Neurosci. 2004, 26: 205-221.View ArticlePubMedGoogle Scholar
- Bensinger SJ, Tontonoz P: A Nurr1 pathway for neuroprotection. Cell. 2009, 137: 26-28.View ArticlePubMedGoogle Scholar
- Cui K, Zang C, Roh TY, Schones DE, Childs RW, Peng W, Zhao K: Chromatin signatures in multipotent human hematopoietic stem cells indicate the fate of bivalent genes during differentiation. Cell Stem Cell. 2009, 4: 80-93.PubMed CentralView ArticlePubMedGoogle Scholar
- Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J: A unique chromatin signature uncovers early developmental enhancers in humans. Nature. 2011, 470: 279-283.PubMed CentralView ArticlePubMedGoogle Scholar
- Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA: From the Cover: Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci U S A. 2010, 107: 21931-21936.PubMed CentralView ArticlePubMedGoogle Scholar
- Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X: Diversity and complexity in DNA recognition by transcription factors. Science. 2009, 324: 1720-1723.PubMed CentralView ArticlePubMedGoogle Scholar
- Zentner GE: Tesar PJ. 2011, Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome Res, Scacheri PCGoogle Scholar
- Pasini D, Malatesta M, Jung HR, Walfridsson J, Willer A, Olsson L, Skotte J, Wutz A, Porse B, Jensen ON, Helin K: Characterization of an antagonistic switch between histone H3 lysine 27 methylation and acetylation in the transcriptional regulation of Polycomb group target genes. Nucleic Acids Res. 2010, 38: 4958-4969.PubMed CentralView ArticlePubMedGoogle Scholar
- Conaway RC, Conaway JW: Function and regulation of the Mediator complex. Curr Opin Genet Dev. 2011, 21: 225-230.PubMed CentralView ArticlePubMedGoogle Scholar
- Boube M, Joulia L, Cribbs DL, Bourbon HM: Evidence for a mediator of RNA polymerase II transcriptional regulation conserved from yeast to man. Cell. 2002, 110: 143-151.View ArticlePubMedGoogle Scholar
- Knuesel MT, Meyer KD, Bernecky C, Taatjes DJ: The human CDK8 subcomplex is a molecular switch that controls Mediator coactivator function. Genes Dev. 2009, 23: 439-451.PubMed CentralView ArticlePubMedGoogle Scholar
- Tutter AV, Kowalski MP, Baltus GA, Iourgenko V, Labow M, Li E, Kadam S: Role for Med12 in regulation of Nanog and Nanog target genes. J Biol Chem. 2009, 284: 3709-3718.View ArticlePubMedGoogle Scholar
- Rollins RA, Korom M, Aulner N, Martens A, Dorsett D: Drosophila nipped-B protein supports sister chromatid cohesion and opposes the stromalin/Scc3 cohesion factor to facilitate long-range activation of the cut gene. Mol Cell Biol. 2004, 24: 3100-3111.PubMed CentralView ArticlePubMedGoogle Scholar
- Rollins RA, Morcillo P, Dorsett D: Nipped-B, a Drosophila homologue of chromosomal adherins, participates in activation by remote enhancers in the cut and Ultrabithorax genes. Genetics. 1999, 152: 577-593.PubMed CentralPubMedGoogle Scholar
- Firpi HA, Ucar D, Tan K: Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics. 2010, 26: 1579-1586.PubMed CentralView ArticlePubMedGoogle Scholar
- Ernst J, Plasterer HL, Simon I, Bar-Joseph Z: Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 2010, 20: 526-536.PubMed CentralView ArticlePubMedGoogle Scholar
- Narlikar L, Sakabe NJ, Blanski AA, Arimura FE, Westlund JM, Nobrega MA, Ovcharenko I: Genome-wide discovery of human heart enhancers. Genome Res. 2010, 20: 381-392.PubMed CentralView ArticlePubMedGoogle Scholar
- Ernst J, Kellis M: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol. 2010, 28: 817-825.PubMed CentralView ArticlePubMedGoogle Scholar
- Hon G, Ren B, Wang W: ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 2008, 4: e1000201-PubMed CentralView ArticlePubMedGoogle Scholar
- Pekowska A, Benoukraf T, Zacarias-Cabeza J, Belhocine M, Koch F, Holota H, Imbert J, Andrau JC, Ferrier P, Spicuglia S: H3K4 tri-methylation provides an epigenetic signature of active enhancers. EMBO J. 2011, 30: 4198-4210.PubMed CentralView ArticlePubMedGoogle Scholar
- Pekowska A, Benoukraf T, Ferrier P, Spicuglia S: A unique H3K4me2 profile marks tissue-specific gene regulation. Genome Res. 2010, 20: 1493-1502.PubMed CentralView ArticlePubMedGoogle Scholar
- Ong CT, Corces VG: Enhancer function: new insights into the regulation of tissue-specific gene expression. Nat Rev Genet. 2011, 12: 283-293.PubMed CentralView ArticlePubMedGoogle Scholar
- Carter D, Chakalova L, Osborne CS, Dai YF, Fraser P: Long-range chromatin regulatory interactions in vivo. Nat Genet. 2002, 32: 623-626.View ArticlePubMedGoogle Scholar
- Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W: Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell. 2002, 10: 1453-1465.View ArticlePubMedGoogle Scholar
- Pesce M, Marin Gomez M, Philipsen S, Scholer HR: Binding of Sp1 and Sp3 transcription factors to the Oct-4 gene promoter. Cell Mol Biol (Noisy-le-grand). 1999, 45: 709-716.Google Scholar
- Wu DY, Yao Z: Functional analysis of two Sp1/Sp3 binding sites in murine Nanog gene promoter. Cell Res. 2006, 16: 319-322.View ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-PubMed CentralView ArticlePubMedGoogle Scholar
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM: NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res. 2010, 39: D1005-D1010.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-PubMed CentralView ArticlePubMedGoogle Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36: 5221-5231.PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628.View ArticlePubMedGoogle Scholar
- Ouyang Z, Zhou Q, Wong WH: ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci U S A. 2009, 106: 21521-21526.PubMed CentralView ArticlePubMedGoogle Scholar
- Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5: 613-619.View ArticlePubMedGoogle Scholar
- Kohl M, Wiese S, Warscheid B: Cytoscape: software for visualization and analysis of biological networks. Methods Mol Biol. 2011, 696: 291-303.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.