Genomic distribution of chromatin accessibility, pol II and transcription
Chromatin accessibility, transcription and the distribution of RNA Pol II using ChIP-seq were used for the identification of candidate developmental enhancers in 20 h sea urchin embryos (Fig. 1A), and the best performing predictors, ATAC- and PRO-seq, see later, were also characterized in 12 h embryos. The genomic profile of 3′ end transcripts identified by PRO-seq was analyzed with dREG (Fig. 1A), a support vector regression tool trained to identify TREs associated with active chromatin marks using the shape of transcription [15]. In 12 and 20 h embryos, dREG identified 43,912 and 56,753 TRE predictions or “peaks”, respectively, while a total of 238,838 and 258,515 ATAC-seq peaks, respectively, were called by MACS (QC reports in supplementary information). In 20 h embryos, 554,846 Pol II ChIP-seq peaks were called.
The dynamic range of the read distribution at peak calls expands several orders of magnitude for the three functional genomic data types (Fig. 2B and D). The Pearson correlation of total PRO-seq reads at dREG peak calls of biological replicates is higher for 12 h embryos (R = 0.88, p-value < 2.2 × 10− 16, Spearman = 0.71, Fig. 2A) than for 20 h embryos (R = 0.35, p-value < 2.2 × 10− 16, Spearman = 0.75, Fig. 2C). Similarly, higher correlation for ATAC-seq profiles at MACS peak calls is found for 12 h embryos (R = 0.91, p-value < 2.2 × 10− 16, Spearman = 0.79, Fig. 2A) relative to 20 h embryos (R = 0.62, p-value < 2.2 × 10− 16, Spearman = 0.46, Fig. 2C). The low correlation of PRO-seq biological replicates may be due in part to inherent batch heterogeneity associated with seasonal and genetic variability in the wild populations from which the embryos where obtained. This natural variation may shift the relative timing of major transcriptional regulatory changes during and prior to the 20 h embryo stage, as previously reported [36, 37], along with the associated histone modification and chromatin accessibility signals (Fig. 2).
Interestingly, there is less variability among Pol II ChIP-seq biological replicates (R = 0.96 to 0.94, with p-values < 2.2 × 10− 16, Fig. 2C). Nevertheless, the biological or technical source of the variation among the different marks could not be resolved in this study, because different embryo batches, sometimes from different seasons, were used. In addition, for the 20 h ATAC-seq data, a distinct nuclear extraction protocol for one of the replicates [25] may have contributed to some technical variability. Given these caveats no quantitative genome-wide comparison of the different genomic profiles between stages is performed, but only primarily qualitative comparisons among stages based in peak calls (Fig. 2F and Fig. S1 F). However, despite the higher variability of the PRO-seq and ATAC-seq 20 h data sets, generally similar signal profiles are seen among biological replicates in both developmental stages, as illustrated at the H2A.Z locus [25] (Fig. 1B). Similar reproducibility trends are observed at promoters and CRMs (Fig. S1 A-D), with much higher replicate correlations for the subset of CRMs that are the primary target of this study (CRMs hereafter) (Fig. S1 B and D).
Genome-wide, most dREG peaks overlap Pol II peaks (Fig. 2E), as expected. However, because much of Pol II is found in the body of transcribed genes, the majority of pol II peak calls did not overlap with dREG predictions. About 40% of ATAC-seq peaks do not overlap Pol II peaks, and about 90% of ATAC-seq peaks do not overlap dREG predictions, revealing that a substantial fraction of chromatin-accessible regions do not associate with RNA Pol II or transcription initiation detected using dREG, which depends on local transcription initiation profiles. Similar overlapping trends are observed in the CRMs target of this study, with a much larger fraction of ATAC peaks overlapping Pol II peaks, dREG predictions and both (Fig. S1 E), possibly in association with a transcriptional regulatory enrichment in the CRM data set, which is strongly biased for evolutionary sequence conservation [33].
The distinct peak numbers and particular overlaps among the three genomic assays anticipate distinct contributions and/or the requirement of combinatorial analysis for the prediction of distal TREs. Globally, about 42 and 50% of the 12 and 20 h dREG peaks are stage specific, respectively (Fig. 2F), while 36 and 38% of the 12 and 20 h ATAC peaks are stage specific, respectively (Fig. 2F) . However, for peaks overlapping CRMs (Fig. S1 F), the majority of ATAC and dREG peaks present in the 12 h stage remain in the 20 h stage, but a much larger proportion of dREG peaks than ATAC peaks are 20 h specific, 60% versus 30%. This reveals that during the 12 to 20 h transition there is a general increase of transcription initiation and pause at developmental TREs while accessibility, estimated by peak calls, is more constant. This suggests that increased accessibility of developmental enhancers generally precedes transcriptional output of developmental TREs.
Validation and evaluation of functional genomic marks for the identification of developmental TREs
We manually contrasted our functional genomic data sets with previous experimental cis-regulatory analyses in order to explore how they could facilitate the identification of active developmental TREs. TRE necessity for the control of endomesoderm transcription factor SpHox11/13b developmental expression has been characterized by deletion from BAC reporters, and TRE sufficiency by plasmid reporter constructs testing an overlapping array of genomic regions that scan the entire locus [24]. ATAC-seq peaks underscore regulatory element ME in 12 and 20 h embryos but only dREG highlights ME in 20 h embryos (Fig. 3), which corresponds to the stage of higher ME reporter activity [24].
This pattern of 20 h specific dREG activity follows the general ATAC and dREG stage prevalence trend in the proximity of regulatory elements, a generally constant number of ATAC peaks and increased number of dREG peaks during the later blastula stage (Fig. S1 F). Module ME has been demonstrated to be both necessary by deletion in BAC reporters and sufficient in plasmid reporters [24] to drive the embryonic SpHox11/13b expression profile, which is spatially dynamic and increases during the 12 to 20 h transition [38]. Like many other early embryo TREs, ME has distinct enhancer and silencer functions in different embryonic territories. In 12 and 20 h embryos, module ME responds to spatially restricted vegetal wnt signaling by enhancing transcription in the endomesoderm and endoderm territories, and by silencing transcription in the ectoderm and the mesoderm, where the wnt pathway remains and becomes inactive, respectively [24]. Therefore, the whole embryo genomic profiles derive from both enhancer and silencer activities from different territories.
DNA binding sites for transcription factor TCF are required for the wnt signal dependent enhancer and silencer functions of element ME. There is an increase in the transcriptional pause at the 20 h embryo SpHox11/13b promoter relative to the 12 h stage (Fig. 3), which could correspond with its ME transcriptional silencing in the ectoderm and mesoderm [24]. Interestingly, the cofactor of TCF, groucho, implements silencing by pause in Drosophila embryos [39], which may represent an evolutionarily conserved function in sea urchins. Module D, in isolation, drives unrestricted reporter expression in 15 and 18 h embryos that can be dominantly silenced by module ME when placed in the same reporter construct [24]. Module D is inactive in 6 and 21 h embryos, leaving uncertain its activity in 12 h embryos. There are ATAC-seq peak calls with relatively low signal within Module D in both stages, but no dREG peaks (Fig. 3). Thus, module D lacks silencing functions and dREG peak calls.
Module D was not deleted in BACs and therefore its endogenous function remains uncertain [24]. Finally, element L, which drives reporter activity at later stages, was undetectable with reporter constructs in 15 and 24 h embryos [24], dREG marks regulatory element L in 20 h embryos but not in 12 h embryos, while ATAC detects this regulatory element in both stages. In this case, the dREG peak calls and associated pausing in module L of 20 h embryo may correspond with its priming for subsequent activation during later embryonic and larval stages.
Similar unbiased tiling array reporter scan was performed for the regulatory elements controlling the expression of transcription factor onecut [40]. During pregastrular stages, onecut zygotic transcript levels reach a minimum soon after the 12 h stage and peak at 20 h, around the time when its restricted oral-aboral boundary expression is stablished [40]. Analogous to SpHox11/13b module ME, distal regulatory module Intron-D of onecut integrates enhancer and silencer functions that are necessary and sufficient to recapitulate the expression of this transcription factor [40]. Likewise, ATAC-seq peak calls underscore Intron-D in both stages, but only 20 h dREG peak calls highlight Intron-D, coincident with an augment of pause at the onecut promoter (Fig. S2). Thus, both in SpHox11/13b ME and onecut Intron-D PRO-seq signals may correspond to a blend of transcriptional activation and silencing functions in different embryo regions. Only ATAC-seq peaks intersect onecut Intron-C module, which is inactive in 20 h embryos (Fig. S2). Thus, similar to module L of SpHox11/13b, the ATAC-seq peak call does not correspond with enhancer activity.
There are far more ATAC-seq peak calls than dREG predictions at both loci (Fig. 3 and Fig. S2), along with the general genomic trend (Fig. 2E). Both loci were scanned by a comprehensive reporter tiling scheme to test the entire regions for enhancer activity in an unbiased manner [24, 40]. Remarkably, dREG TRE predictions correspond closely with regulatory elements experimentally mapped to their minimum range (Fig. 3 and Fig. S2). However, most dREG predictions do not match CRM enhancer reporter activity (Fig. 3 and Fig. S2), and even a higher proportion of ATAC peaks do not correspond with enhancer-active CRMs. The distribution of Pol II ChIP peaks is even broader, particularly at introns, which are prone to contain TREs (Fig. 3 and Fig. S2), and, therefore, Pol II ChIP signal seems poorly suited for TRE predictions on its own. Manual analysis of other experimentally characterized transcriptional regulatory elements [41,42,43] generally confirms the trends outlined above (Fig. S2). Additional regions of the genome can be analyzed with the data sets deposited at the NCBI Gene Expression Omnibus [44], which include ATAC-, Pol II ChIP- and PRO-seq genomic profiles along with their associated peak calls. In summary, our manual analysis suggest that accessibility may have a looser correspondence with enhancer activity. In addition, although increased pause does not necessarily correspond with silencing, as it may be associated with increased release and elongation, the dual report of transcriptional elongation and pause by PRO-seq may correspond to enhancer and silencer activities differentially deployed in space, which is very common among early embryo regional specification TREs [4].
Prediction of enhancer activity from chromatin accessibility and pol II
We decided to systematically test if machine learning models using chromatin accessibility, Pol II distribution and transcription initiation could predict the previously quantified enhancer activity of 389 CRMs (Supplementary Table 1) primarily selected for their evolutionarily sequence conservation [33]. We first tested a subset of reporters quantified at high temporal resolution using nanoString technology [32], but the very small number of inactive reporters was unsuitable for model training (results not shown). We therefore settled for a previous data set that measured reporter enhancer activity by qPCR and included 12 and 24 h time points [33]. Although this generated a mismatch with the 20 h stage examined in our genomic data, no major regulatory transitions have been identified for the majority of the genes involved during this 4 h period [45]. The average size of the 389 CRMs (2839 bp) is about one order of magnitude larger than the average of ATAC peaks (316 bp) Pol II peaks (338 bp), and dREG TRE predictions (373 bp) (Fig. 4A). Thus, in order to reduce confounding inputs to the CRMS that may not relate to TRE function, such as background transcription at introns, or background accessibility along the CRM span, the ATAC-, Pol II ChIP- and PRO-seq signals were computed at CRM regions overlapping peak calls and dREG predictions. CRMs were defined as active if they drove reporter expression twice above the basal promoter (Fig. 4B and C). CRM activity cannot be explained by CRM size because there is no significant size difference between active and inactive enhancers (Fig. 4A, inset, Wilconox p-value = 0.11). However, active CRMs in 12 and 24 h embryos have significatively higher PRO-, ATAC-, and Pol II ChIP-seq signals (Fig. 4D, p-values < 1.8 e-06).
Logistic regression classifiers trained and tested by 5 fold cross-validation repeated 200 times resulted in predictions significatively above random guess in both embryo stages (Fig. 4E and F). The performance of models using ATAC, dREG (PRO-seq reads at TRE dREG predictions), their combination (ATAC + dREG), and Pol II ChIP was slightly higher for ATAC + dREG models when evaluated by the Area Under the Receiver Operating Characteristic (AUROC) plot (Fig. 4E and F), which graphs the relation between true and false positive rates at different model prediction thresholds. However, the CRM expression data set is highly unbalanced, with about 10 times more CRMs reporting inactive than active enhancer activity (Fig. 4B and C), and, in these cases, Precision-Recall Curves (PRCs), which plot precision values along the range of true positive rates, provide a better discrimination metric for classifier evaluation [46].
When AUPRCs are used for model evaluation, more distinct model performances are obtained, particularly for the 20 h data sets (Fig. 4F, bottom). Individually, all assays perform much better than chance in both stages (Fig. 4E and F). The combination of ATAC and dREG predictors barely improves performance at some recall values (Fig. 4E and F, bottom), and similarly, Pol II ChIP signal does not facilitate better enhancer activity predictions alone (Fig. 4F) or in combination with other data sets (results not shown). The limited model improvement with combined data sets is perhaps expected given their substantial correlation, with Pearson’s correlation coefficients ranging from 0.6 to 0.8 between predictors of the same stage. Incorporation of other parameters such as peak summit value, did not improve any predictive models, as illustrated for dREG (Fig. 4H). As expected, lower model performance was also obtained when the functional genomic data were computed along the entire CRM instead of restricting the signal input to peak call windows (not shown). Optimization of other machine learning methods, such as random forest and support vector machine, did not improve classifier performance over logistic regression (not shown), likely reflecting the small size of the available data. In short, total ATAC-seq and PRO-seq signals at dREG peaks are the best predictors of active enhancer activity among the profiles tested in this study.
About half of the CRMs that overlap promoters are active in the reporter assays, indicating a high degree of enhancer activity from promoter-adjacent DNA in sea urchin embryos [33]. Nearly all these promoter-overlapping CRMs were previously shown to be active in both orientations [33], demonstrating bona fide enhancer activity. We further confirmed that the concurrent and divergent orientation of promoter-overlapping CRMs in reporter constructs were evenly represented in our data set and did not correspond with significant enhancer activity differences (Fig. S3), excluding the relevance of confounding effects due to transcription initiation at the CRMs followed by elongation into the reporter. The sizes of promoter-overlapping CRMs (Fig. 4A) suffice to include both distal and proximal TREs, including promoters. Interestingly, ATAC and dREG models trained with the entire data set (Fig. 4F) underperformed relative to Pol II ChIP based models in the prediction of enhancer activity of the promoter-overlapping CRM subset (Fig. 4G, top). In the complementary analysis, the Pol II ChIP model trained with the entire set further underperformed relative to ATAC and dREG models in the prediction of CRMs not overlapping promoters, while ATAC and dREG maintained performance similar to predictions with the entire set (Fig. 4G, bottom). The exclusion of the 41 promoter-overlapping CRMs from the training and testing data set decreased the prediction performance of all models in both stages (Fig. S4 A and B). Overall performance was broadly similar between ATAC and dREG models trained and tested with the promoter overlapping or non-overlapping in 12 h embryos (Fig. S4 A and C). In contrast, ATAC and dREG models trained with CRM-overlapping promoters failed to predict the activity of their hold out set and were outperformed by Pol II ChIP models in the 20 h data set (Fig. S4 D). All of the above, reveals that the enhancer activity predictive power of ATAC-seq and PRO-seq for promoter-proximal CRMs dramatically devalues during the 12 to 20 h transition.
The larger proportion of positive enhancers among CRMs that overlap promoters relative to CRMs not overlapping promoters, ~ 50% vs. ~ 13% in 20 h embryos, is not surprising given the bias for regulatory genes active during development of this data set [33] combined with the general trend of enhancers to be near their promoter targets [47]. The enhancer activity of promoters has precedents [48, 49] and it is perhaps not surprising for evolutionary reasons [2].
We tested if the functional genomic datasets could predict the levels of reporter enhancer activity of CRMs. In all cases, better linear regression model prediction was obtained with non-promoter overlapping CRM sets. The best performing model included the ATAC-seq plus the ATAC:dREG interaction, which explained about one third of the expression variation (average R2 = 0.29) in 20 h embryos (Fig. 5). ATAC-seq was a better predictor of enhancer activity in 20 h embryos relative to 12 h embryos (R2 = 0.26 versus R2 = 0.17, p-value < 2.2 e-16), and outperformed dREG in 20 h embryos (R2 = 0.17, p-value = 7.4 e-16) (Fig. S5). The difference in dREG model performances between stages or with ATAC-seq models in 12 h embryos was not significant due to the small size of the dataset. Nevertheless, the relative enhancer predictive power of ATAC-seq and PRO-seq is stage-dependent. The predicted value of most CRMs generally varies with the training set, as expected. However, there is a group of CRMs that are consistently and erroneously predicted as barely active (Fig. S5) due to their low signals in all assays (not shown). Highly active reporter constructs with low predictor signals may result from not uncommon miss-regulation outside the endogenous genomic context [4], which may cause ectopic expression, such as the one observed for SpHox11/13b module D [24], or it may reflect mismatches between the time points, especially at the 20 h stage.