In order to test for congruence between enhancers defined by reporter gene assays, and those defined by chromatin-based assays, we compared Drosophila enhancers obtained from two sources: REDfly [19] and EnhancerAtlas2.0 [20]. The REDfly database contains over 38,000 empirically-defined enhancers manually curated from the literature, split roughly evenly between in vivo reporter gene assays and cell-culture based reporter gene assays. The EnhancerAtlas2.0 database, by contrast, is composed of 294,158 enhancers predicted by means of an unsupervised learning approach that combines data from ChIP-seq, ATAC-seq, FAIRE-seq, and other chromatin-level assays [20]. This comparison is uniquely possible for Drosophila: as the overwhelming majority of REDfly’s enhancers are based on evidence developed without consideration of chromatin-level features, they are therefore identified fully independently from those in EnhancerAtlas, using different evidence.
For an initial comparison, we took all REDfly enhancers 2 kb or shorter (11,549 total) and determined how many of these sequences overlapped an EnhancerAtlas enhancer, without regard for the annotated tissue-specificity of the REDfly enhancers (Fig. 2A; see Methods). Of the 21 tissue-specific Drosophila EnhancerAtlas datasets, 17 had statistically significant overlap (adjusted P < 0.01, two-tailed z test), three had no significant overlap, and, surprisingly, one had significantly less overlap than expected by chance (Fig. 3A “STARR-seq included”, Table S1a). Examination of these data revealed that more than three-quarters of the REDfly enhancers we were using (8779/11549, 76%) were identified using a single method, STARR-seq [21], a cell-culture based, episomal reporter gene assay; these STARR-seq enhancers made up 80% of the REDfly-EnhancerAtlas overlapping enhancers (6027/7527). When we eliminated the STARR-seq enhancers from the REDfly test set and repeated the analysis, the results were strikingly different: only five of the EnhancerAtlas datasets (24%) now showed significant overlap (with one additional dataset just below our significance threshold), whereas ten datasets (48%) had significantly less overlap than expected by chance (Fig. 3A “STARR-seq not included”, Table S1b [Additional File 1]).
STARR-seq-defined enhancers thus have a profound effect on how well sequences from the two databases compare. Only four EnhancerAtlas datasets make use of STARR-seq data, indicating that a simple confounding of the source data cannot explain the results. In the absence of STARR-seq enhancers, almost half of the REDfly datasets had less-than-expected overlap with EnhancerAtlas, suggesting that the majority of REDfly enhancers lack activity in most of the tissues covered by EnhancerAtlas. Conversely, these results suggest that STARR-seq, performed in a cultured cell line and using an episomal rather than an integrated reporter, may be identifying many sequences indiscriminately with respect to their tissue specific activity.
To examine more directly how REDfly enhancers active in specific tissues compare with the EnhancerAtlas datasets, we constructed tissue-specific REDfly enhancer sets using only in-vivo reporter gene tested enhancers < 1000 bp in length (Fig. 2B; Table S2 [Additional File 2]). Of the 21 EnhancerAtlas sets, 11 had one or more corresponding REDfly sets. We then determined how many enhancers from each paired set overlapped. Our expectation was that there should be significant overlap, with the EnhancerAtlas sets (based on undirected genome-wide assays) encompassing the great majority of REDfly enhancers (drawn from individual ad hoc experiments)(Fig. 2C). Surprisingly, only four (36%) of the eleven EnhancerAtlas sets showed significant overlap with their corresponding REDfly set (P < 0.01, Fig. 3A “all tissue-specific datasets”, Table S1c [Additional File 1]). Moreover, even for the sets with significant overlap, the number of in-common enhancers was strikingly limited (median 26%, range 5%-65%; Fig. 3B, Table S1c [Additional File 1]).
Since the EnhancerAtlas definitions integrate the data from multiple assays, we reasoned that the integration algorithm might be filtering out some of the true enhancers. To test this, we took the underlying data sets used by EnhancerAtlas (Table S1e [Additional File 1], referred to as EnhancerAtlas “subsets”; Fig. 2B) and compared them individually to the matched REDfly enhancer sets. Indeed, we saw a higher number of significantly overlapping enhancer sets (66%, Fig. 3A, Table S1d,e [Additional File 1]), but again, the number of in-common enhancers within each matched set was low (median 39%, Fig. 3B, Table S1d,e [Additional File 1]).
The low degree of in-common enhancers could represent a small number of REDfly enhancers that consistently match EnhancerAtlas enhancers, or different subsets of REDfly enhancers for each EnhancerAtlas subset. To distinguish between these possibilities, we looked at the correlation between the sets of REDfly enhancers present in individual EnhancerAtlas subsets. The sets of enhancers found in different experiments were not overall well-correlated, suggesting that distinct REDfly enhancer sets are being identified (Fig. 4, Fig. S1 [Additional File 3]). However, a clear correlation structure was evident in which assays from a particular laboratory and method tended to cluster together. For example, the REDfly enhancers from one set of experiments (series GSE102839) are highly correlated (Fig. 4, white box; mean r = 0.74 ± 0.11), and those from a different set of experiments (series GSE101827) are highly correlated (Fig. 4, yellow box; mean r = 0.67 ± 0.18). However, even though both sets contain ATAC-seq experiments, they find a disjoint group of REDfly enhancers (mean r = 0.12 ± 0.08). Although assay-specific correlation is occasionally observed (Fig. 4, yellow asterisks; r = 0.77), in other cases similar assays are only weakly correlated (e.g., Fig. 4, white asterisks; r = 0.45). EnhancerAtlas data based on ATAC-seq assays had a clear bias toward significant REDfly overlap (58/65 subsets significant; Table S1d [Additional File 1]), as did other open-chromatin assays, while ChIP-based methods tended to be more evenly distributed between significant and non-significant overlap (Table S1d [Additional File 1]). However, the fact that batch effects appear to dominate the correlation structure makes it difficult to draw conclusions as to the most effective enhancer identification methods. For instance, we note that over two-thirds of the ATAC-seq results are from a single experiment series, GSE101287, and the number of subsets with significant REDfly overlap for the ATAC-seq experiments in series GSE102441 and series GSE102839 was a more modest 60% and 50%, respectively.
These observations suggest two important conclusions: (1) identification of putative enhancers is highly dependent on not just the type of assay performed, but on the precise conditions under which it is performed; and (2) enhancer identification is reasonably robust given a particular set of assays and replicates. The fact that the sets of REDfly enhancers are stable within groups of replicate assays suggests that these enhancers are being specifically (i.e., non-randomly) found, despite the fact that, as shown above, the number of REDfly enhancers identified through chromatin-level assays is frequently indistinguishable from random expectation. Thus, these assays do appear to be able to identify reporter-gene-defined enhancers, but with low efficiency and potentially high false-positive rates.
Recently, Gao et al. released “scEnhancer”, an EnhancerAtlas-like database based on single-cell ATAC-seq data drawn from different embryonic time points [22]. Unlike the results using EnhancerAtlas, the majority of scEnhancer datasets had a significant degree of overlap with their REDfly set (45/56, 80%; Fig. 1A, Table S1f [Additional File 1]), although similar to what we observed with EnhancerAtlas, the number of overlapping enhancers was low (median 27%, range 10%-65%; Table S1f [Additional File 1]). scATAC-seq may thus represent a more promising method for enhancer detection, although a proper assessment is difficult as all of the Drosophila data currently in scEnhancer are from a single source.