The 1,000 bp binding peaks for cooperative TFs in different datasets have extensive overlaps
After quality-control filtering of the 9,060 collected TF ChIP-seq datasets (Table S1) (Materials and Methods), we ended up with 8,884 datasets containing at least 20 binding peaks for 696 TFs in 435 cell line/tissue/organ types. As in the case in humans [50, 61], these datasets are high biased to a few well-studied cell/tissue types (Fig. 1A) and TFs (Fig. 1B). For example, 1,020, 504 and 545 datasets were collected from mouse embryonic stem cells, epithelial cells and macrophage in bone marrow, respectively, while only one dataset was generated from 68 cell/tissue types, including pancreas beta cell MIN6B1, superior cervical ganglion, hepatocellular carcinoma, and so on (Table S1). Moreover, 460 and 160 datasets were collected for TFs Ctcf and SpI1, respectively, while just one dataset was produced for 131 TFs, such as Tfcp2, Nelfb, Hoxd11, and so on (Table S1). The number of remaining binding peaks in a dataset vary widely, ranging from 20 to 110,347, with an average of 15,359 peaks (Fig. 1C). The length of the called binding peaks also vary widely, ranging from 21 to 11,047 bp with a mean of 315 bp, but the vast majority of them (98.7%) are shorter than 1,000 bp (Fig. 1D). We extracted 1,000 bp genomic sequences centered on the summits of the called binding peaks for motif-finding to identify motifs of both the ChIP-ed TFs and their cooperators in each dataset [50, 62]. Therefore, we extended the lengths of most (98.7%) of the originally called binding peak. We have previously shown that such extension (~ 1,000 bp) of called peaks does not affect finding the motifs of ChIP-ed TFs, which typically reside in the middle of the peaks, but allows to find motifs of cooperative TFs, which can reside anywhere along the extended peaks [50, 62].
In theory, the larger the number of TF ChIP-seq datasets available and used, and the less bias of the datasets to few TFs and cell/tissue types, the better predictions that dePCRM2 can achieve [50, 61]. To see whether such the highly biased datasets include enough datasets for cooperative TFs that are reused in different cell/tissue types, an assumption upon which dePCRM2 is designed for predicting CRMs and constituent TFBSs [50], we calculated an overlapping score \({S}_{o}\) (formula 1) for each pair of the 8,884 filtered datasets, and hierarchically clustered them. As show in Fig. 1E, there are numerous overlapping clusters among the datasets which are either for largely the same TFs that were ChIP-ed in different cell/tissue types, or for different known cooperative TFs that were ChIP-ed in the same and/or different cell/tissue types. For example, as seen in the human datasets [50], a cluster is formed by 50 datasets for cooperative TFs Ctcf, Rad21, Stag1, and Smc1A in various cell/tissue, reflecting the conserved cooperative relationships of the TFs in forming the cohesin complex [63]. Shown in Fig. 1F is another example of cluster formed by 88 datasets for 20 TFs in various cell/tissue types, many of these TFs are known or likely to collaborate with each other according their physical interactions documented in the BioGRID [64] and reactome [65] databases (Fig. 1G). Therefore, notwithstanding these datasets are highly biased to few TFs (Fig. 1A) and cell/tissue types (Fig. 1B), they include datasets of many cooperative TFs that are reused in various cell/tissue types. The 1,000 bp peaks in all the 8,884 datasets contain a total of 136,441,496,000 bp, which is 50.1 times the size of the mouse genome (version mm10/GRCm38), but cover only 2,178,603,271 bp (79.9%) of the mappable genome (2,725,521,370 bp). Compared with the originally called peaks that cover a total of 1,398,035,305 bp (51.3%) of the mappable genome, we substantially increased the coverage of the genome by extending the called peaks to 1,000 bp, the size of shorter enhancers [60]. dePCRM2 will predict which DNA segments in the 79.9% genome regions covered by the 1,000 bp peaks are CRM candidate (CRMCs), and which are non-CRMCs, based on cooccurring patterns of putative TFBSs of motifs found in the binding peaks in all the datasets.
Most of identified unique motifs (UMs) resemble known motifs and show intensive cooccurring pattens
dePCRM2 [50] starts by identifying all possible motifs in each dataset using ProSampler, an ultrafast motif finder [62]. ProSampler finds at least one motif in 8,294 (93.4%) of the 8,884 datasets, with a total of 1,062,339 motifs found. As shown in Fig. 2A, the number of motifs found in a dataset increases with the number of peaks in it, but becomes stabilized around 250 when the number of peaks is above 50,000. dePCRM2 next identifies co-occurring motifs pair (CPs) as potential motifs, thereby filtering out most spurious motifs. To do so, dePCRM2 computes a co-occurring score \({S}_{c}\) (formula 2) for each pair of motifs in each dataset and selects the pairs with high scores as CPs. As in the case of human genome [50], the \({S}_{c}\) scores show a trimodal distribution (Fig. 2B). dePCRM2 selects motifs pairs as PCs that account for the mode with the highest \({S}_{c}\) scores (\({S}_{c}\) > 0.7 by default). More specifically, dePCRM2 identifies 4,028,221 CPs containing 225,809 (21.3%) potential motifs from 7,076 (85.3%) of the 8,294 datasets, while filtering out the remaining 1,218 (15.7%) datasets where no CPs are kept, and 836,530 (78.7%) possible spurious motifs. Many motifs in different CPs can be sub-motifs of the same TF, or of different members of a TF family that recognize highly similar motifs [66, 67]. Therefore, dePCRM2 clusters the 225,809 motifs in the 4,028,221 CPs by constructing a graph whose nodes are the motifs and edges are the SPIC similarity score [68] between the motifs pairs, and then cutting the graph into dense subgraph as clusters of similar motifs. This results in 276 clusters, each containing from 28 to 49,308 motifs (Figure S1A). From these 276 motif clusters, dePCRM2 identifies 238 unique motifs (UMs) (Figure S1B). The UMs contain highly varying number of TFBSs, ranging from 72 to 14,025,382 with an average of 1,107,677 (Fig. 2C). The lengths of the UMs range from 10 to 20 bp with a mean of 10.3 bp, and are in the range of the lengths of known TF binding motifs (Fig. 2D). The bias of the lengths of UMs to 10 bp is due to the limitation of ProSampler that needs to be improved. As expected, the UMs and their member motifs are highly similar to one another. For example, the 11,799 member motifs of UM41 form a dense subgraph/cluster (Fig. 2E), and UM41 resembles its highly similar member motifs (Fig. 2E, F). To evaluate the UMs, we compared the 238 UMs against 875 annotated non-redundant motifs in the HOCOMOCO [69, 70] and JASPAR [71] databases using TOMTOM [72]. Of the 238 UMs, 146 (61.3%) match at least one annotated motif, and 113 (77.4%) of the 146 UMs match at least two (Table S2), suggesting that most of the UMs might represent the motifs of the same TF family/superfamily which bind highly similar motifs [66, 67]. For instance, UM41 matches known motifs of five TFs of the “Jun-related factors” family (Jund, Bach1, Bach2, Junb and Nfe2) (Fig. 2G), and five TFs of the “Fos-related factors” family (Atf3, Fosl2, Fosb, Fosl1 and Fos) (Table S2). On the other hand, the remaining 92 UMs might be novel motifs of unknown cognate TFs. We also evaluated the coverage of the UMs on motif families in the two databases [70, 71], and found that 82 (64.1%) of the 128 annotated TF motif families match one of the 238 UMs (Table S2), indicating that our predicted UMs recovery most of the known TF motif families.
To model cooccurring patterns of the UMs and interactions between their cognate TFs, dePCRM2 computes a cooccurrence/interaction score SINTER (formula 3) between each pair of UMs based on the co-occurrence of binding sites of UMs. As shown in Fig. 2H, there are extensive cooccurrences between the UMs and interactions of their cognate TFs. These patterns of cooccurrences of the UMs indeed reflect the interactions among their cognate TFs or TF families for transcriptional regulation. For example, in a cluster formed by 14 UMs (Fig. 2I), 10 of them (UM14, UM26, UM28, UM29, UM32, UM45, UM53, UM55, UM57 and UM116) match known motifs of TF families. More specifically, UM116 matches Msantd3, UM14 matches Ctcfl, UM26 matches Nfe2|Fosb|Atf3|Bach1|Pknox1|Jund|Nkx2-2|Jdp2|Fos|Junb|Fosl1|Fosl2|Batf|Msantd3|Bnc2|Mafk|Pbx3|Batf3|Jun, UM28 matches Zfp57|Atf3, UM29 matches Sp3|Mxi1|Nr1h4|Plagl1|Zfx|Klf3|Rfx1, and UM57 matches Nkx2-5|Fos|Fosb|Atf3|Pbx3|Junb|Jund|Pknox1|Fosl1|Batf3|Fosl2|Jun|Batf|Nkx2-2|Msantd3|Bnc2, etc. Some of these TFs are known collaborators in transcriptional regulation, such as Fos and Jun [73,74,75,76], Atf3 and Jun [77], Pbx3 and Pknox1 [78], Jun and Batf [78].
Prediction of CRMs and constituent TFBSs in the mouse genome
To predict CRMs and constituent TFBSs in the mouse genome, dePCRM2 projects the TFBSs of the UMs to the genome and links adjacent TFBSs if their distance is less than 300 bp (roughly, the length of two nucleosomes). dePCRM2 predicts each linked sequence as a CRM candidate (CRMC) and each sequence between two adjacent CRMCs in the peak-covered regions as a non-CRMC, thereby partitioning the peak-covered genome regions in two exclusive sets, CRMCs and non-CRMCs. Concretely, dePCTM2 predicts a total of 912,197 CRMCs and 1,270,937 non-CRMCs in the peak-covered genome regions, consisting of 55.5% and 24.4% of the genome, respectively. The CRMCs contains a total of 125,113,756 TFBSs, consisting of 23.9% of the genome and 42.9% of the CRMCs (Fig. 3A). Many of these TFBSs have overlaps due partially to the aforementioned limitation of our motif-finder ProSampler, although it has been shown that certain patterns of transcriptional regulation are achieved by competitive or cooperative binding of the same or different TFs to overlapping TFBSs in a CRM [79,80,81,82,83]. We connected each two adjacent overlapping putative TFBSs, resulting in a total of 38,554,729 non-overlapping putative TFBS islands with a mean length of 17 bp.
Interestingly, as in the case of human genome [50], 75.6% of genome positions of the originally called binding peaks were predicted as CRMC positions (kept-original), while the remaining 24.4% were predicted as non-CRMC position (abandoned-original) (Fig. 3A). On the other hand, 58.7% of the extended positions were predicted as CRMCs (kept-extended), while the remaining 41.3% were predicted as non-CRMC positions (abandoned-extended) (Fig. 3A). These results suggest that originally called binding peak positions may not necessarily parts of CRMs, while many flanking positions of the called peaks may be parts of CRMs. Therefore, as we concluded earlier [50], extension of the originally called peaks to roughly half of the mean length (1,000 bp) of known of CRMs (2,400 bp) [60] could greatly increase the chance of finding more CRMs in genomes.
To evaluate the CRMCs, dePCRM2 computes a \({S}_{CRM}\) score (formula 3) and a corresponding p-value for each CRMC (Materials and Methods). As shown in Fig. 3B, the distribution of the \({S}_{CRM}\) scores of the CRMCs is strongly right-skewed relative to that of the Null CRMCs with the same number and lengths of the CRMs (Materials and Methods), suggesting that the CRMCs are unlikely produced by chance. Moreover, with the increase in the \({S}_{CRM}\) cutoff α, the corresponding p value drops rapidly, while both the number of predicted CRMs with a \({S}_{CRM}>\mathrm{\alpha }\) and their coverage of the genome decrease only slowly (Fig. 3C), suggesting that most of the CRMCs have quite low p-values. More specifically, when the p-value drops precipitously from 0.05 to 1.00 × 10–6 the number of predicted CRMs and their coverage of the genome only decrease from 798,257 to 295,382, and from 55.5% to 42.1%, respectively (Fig. 3D). Moreover, with the p-value dropping from 0.05 to 1.00 × 10–6, the coverage of putative TFBSs on the genome decreases only from 23.9% to 19.3%, and their percentage in the CRMs increases only from 43.0% to 45.8% (p-value ≤ 1.00 × 10–6) (Fig. 3D). As expected, in the 0.05 ~ 1.00 × 10–6 range of p-value cutoffs, the vast majority of the predicted CRM positions (94.9 ~ 95.9%) and constituent TFBS positions (93.8 ~ 94.8%) are located in non-exonic sequences (Fig. 3D), converging 41.0 ~ 54.70% and 18.6 ~ 23.2% of their lengths, respectively (Fig. 3E). Interestingly, the remaining 4.1 ~ 5.1% of the predicted CRM positions and 5.2 ~ 6.2% of constituent TFBS positions are located in exonic sequences (Fig. 3D), a well-known phenomenon in mammal genome [84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100]. We will address these exonic CRMs and TFBS positions in great detail elsewhere.
We next compared the lengths of predicted CRMs at different \({S}_{CRM}\) cutoffs α with those of known mouse enhancers in the VISTA database [60]. As shown in Fig. 3F, the predicted CRMCs have a shorter mean length (1,682 bp) than the VISTA enhancers (2,432 bp). This is not surprising since most VISTA enhancers are involved in complex embryonic development and tend to be longer than other types of enhancers [101]. However, with the increase in the \({S}_{CRM}\) cutoff α, the distribution of the lengths of predicted CRMs shifts to right. Specifically, 252,349 (27.7%) of the 912,197 CRMCs were shorter than the shortest VISTA enhancer (330 bp), but they cover only 2.1% of total length of the CRMCs, suggesting that they are likely either short CRMs or components of full-length enhancers remained to be fully predicted using more TF ChIP-seq datasets in the future. The remaining 659,848 (72.3%) CRMCs that are longer than the shortest VISTA mouse enhancer (330 bp) consist of 97.9% of the total length of the CRMCs, and they are likely full-length CRMs. Thus, the vast majority (97.9%) of the CRMC positions are covered by predicted full-length CRMs. The predicted CRMs and constituent TFBSs are available at (https://cci-bioinfo.uncc.edu).
Predicted CRMCs tend to be under strongly evolutionary constraints
To see how the CRMCs and non-CRMC evolve, we plotted the distributions of the phyloP scores [102] of their nucleotide positions. The phyloP treats negative and positive selections in a unified manner and detects departures from the neutral rate of substitution in either direction, while allowing for clade-specific selection [102]. A positive phyloP score indicates the position is under purifying selection, a negative score indicates the position is under positive selection, and a score around zero means the position is selective neutral or nearly so. For convenience of discussion, we consider a position with a score in the range [-δ, δ] (δ > 0) to be selectively neutral, in the range (δ, max) to be under positive selection, and in the range (min, -δ) to be under negative selection, respectively. We define the proportion of neutrality of a set of position as the areas under the distribution of the scores within range [-δ, δ], and choose δ = 1 in this study. For this analysis, we focused on the CRMCs and the non-CRMCs in non-exonic sequences, because including exonic sequences would confound the analysis due to their coding functions. The distribution of the phyloP scores of the non-CRMCs peaks at the neutral range with a proportion of neutrality of 0.89 (Fig. 4A), suggesting that the non-CRMC positions are largely selective neutral as expected, although it is possible that some non-CRMC positions that are under some level of selections might have functions other than cis-regulatory. In contrast, the distribution of the phyloP scores of the CRMC positions displays a lower peak in the neutral range with a proportion of neutrality of 0.77 (Fig. 4A), and spreads to both negative selection and positive selection ranges. These results indicate that CRMC positions are more likely to be under evolutional constraints than the non-CRMC positions. Thus, the CRMCs are more likely to be functional than non-CRMCs, although some CRMC positions that are selected neutral might not be functional. Notably, the mouse VISTA enhancers are even more likely to be evolutionarily conserved than our predicted CRMCs (Fig. 4A), although the former are largely a small subset of the latter (see below). This is not surprising that the VISTA enhancers were selected for validation in transgene animal models due to their ultra-conservation [28] and thus are mainly involved in embryonic development [103, 104] Therefore, as in the case of the human genome [50], dePCRM2 is able to partition the peak-covered genome regions into a functional set, i.e., the CRMCs, and a non-functional set, i.e., the non-CRMCs.
As we indicated earlier, there are still 20.1% of genome regions that are not covered by the extended peaks. To see whether the non-exonic sequences in these peak-uncovered regions contain functional elements such as CRMs, we plotted the distribution of the phyloP scores of their genomic positions. The proportion of neutrality (0.83) of these positions is in between those of the peak-covered regions (0.78) and those of the non-CRMCs (0.89) (Fig. 4A), suggesting that they might contain functional elements, albeit with a lower density than that in the peak-covered regions. Based on the difference in the proportion of neutralities of the peak-covered and peak-uncovered regions as well as that of the non-CRMCs, we estimate that proportion of CRMC positions in the peak-uncovered regions is about [(1–0.83)-(1–0.89)]/[(1–0.78)-(1–0.89)] = 54.55% that of CRMC positions in the peak-covered regions.
As expected, the kept-original positions as well as the kept-extended positions have almost the same phyloP score distributions as the CRMCs (Fig. 4B), indicating that they all are under strongly evolutionary constraints. In contrast, the abandoned-original peak positions as well as the abandoned-extended positions have an almost identical phyloP score distributions to that of the non-CRMCs (Fig. 4B), indicating that they all are largely selectively natural or nearly so. These results strongly suggest that the kept extended positions are likely functional, while the abandoned-original positions are unlikely functional. This results confirm our earlier conclusion that originally called binding peaks cannot be equivalent to CRMs, and appropriate extension of the originally called short binding peaks can greatly increase the power of available datasets for predicting CRMs and constituent TFBSs in genomes [50].
Higher-scoring CRMs are more likely under evolutionary constraints
To investigate the relationship between the evolutionary behaviors of the CRMCs and their \({S}_{CRM}\) scores, we plotted the distribution of the phyloP score of subsets of CRMCs with \({S}_{CRM}\) scores in nonoverlapping intervals. As shown in Fig. 4C and D, with the increase in the \({S}_{CRM}\) scores, the proportion of neutrality of the corresponding CRMs first drops rapidly and then enters a gradually decreasing phase. Thus, CRMCs with higher \({S}_{CRM}\) scores are more likely under evolutionary constraints, indicating that the \({S}_{CRM}\) score captures the evolutionary behavior of a CRM. Interestingly, even the CRMCs with scores in the lowest interval [0, 1) have a lower proportion of neutrality than that of the non-CRMCs (0.87 vs 0.89) (Fig. 4C and D), suggesting that even these lowest scoring CRMCs that tend to be short (Fig. 3F) are under stronger evolution constraints than the non-CRMCs, and thus are likely functional.
Next, we examined the phyloP scores for the CRMs predicted at different \({S}_{CRM}\) score cutoffs α (or p-values). As shown in Fig. 4E and F, with the increase in the \({S}_{CRM}\) score cutoff α, the proportion of neutrality of the predicted CRMs decreases gradually, suggesting again that the \({S}_{CRM}\) score captures the evolutionary behavior of the CRMCs. As indicated earlier, even at the lowest \({S}_{CRM}\) cutoff (α = 0), the predicted CRMs (i.e., all the CRMCs) have smaller neutral composition than that of the non-CRMCs, suggesting that at least most of the CRMC are functional, and the higher the \({S}_{CRM}\) score of a CRM, the more likely it is evolutionarily constrained, and thus the more likely it is functional.
Predicted CRMs are supported by independent experimental data
We next evaluated the sensitivity (recall rate) of our CRMs predicted at different p-values for recalling four types of experimentally determined CRM-related elements, including 620 mouse enhancers documented in the VISTA database [60], 163,311 mouse promoters and 49,385 mouse enhancers determined by the FANTOM project [105, 106], and 2,208 QTLs documented in the Mouse Genome Informatics (MGI) databases [107]. Interestingly, most of these experimentally determined elements are located in the peak-covered genome regions, including 579 (93.4%) VISTA enhancers, 163,311 (99.1%) FANTOM promoters and 49,385 FANTOM enhancers (99.2%) [108], with the exception for QTLs with only 1,023 (46.3%) being located in the peak-covered regions. If a predicted CRM and an element overlaps each other by at least 50% of the length of the shorter one, we say that the CRM recovers the element. As show in Fig. 5A, with the increase in the p value (decrease in -log(p)) cutoff, the sensitivity increases rapidly and saturates at a p-value cutoff 0.05 (α ≥ 65) to 99.3%, 93.8%, 86.8%, 82.3% for recovering the VISTA enhancers, FANTOM promoters, FNATOM enhancers and QTLs, respectively. Thus, the VISTA enhancers are largely a subset of our CRMCs. In contrast, the control sequences with the matched number and lengths of the predicted CRMs at different p-value cutoffs only recall an expected proportions of the elements by chance (p < 2.2 × 10–302, χ2 test) (Fig. 5A). Figures S2A ~ S2D show examples of the predicted CRMs that recover these four different types of experimentally determined elements.
The varying range of sensitivity from 82.3% for QTLs to 99.3% for VISTA enhancers might reflect the varying reliability of methods used to characterize these four types of elements. For example, VISTA enhancers and FANTOM promoters were determined by highly reliable transgene animal models [60] and CAGE methods [109], respectively, and our predicted CRMs achieve very high sensitivity to recall them. On the other hand, FANTOM enhancers and QTLs were determined by less reliable eRNA quantification [108] and association studies, respectively, and our predicted CRMs achieve relatively low sensitivity to recall them.
To find out whether our predicted CRMs missed these unrecalled elements, or they are simply false positives due to the limitations of experimental methods used to characterize them, we compared the phyloP scores of the recalled and unrecalled elements. As shown in Fig. 5B, for all the four types of elements, the recalled elements (solid lines) tend to be under strongly evolutionary constrains like our predicted CRMs, thus are likely functional. In contrast, the unrecalled elements (dashed lines) are largely selective neutral like our predicted non-CRMCs, thus are likely false positives produced by the methods used to characterize them. Based these results, we estimated an FDR of 0.7% (100%-99.3%), 6.2% (100%-93.8%), 13.2% (100%-86.8%) and 17.7% (100%-82.3%) in VISTA enhancers, FANTOM promoters, FANTOM enhancers and QTLs, respectively.
Most of predicted CRMs might be in correct lengths
Correct characterization of the lengths of CRMs is notoriously difficult both experimentally and computationally, because even short components of a long CRM might still be at least partially functional in transgene animal models [28], and because functionally related independent enhancers may cluster with each other to form super-enhancers [110, 111], or locus control regions (LCRs) [112]. Although VISTA enhancers are by no means a gold standard set of CRMs with correctly characterized lengths [60], they are the only available set of validated enhancers in mouse. As we indicated earlier, our CRMs predicted at p-value cutoff 0.05 recall 575 (99.3%) of the 579 VISTA enhancers in the peak-covered genome regions (Fig. 5A), we thus ask whether the recalling CRMs have a length matching the recalled VISTA enhancers. To this end, we computed the ratio of the length of a recalling CRM over that of its recalled VISTA enhancer. As shown in Fig. 6A, the recalling CRMs are on average twice as long as the recalled VISTA enhancers. To see whether we over-predict the lengths of the recalling CRMs or the recalled VISTA enhancers are only shorter functional components of long enhancers, we compared phyloP scores of the 1,303,562 bp positions shared by the recalling CRMs and the recalled VISTA enhancers, with those of the 3,005,862 bp (69.75%) and 73173 bp (5.31%) positions specific to the recalling CRMs and the recalled VISTA enhancers (Fig. 6B). As expected, like our predicted CRMC positions (Fig. 4A), positions shared by the CRMs and the VISTA enhancers tend to be under strongly evolutionary constraints (Fig. 6C). Moreover, the CRM specific positions (69.75%) also tend to be under strongly evolutionary constraints (Fig. 6C) as expected, suggesting that the positions in recalling CRMs that the recalled VISTA enhancers lack might be functional. In contrast, like our predicted non-CRMCs (Fig. 4A), the VISTA enhancer specific positions (5.31%) are largely selectively neutral (Fig. 6C), suggesting that the positions in the recalled VISTA enhancers that the recalling CRMs lack might not be functional. Therefore, although the recalled VISTA enhancers are only half as long as the recalling CRMs, they might contain non-enhancer sequences that comprise 5.31% of the total length of the recalled VISTA enhancers. On the other hand, we noted that 38 (6.6%) VISTA enhancers were recalled by multiple short CRMCs, suggesting that some of short CRMCs are indeed only components of a long CRMC, whose full-length forms remain to be predicted when more TF ChIP-seq data are available in the future.
We also compared the lengths of the recalling CRMs and their recalled FANTOM5 enhancers. As shown in Fig. 6A, the recalling CRMs (median length 4,233 bp) are about 14.7 times as long as the recalled FANTOM enhancers (median length 288 bp). Moreover, 34.8% of the recalled FANTOM enhancers were located in the same CRMs. Thus, FANTOM enhancers tend to be short components of long CRMs. Taken together, these results strongly suggest that although some of our CRMCs might be short components of long CRMs, the vast majority of the CRMs predicted p-value cutoff of 0.05 might be in correct full length, while many VISTA enhancers and most FANTOM enhancers might be only a component of otherwise long enhancers.
Our predicted CRMs and constituent TFBSs are more accurate and complete than existing predictions
We further evaluated our 798,257 CRMs predicted at p-value ≤ 0.05 (\({S}_{CRM}\ge 65\)) with two sets of predicted mouse enhancers, including 339,815 cCREs predicted recently by the ENCODE phase 3 consortium [30] and 519,386 enhancers from the EnhancerAtlas database [42]. As shown in Fig. 7A, these three sets of predicted CRMs containing highly varying numbers of elements cover highly varying portions of the genomes, i.e., 55.5%, 3.4% and 81.6% by our CRMs, the cCREs and the EnhancerAtlas enhancers, respectively. Since all our CRMs are located in the peak-covered genome regions, we only consider for comparison the cCREs and the EnhancerAtlas enhancers that have at least one nucleotide position overlapping the peak-covered genome regions. As shown in Fig. 7A, the vast majority of the cCREs (339,721 or 99.97%) and the EnhancerAtlas enhancers (436,504 or 84.0%) have at least one nucleotide position overlapping the peak-covered genome regions. The cCREs and EnhancerAtlas enhancers that at least partially overlap the peak-covered genome regions cover 3.4% and 81.6% of the genome (Fig. 7A). Therefore, our CRMs in the peak-covered genome regions cover a much larger proportion (55.5%) of the genome than do the cCREs (3.4%), but a much smaller proportion of the genome than do the EnhancerAtlas enhancers (81.6%).
To see whether we over-predicted the CRMs with respect to the cCREs, or under-predicted the CRMs with respect to the EnhancerAtlas, we first identified the shared and unshared genome positions among the three sets of sequence elements. As shown in Fig. 7B, most (85,075,038 bp or 92.0%) of the cCRE positions overlap our CRM positions, but they only cover 5.6% of our CRM positions, while missing 94.4% of our CRM positions because of the much shorter total lengths of the cCREs (Fig. 7A). The remaining 8.0% of the cCRE positions do not overlap our CRM positions. A total of 1,364,995,621 bp (61.3%) EnhancerAtlas enhancer positions overlap our CRM positions (Fig. 7B), covering 90.3% of our CRM positions, while missing 9.7% of our CRM positions. The remaining 39.7% of the EnhancerAtlas enhancer positions do not overlap our CRMs.
We then compared the phyloP scores of the cCRE and EnhancerAtlas enhancer positions that they shared and unshared with our CRMs positions (Fig. 7B). As expected, like our CRM positions (Fig. 7C), both the cCRE and the EnhancerAtlas positions shared with our CRMs tend to be under strongly evolutionary constraints, suggesting that they are likely functional. In stark contrast, the eCRE and the EnhancerAtlas positions unshared with our CRMs are largely selectively neutral like the non-CRMCs, suggesting that they might be not functional, and thus are false positive predictions. These results suggest that the cCRE and EnhancerAtlas enhancer positions that overlap our CRMs are more likely to be functional, while those that do not overlap our CRMs are more likely to be false positives. Therefore, based on the proportion of the unshared positions, we estimate the FDRs of the cCREs and EnhancerAtlas enhancers to be about 8.0% and 39.7%, respectively.
We also compared sensitivity of our CRMs, EnhancerAtlas and cCREs for recalling FANTOM prompters and VISTA enhancers in the peak-covered genome regions. We choose the FANTOM promoters and VISTA enhancers for this validation because the high quality of the two datasets with an estimated FDR of 0.7% and 6.2%, respectively, based on their proportions of neutrality (Fig. 5B). As shown in Fig. 7D, our CRMs substantially outperform the cCREs for recalling the FANTOM promoters (93.8% vs 46.3%) and VISTA enhancers (99.3% vs 91.4%). However, this comparison might not be meaningful as the total length of our CRMs is 16 time as large as that of the cCREs. On the other hand, although the total length of our CRMs is only 68.0% that of the EnhancerAtlas enhancers, our CRMs outperform the EnhancerAtlas enhancers for recalling VISTA enhancers (99.3% vs 97.9%) and FANTOM promoters (93.8% vs 70.1%).
Finally, we compared the lengths of our CRMs with those of the cCREs and the EnhancerAtlas enhancers. As shown in Fig. 7E, the distribution of the lengths of the cCREs has a very sharp peak around 250 bp with a mean length of 272 bp, indicating that the cCREs have almost the same lengths, a possible artifact of the prediction methods. Both the distributions of the lengths of our CRMs and EnhancerAtlas enhancers are similarly strongly skewed toward right with a mean length of 1,893 and 4,285 bp, respectively. Since there is no gold standard set of full-length CRMs, we could not validate the length of our CRMs and EnhancerAtlas enhancers. However, based on the evolutionary constraints on our CRMs, most our predicted CRMs might be in full-length, while 39.7% of the EnhancerAtlas enhancers positions might be false positives as we argued earlier. Taken together, our results suggest that our CRMs might be more accurate and complete than both the cCREs and the EnhancerAtlas enhancers.
About 64% of the mouse genome might code for CRMs
As we indicated earlier, our predicted 912,197 CRMCs make up of 55.5% of the mappable mouse genome. To estimate the FDR of the CRMCs, we took a semi-theoretic approach as we did earlier in the human genome [50]. Specifically, we calculated the expected number of true positives and false positives in the CRMCs with a \({S}_{CRM}\) score in each of non-overlapping interval based on the density of the \({S}_{CRM}\) scores of the CRMCs and the density of the \({S}_{CRM}\) scores of the Null CRMCs (Fig. 8A), yielding 910,711 (99.84%) expected true positives and 1,486 (0.16%) expected false positives in the CRMCs (Fig. 8B). Most (1,373/1,486 = 92.40%) of the 1,486 expected false positive CRMCs have a low \({S}_{CRM}\) score < 50 (insets in Fig. 8A and B) with a mean length of 64 bp, comprising 0.004% (1,486*64 bp /2,725,521,370 bp) of the mappable genome and 0.007% (0.004/55.5) of the total length of the CRMCs, i.e., an FDR of 0.007% for the CRMC positions (Fig. 8C). Thus, our predicted true CRMCs would comprise 55.5%-0.004% = 55.496% of the genome. On the other hand, as the CRMCs miss 0.7% of VISTA enhancers in the peak-covered regions [the point at -log (p) = 0 in Fig. 5A], we assume the FNR of predicting CRMC positions to be about 0.7%. We estimate false negative CRMC positions to be 0.007*0.55496/(1–0.007) = 0.39% of the genome, which is 0.39%/24.4% = 1.60% of the total length of the non-CRMCs, meaning a false omission rate (FOR) of 1.60% for the non-CRMC positions (Fig. 8C). Hence, true CRM positions in the peak-covered regions would be 55.5%-0.004% + 0.39% = 55.89% of the genome (Fig. 8C). In addition, as we argued earlier, the CRMC density in the peak-uncovered 20.10% genome regions is about 54.55% of that in the peak-covered genome regions, CRMCs in the uncovered regions would be about 0.201*0.5589*0.5455/0.779 = 7.87% of the genome (Fig. 8C). Taken together, we estimated about 55.89% + 7.87% = 63.76% of the genome to code for CRMs, for which we have predicted 55.89/63.76 = 87.66%. Moreover, as we predicted about 42.9% of CRMs to be made up of TFBSs (Fig. 3D), we estimated about 0.429*63.76% = 27.35% of the genome to encode TFBSs. Furthermore, assuming a mean length 1,893 bp for CRMs (the mean length of our predicted CRMs at p-value \(\le\) 0.05), and a mean length of 17 bp for TFBS islands, we estimated that the mouse genome would encode about 918,010 CRMs (2,725,521,370 × 0.6376/1,893) and 43,848,829 non-overlapping TFBS islands (2,725,521,370 × 0.2735/17).