Systematic profiling of tissue-specific regulatory elements (TSREs) in mouse
To systematically identify potential regulatory elements in the mouse genome, we annotated genome-wide chromatin states using a multivariate hidden Markov model called ChromHMM [42]. We constructed the model using three primary histone marks (namely H3K4me1, H3K4me3 and H3K27ac) in 22 mouse epigenomes from ENCODE [2]. These chromatin states can be broadly categorised into active promoter, weak promoter, strong enhancer and weak enhancer states (Additional file 1: Figure S1). Overall, we annotated 923,791 strong enhancer and 309,581 active promoter annotations (each being 200 bp in length) across the 22 epigenomes (posterior probability of states ≥0.95). To validate the accuracy of our predicted promoters and strong enhancers, we compared them to known promoter and enhancer elements in the mouse genome (see methods). The predicted regulatory elements achieved a recall sensitivity of 81.7% (18,543/22,707) for the promoters of protein-coding genes, and 91.2% (331/363) for enhancers. To accurately identify mouse TSREs, we implemented the previously described TAU algorithm [43, 44] to calculate the tissue specificity index (τreg) of every strong enhancer and active promoter (see methods). In total across 22 mouse tissues, 31% of all strong enhancers were shown to be highly tissue-specific (τreg ≥ 0.85) and 43% of active promoters. Both, also show a high degree of positive correlation with DNaseI hypersensitive sites (DHS) in the corresponding tissues (Pearson’s correlation, p < 2.2e-16), confirming these TSREs are highly tissue-specific (Fig. 1a-b, Additional file 1: Figure S2).
To identify mouse SEs, we used the ROSE algorithm [31] to combine tissue-specific enhancer elements within a span of 12.5 kb into cohesive units and rank them based on H3K27ac signal which distinguishes them from TEs (Fig. 1c). The enhancer elements within the cohesive units (for both categorised as SEs or TEs) are referred to as constituent enhancers (Additional file 1: Figure S2d). Using this approach, 6.6% (5082) of all cohesive units (or 24% of all tissue-specific enhancers) are SEs while 93.4% (71,824) are TEs (or 76% of all tissue-specific enhancers) (Additional file 1: Figure S2e). As expected, we found SE cohesive units are occupied on average by 2.4x H3K27ac and span large genomic regions (median size = 12.4 kb) compared to TEs (median size = 0.4 kb) (Fig. 1d-e, Additional file 1: Figure S3). The number of constituent enhancers are enriched in SEs compared to TEs (Fig. 1f). Enrichment of H3K4me1 and DHS at SEs is observed to be in agreement with H3K27ac levels (Additional file 1: Figure S4). To determine whether the high levels of histone modification activity at SEs are a consequence of the total genomic length of their cohesive units, we compared the enrichment of H3K27ac and H3K4me1 among their constituent enhancers to TEs. We find that constituent enhancers within SEs show a higher density of H3K27ac and H3K4me1 histone marks compared to TEs (Additional file 1: Figure S5a and S5b), suggesting the increased levels of chromatin activity in SEs is not a consequence of the total genomic length of their cohesive units. A similar trend was identified for RNA polymerase II indicating a potential role of enhancer RNAs (eRNAs) in enhancer activity and gene regulation, as reported in recent studies [45, 46] (Additional file 1: Figure S5c).
SEs have been found to frequently overlap the genes they regulate [21, 31]. A previous study in murine ESCs identified more than 80% of SEs and TEs to interact with their nearest active gene [47]. To explore the functional role of enhancers we associated each enhancer element to a potential target gene using a community accepted tool, GREAT [48]. We identified 3617 and 14,791 protein-coding genes associated with SEs and TEs in at least one tissue or cell type, respectively (Additional file 2). The resulting enhancer-gene associations were highly consistent with previously identified topological associated domains (TADs) (96% in cortex TADs and 93% in mESC TADs) [49] (Additional file 1: Figure S6a, Additional file 3). Similarly, 87% of associations overlapped with computationally derived enhancer-promoter units (EPUs) [6]. As expected, the majority (62.53% of SEs, 57.25% of TEs) of the tissue-specific enhancers are located within 50 kb from the transcription start sites (TSSs) of their associated genes (Additional file 1: Figure S6b-S6d). The predicted SEs, TEs and their associated genes were used for all subsequent analysis.
Typical and super-enhancers can boost tissue-specific gene expression
Previous studies in human and mouse cell types have shown SEs to be related with highly expressed genes [21], however the studies in mouse were less comprehensive and limited to a few tissues [31, 35, 39, 50]. In addition to this total-expression, a few studies have demonstrated SEs to be associated with tissue-specific gene expression in cell lines. For instance, genes associated with SEs in multiple myeloma cell lines were preferentially expressed in myeloma cells [32]. With the aim of exploring whether this association prevails genome-wide, across multiple tissue types and different enhancers, we examined the impact of these newly identified enhancers in 22 tissues. To inspect this, we utilised ENCODE RNA-Seq data. To effectively identify any common expression patterns between genes, tissues and enhancers, we constructed a dataset formed of genes expressed within a particular tissue, termed gene-tissue pairs, followed by categorisation on their type of enhancer association, hence grouping them into three classes: (1) gene-tissue pairs associated with SEs, referred to as super-enhancer class (SEC); (2) gene-tissue pairs associated with TEs, referred to as typical-enhancer class (TEC); and (3) gene-tissue pairs associated with weak/poised enhancers, referred to as weak-enhancer class (WEC).
We found that both SEC and TEC are associated with highly expressed genes in comparison to the WEC (SEC: effect size (ES) = 0.95, p < 2.2 × 10− 16; TEC: ES = 0.86, p < 2.2 × 10− 16; Wilcoxon Rank Sum Test) but that the SEC appears to have the highest level of total-expression (SEC compared to TEC: ES = 0.56, p < 2.2 × 10− 16) (Fig. 2a, Additional file 1: Figure S7a). Likewise, the SEC have higher tissue-specific expression (quantified as τexp − frac, see methods) compared to the TEC (ES = 0.62, p < 2.2 × 10− 16; Wilcoxon Rank Sum Test) or WEC (ES = 0.96, p < 2.2 × 10− 16) (Fig. 2b). To further understand tissue-specific expression of the genes within different enhancer classes, we categorised it into three levels of low, intermediate and high (see methods). We identified, 16.46% (690/4191) of SEC, 4.42% (1923/43,484) of TEC and 3.38% (230/6795) of WEC to have high tissue-specific expression (Fig. 2c, Additional file 1: Figure S7b). Further examination of the high tissue-specific expression category shows the absolute number of genes within the TEC (1923) is notably higher than in the SEC (690) or WEC (230). Overall this data suggests the ratio of genes within the SEC with high tissue-specific expression is at least 4 times larger than the genes within other enhancer classes. However, their absolute number is smaller compared to the TEC which contribute the largest amount (68%) of enhancer associated tissue-specific expression in the genome (Fig. 2d). This body of work in mouse strengthens the theory that super-enhancers can boost tissue-specific gene expression, while highlighting that high numbers of typical-enhancers, can also boost tissue-specific expression and should not be overlooked.
While identifying SEs we observed they are comprised of a large number of constituent enhancers (Fig. 1f). The average number of constituent enhancers within SEs is 13, compared to 3 in TEs. To this end, we examined whether an increase in the number of constituent enhancers results in an increase in total-expression of their associated genes. To increase the power of this analysis, we combined both the SEC and TEC into a single dataset. We correlated the frequency of the constituent enhancers (total number of constituent enhancers associated with a gene) within the combined dataset with total-expression of their associated gene, which revealed a weak positive correlation (Spearman’s correlation rho = 0.12, p < 2.2 × 10− 16) (Additional file 1: Figure S8a). To ensure this observation was not driven predominantly by one class of enhancer, we examined this correlation separately within SEC and TEC, and found no notable difference between the two classes (Additional file 1: Figure S8b and S8c). In contrast, weak-enhancer elements show little to no correlation with total-expression (Spearman’s correlation rho = − 0.03, p = 0.02) of their associated genes (Additional file 1: Figure S8d). Overall this shows that total-expression of a gene modestly increases with an increase in the number of constituent enhancers, indicating a non-additive relationship between them. This suggests that constituent enhancers appear to exert a complex, instead of a simple additive effect on the transcriptional output.
Since a gene could be related to SEs or TEs in multiple tissues, we inspected these multiple gene-enhancer associations for their effect on tissue-specific expression. For this purpose, we assessed the number of distinct tissues, where an enhancer associated with a gene occurs, which we define here as “enhancer tissue-types” (Fig. 2e). A large portion (∼78%, 2821 out of 3617) of the SEC is associated with one enhancer tissue-type, i.e. the genes are associated with SEs from one tissue (Fig. 2f). However, only 27% (3956 out of 14,791) of the TEC have one enhancer tissue-type, while the remaining 73% are associated with TEs of two or more tissues (Additional file 4 provides the list of these genes). Furthermore, we see that genes with a higher number of enhancer tissue-types are associated with low values of τexp − frac (Fig. 2g), hence increasing enhancer tissue-type association increases ubiquitous expression.
We next turned our attention to the genes which are associated with more than one enhancer tissue-type. Since these genes are associated with enhancers in multiple tissues (two or more), we sought to examine what type of enhancer has a higher propensity to adopt an “enhancer usage switch”. We define “enhancer usage switch” as the phenomenon where the enhancer usage associated with a gene could differ across multiple tissues. We use the number of constituent enhancers (within SEs or TEs) associated with a gene-tissue pair as a measure of its enhancer usage. The standard deviation of its enhancer usage across the 22 tissues was used to predict the level of “enhancer usage switch”. A gene with a large “enhancer usage switch” score refers to an enhancer usage which varies highly across the different tissues. We compared the enhancer usage switch scores between SEC and TEC with multiple enhancer tissue-types, which shows that SEC exhibit significantly higher enhancer usage switch across the tissues (ES = 0.89, p < 2.2 × 10− 16; Wilcoxon Rank Sum Test) (Additional file 1: Figure S9). The genes with a high enhancer usage switch score for SEC include: Ntm, Grm4, Foxa2, and Max, whereas the genes with a high enhancer usage switch score for TEC include: Csmd1, Ntrk3, Grin2a and Opcml (Additional file 1: Figure S10; Additional file 5). Overall, this analysis shows that both SEC and TEC display enhancer usage switch, but SE usage of a gene varies significantly more across different cell- and tissue-types compared to TE.
Enhancers drive phenotype and disease causation
Previous studies have identified SEs to be associated with genes that regulate cell identity and are therefore unlikely to be involved in a housekeeping role [21, 31]. To increase our understanding of the functional role of SE and TE associated genes we performed Gene Ontology (GO) enrichment analysis in 22 mouse tissues. Genes associated with SEs belonging to the SEC category are enriched for transcription factor binding activity (p = 10− 10), regulation of cell development (p = 10− 16) and regulation of cell differentiation (p = 10− 23) (Additional file 6). The breadth of this analysis demonstrates novel cell identity associations in unexplored tissues in the mouse. As expected, these are also important in the control and regulation of tissue or cell identity. Some examples of these novel SE associated genes include Ucp1 (responsible for generating body heat in mammals [51]) in brown adipose tissue; Gata4 (critical for heart development and cardiomyocyte regulation [52]) in heart; Cxcr2 (regulates the emigration of neutrophils from bone marrow [53]) in bone marrow; and Rbfox3 (splicing regulator of neuronal transcripts [54, 55]) in cerebellum. On the other hand, TEC appear to have different enrichments in GO analysis and are linked with genes involved in nucleotide and protein containing-complex binding (p = 10− 6), cellular protein localisation (p = 10− 7) and cell morphogenesis (p = 10− 5). Furthermore, TEC is significantly enriched for housekeeping genes (p = 2.7 × 10− 11, Odds Ratio (OR) = 1.49, 95% Confidence Intervals (CI) [1.32, 1.68]), while SEC is depleted (p = 0.012, OR = 0.82, 95% CI [0.69, 0.98]).
To further explore the regulatory function of enhancers, we investigated mouse phenotypes and human diseases associated with genes within SEC and TEC (see methods). Significant enrichment in both phenotypes and disease ontology terms in the corresponding tissue types was identified (Fig. 3, Additional file 7), suggesting a strong relationship between both SEC and TEC and resulting pathological outcomes (disease causation). For instance, genes associated with cerebellum-specific enhancers are enriched for phenotypes such as impaired coordination (q = 4.83 × 10− 8) and abnormal synaptic transmission (q = 2.46 × 10− 7), and diseases such as bipolar disorder (q = 8.52 × 10− 7) and unipolar disorder (q = 6.26 × 10− 5). Similarly, genes related to heart-specific enhancers are enriched for phenotypes like abnormal cardiac muscle contractility (q = 9.05 × 10− 16) and diseases like cardiomyopathy (q = 5.45 × 10− 14) (Fig. 3). In addition, enrichment of blood-related cancers (such as Hodgkin Disease, q = 1.90 × 10− 12; T-cell Leukemia, q = 1.41 × 10− 5) in CH12 enhancer associated genes is consistent with the idea that oncogenes are placed under the effect of strong enhancers during cancer development leading to over-expression of these genes [32, 56]. On the other hand, the WEC display either an insignificant or a weak association with phenotypes in majority of the tissues (Additional file 1: Table S1).
However, there is a marked difference in the expression patterns of SEC compared to TEC, which is not observed in their relationship with phenotypes. We explored this dichotomy further by comparing the phenotyping data from knockout mouse lines of genes in SEC and TEC across all tissues within the IMPC data. We reasoned that if SE associated genes are predominantly related to phenotype occurrence, their associated gene knockouts would cause a more severe phenotype condition (a phenotype with an increased effect size) relative to knockouts of other genes (such as those associated with TEs). We compared several standardised phenotyping procedures within the IMPC and observed a significant difference in severity only for acoustic startle and pre-pulse inhibition (ES = − 0.63, p = 0.001) (Fig. 4). However, for the majority of the procedures, we observed no significant difference in severity of phenotypes between SEC and TEC (Open field test, ES = 0.19, p = 0.13; Grip strength, ES = 0.19, p = 0.55; DEXA, ES = − 0.02, p = 0.75; Heart weight, ES = 0.16, p = 0.63; Hematology, ES = 0.16, p = 0.1). Next, we sought to examine the breadth of the phenotypes associated with SEC and TEC. For this purpose, we computed the number of top-level phenotype ontology terms associated with SE and TE associated gene knockouts from IMPC (Additional file 1: Figure S11). No notable difference is observed in the breadth of phenotypes between SEC and TEC (ES = 0, p = 0.42), indicating both SE and TE associated gene knockouts are likely to produce comparable number of phenotypes and therefore, have similar pleiotropic effects. Furthermore, we explored the mouse essential genes by retrieving all the genes from IMPC which generate a lethal knockout [57] to examine if the SEC is enriched with lethality. There is no enrichment of lethal genes among SEC (p = 0.24, OR = 1.08, 95% CI [0.88, 1.30]) and TEC (p = 0.83, OR = 0.93, 95% CI [0.79, 1.09]). Finally, using GTEx data, we compared the number of expression quantitative trait loci (eQTLs) associated with SEC and TEC and observed no significant difference in the number of cis-eQTLs associated with SEC and TEC (ES = 0, p > 0.56; Wilcoxon Rank Sum Test) (Additional file 1: Figure S12). Overall these results highlight that tissue- and cell-specific relevant traits are associated with both SEs and TEs associated genes.
Enhancer associated genes are connected in a dense interactome
Having shown that enhancer associated genes are enriched for tissue-specific traits, we hypothesised that the proportion of these with no prior phenotypic annotations related to the tissue maybe involved in disease-causing pathways. To identify novel disease-associated genes, we first analysed the protein-protein interactions (PPI) among enhancer-associated genes in each of the 22 tissues, using the STRING database [58]. Then in each network, we identified the genes currently known to be associated with the corresponding tissue-type phenotypic annotations from MGD [59], while the genes with no-prior phenotypic information were labelled as ‘novel’. For each tissue, both the known and unknown disease genes (referred to as known and novel respectively) in the PPI network of enhancer-associated genes are observed to be connected in a remarkably dense interactome (Fig. 5, Additional file 1: Figure S13). Interestingly, the novel genes (blue nodes) are highly connected with the phenotype-associated genes (pink nodes), suggesting a potential functional relationship between them. Simulating these PPI networks with random protein-coding genes showed that novel genes connect significantly more with known phenotype-associated genes, compared to randomly added genes (p ≤ 0.016, except thymus p = 0.056) (Additional file 1: Figure S14). This outcome demonstrates enhancer associated genes to be potentially engaged in the same functional pathway as the known phenotype genes and therefore, could also be linked with the corresponding phenotypes and ultimately disease causation.
Preferential transcription factor binding in super-enhancers
Enhancer regions contain many binding sites for TFs which contribute to important tissue-specific functions by regulating the target genes [60]. To investigate transcription factor binding activity within SEs and TEs, with the aim of identifying potential key regulators in each tissue, we used publicly accessible ChIP-Seq data for mouse TFs. For many TFs, the information available on their specific binding in various cell types is rather sporadic, thus we flattened all available ChIP-Seq peaks for each TF into single binding profiles referred to as “cistrome” (see methods). Next, for each cell type, we systematically identified TFs, for which cistrome peaks significantly colocalised with their corresponding active enhancers.
First, we found that TFs which have significant colocalisation with enhancer elements include regulators known to be implicated in the corresponding tissue-specific regulation (Fig. 6). For example, Spi1, with cistrome peaks colocalized with bone marrow enhancers, is implicated in myeloid and B-lymphoid cell development [61]; Gata4, with cistrome peaks colocalized with heart enhancers, is involved in myocardial differentiation and function [62]; and Dmrt1, with cistrome peaks colocalized with testis enhancers plays a key role in male sex determination and differentiation [63]. Overall, we observed cistrome peaks of 214 TFs (509 TF-tissue pairs) to significantly colocalise with TEs (with OR > 1; corrected p-value < 10− 3) and 113 TFs (148 TF-tissue pairs) with SEs across all tissues and cell types (Additional file 8). The 214 TFs colocalised with TEs included all the 113 TFs identified for SEs. Second, we observed that some TFs with cistrome peaks significantly colocalised with enhancers are expressed in a tissue-specific manner in the corresponding tissues (Additional file 1: Figure S15a). In total, we identified 56 such TFs with highly tissue-specific expression (τexp − frac > 0.85) and significant colocalisation with corresponding TEs, and 29 TFs with SEs across all tissues and cell types. Examples of such TFs include Hnf6 in liver (τexp − frac = 1), Nkx2–5 in heart (τexp − frac = 1), Gata1 in MEL cells (τexp − frac = 0.93) and Neurog2 in brain (τexp − frac = 0.98).
Overall, TF cistrome peaks were identified to significantly colocalise with both SEs and TEs, but a greater number of TFs were identified to colocalise with TEs compared to SEs. This could be explained by the relatively large number of TEs in the genome. To investigate this further, for each TF with significant enhancer localization, we computed their transcription factor binding site (TFBS) density in SEs and TEs. The TFBS density could be defined as a measure of TFBS clustering in SEs or TEs (see methods). To summarise our analysis, we counted the number of TF-tissue pairs which have significantly greater TFBS density in SEs compared to TEs, and vice-versa for TEs. Overall, we find that SEs have more TF-tissue pairs with higher TFBS density compared to TEs (Additional file 1: Figure S15b). Altogether, this data indicates that although TEs are more often colocalised by TF cistrome peaks, frequency and degree of TFBS clusters is higher in SEs.
Combinatorial learning approach for phenotype prediction
Our findings show mouse enhancer associated genes are correlated to a great extent with tissue-specific gene expression as well as phenotypes. We explored the utilisation of this dataset to infer mammalian gene-phenotype associations as has previously been done for protein-protein interaction (PPI) and gene expression data [64,65,66]. We implemented the random forest classifier to predict gene-phenotype associations from 13 different phenotypic domains, where each domain is relevant to at least one tissue type in our dataset. For this learning approach, we extracted gene features from TSRE profiles, expression data, transcription factor binding sites and protein-protein interaction data in 22 mouse tissues (Fig. 7a) (see methods). For the purpose of training this random forest classifier and maximising its learning process, we combined the SE and TE dataset together and used their constituent enhancers (or tissue-specific enhancers) to calculate the enhancer-associated gene feature. We first trained a random forest classifier on a subset of protein-coding genes using a combination of various gene features as predictor variables and the top level mammalian phenotype terms from MGD as the response variable (true positives), while genes not associated with a phenotype in MGD were considered as true negatives. This model was used to predict gene-phenotype associations in the remaining set of genes not used in the training of the model.
By integrating various features together, 10 combinations were formed, constructing 10 distinct classifiers for each phenotypic domain. The predictive power of each classifier was assessed by generating Receiver Operating Characteristic (ROC) and precision-recall (PR) curves based on 5-fold cross validation, repeated 10 times with different seeds. The classifier trained on all the gene features combined achieved the best performance with a mean AUC-ROC of 0.78 and AUC-PR of 0.27 across all the phenotype domains (Fig. 7b, Additional file 1: Figure S16, Additional file 9). However, high precision recall rates (AUC-PR > 0.35) are observed in phenotypes with a high number of known mammalian phenotype annotation counts in MGD (such as behavioural/neurological, nervous system, cardiovascular, immune and hematopoietic system, see Additional file 1: Figure S17). Focusing on predicting gene-phenotype associations within the nervous system domain, the classifier trained on all the gene features achieved the greatest mean AUC-ROC of 0.80 and AUC-PR of 0.42 (Fig. 7c). The PPI score with genes known to be associated with nervous system phenotype was identified to contribute the most in predicting nervous system gene-phenotype associations, followed by expression data in whole brain and cortex (Fig. 7d). In fact, PPI data is the most informative and the main contributor to the performance of these classifiers in all the 13 phenotypes. While the models trained solely on regulatory features have limited predictive power, they improved the performance of models when integrated with other features, suggesting regulatory data are a useful addition for modelling mammalian phenotypes.
In order to evaluate the validity of the predictions from the model, we investigated the novel gene-phenotype predictions made by these classifiers. The predictions classified as incorrect are based on the current knowledge of gene-phenotype associations, but it is possible that there are no, or little, prior knowledge about particular gene function, and thus are novel. This also leads to undermining the true predictive power of a classification model. For such reasons, the top false-positive predictions are most interesting as they could provide new hypotheses about gene function. To systematically examine the top false-positive predictions (prediction score ≥ 0.90) in each phenotype domain, we used the Open Targets Platform [67] and the DisGeNET discovery platform [68] which links potential novel genes to diseases via evidence based on genetic associations, somatic mutations, animal models, expression, pathways, drugs and text mining from literature. We identified that ~ 75% (495/659) of the false-positive predictions examined (see methods) with Open Targets and ~ 63% (338/539) with DisGeNET could be potentially associated with the corresponding disease (Additional file 1: Figure S18) and hence, could serve as potential novel disease targets. For example, out of the 76 top scoring false-positives (prediction score ≥ 0.90) examined for nervous system phenotype, 72 could be associated with nervous system disease (p = 5.00 × 10− 9) based on evidence integrated from a range of data sources by Open Targets platform. Additional file 10 provides these novel predictions for each phenotype and the evidence supporting their association with the corresponding diseases.