Maps of direct TF-DNA interactions across nine species
Prediction of direct TF-DNA interactions
We aimed at providing a collection of direct TF-DNA interactions by combining experimental and computational approaches in several species. We applied an updated version of the ChIP-eat pipeline [14] to ChIP-seq datasets to discriminate high-confidence TFBSs within ChIP-seq peaks from indirect binding events and ChIP-seq noise/artifacts (see Methods). In a nutshell, ChIP-eat uses an entropy-based parameter-free algorithm to automatically define an enrichment zone, which contains TFBSs with high PWM scores and close proximity to ChIP-seq peak summits. These criteria provide computational (high PWM score) and experimental (proximity to peak summit) support for direct TF-DNA interactions. This process is carried out in a ChIP-seq dataset-specific manner. It first optimizes JASPAR PWMs [24] using the DAMO tool [25], which adjusts the PWMs through a perceptron algorithm to best discriminate ChIP-seq peaks from random sequences (see Methods). Next, the optimized PWMs are used to detect, for each dataset, the optimal thresholds on the PWM score and distance to the peak summits. These thresholds define the enrichment zone, which highlights direct TF-DNA interactions (see Methods and [14] for more details).
We collected ChIP-seq peaks for 11,373 ChIP-seq experiments from ReMap 2018 [26] and GTRD [5] for nine species: Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Homo sapiens, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. For 10,264 datasets, we were able to associate a TF binding profile in JASPAR with the ChIP’ed TF. The ChIP-eat pipeline was applied to each ChIP-seq peak dataset - JASPAR PWM pair independently to predict TFBSs. ChIP-eat identified enrichment zones to predict direct TF-DNA interactions in 9654 datasets. Altogether, this analysis culminated with the prediction of ~ 72 million TFBSs in ChIP-seq peaks for 841 TFs in 1316 cell lines and tissues (Supplementary Figure 1; Supplementary Table 1).
We provide these predictions through the UniBind database at https://unibind.uio.no/ (see section “UniBind web-application and web-services” for details). In the database, the datasets are annotated with information about the ChIP’ed TF (UniProt ID [27]), the cell line or tissue name with ontology IDs from Cellosaurus [28], Cell Line Ontology [29], Experimental Factor Ontology [30], UBERON [31], Cell Ontology [32], and BRENDA [33] whenever possible, and the treatment used, if any.
Quality control to establish a robust collection of direct TF-DNA interactions
In UniBind, we aimed to create a robust collection of bona fide direct TF-DNA interactions found in high-quality ChIP-seq peak datasets. This robust collection was obtained by implementing two quality control metrics and only retaining the datasets that satisfy the corresponding criteria. First, we expect high-quality ChIP-seq peak datasets to be enriched for the TF binding motif known to be bound by the ChIP’ed TF. Hence, we filtered out datasets where the DAMO-optimized TF binding motif, which maximizes the discrimination of ChIP-seq peaks from random sequences, was not similar to the expected canonical motif (see Methods). Second, we expect the ChIP-seq peaks to be enriched for TFBSs close to their summits. Hence, we filtered out the datasets where the predicted direct TF-DNA interactions did not show a significant enrichment around the summits (see Methods). While we provide the complete set of TFBSs predicted by ChIP-eat in the permissive collection to the community, we specifically contribute with the robust collection of quality-controlled direct TF-DNA interactions in high-quality ChIP-seq peak datasets.
After applying the quality-control filters, the robust collection of UniBind culminates with ~ 56 million TFBSs obtained from 6902 ChIP-seq peak datasets, all species combined (Fig. 1A; Supplementary Table 2). Note that none of the five datasets from S. pombe passed the quality-control criteria due to a lack of enrichment around the ChIP-seq peak summits. The TFBSs in the robust collection are associated with 644 distinct TFs ChIP’ed in 1096 cell lines and tissues (Fig. 1A; Supplementary Table 2). We found that the predicted TFBSs cover between 0.04 and 6.05% of the genome of their respective organism (Fig. 1B). For example, human and mouse TFBSs cover 6.05 and 5.39% of the genomes, respectively (Fig. 1B). Of course, these numbers are somehow a reflection of the number of ChIP-seq experiments available in the corresponding species (Supplementary Figure 2).
Since TFs are known to regulate transcription cooperatively through locally enriched TFBSs [2], we aimed to identify cis-regulatory modules (CRMs) corresponding to clusters of TFBSs. Specifically, we used CREAM [34] to locate DNA segments with local enrichment for UniBind TFBSs, which culminated with the predictions of > 197,000 CRMs (Supplementary Table 2).
With many TFs associated with multiple ChIP-seq datasets and similar TF binding profiles for TFs sharing DNA binding domains (DBDs) from the same structural class, the TFBS collection contains redundant instances. We aimed to reduce redundancy of the TFBS information to facilitate visualization, analyses, and interpretation [10]. Following the approach developed by Vierstra et al. [10], we defined TF binding archetypes representing similar TF binding profiles for TFs sharing DBD structural classes (see Methods). This approach allowed to identify a single TFBS location from several overlapping TFBSs predicted from TF profiles in the same archetype. Supplementary Figure 3 depicts a comparison between original and archetypal TFBSs at an exemplary genomic loci.
To summarize, we provide a collection of TFBSs with both experimental and computational support for direct TF-DNA interactions in quality-controlled ChIP-seq peak datasets. Hereafter, the complete collection of unfiltered TFBS predictions is referred to as the “permissive” collection, while the filtered, high-quality TF-DNA interactions are referred to as the “robust” collection.
Support for the functional relevance of the TFBSs in the robust collection of UniBind
To further confirm the high-quality of the identified TFBSs in the robust collection of UniBind, we sought to provide support for their biological relevance. Hence, the analyses performed below were applied to the complete robust collection of TFBSs, except when explicitly stated otherwise.
Human and mouse TFBSs are evolutionarily conserved
We hypothesized that functionally relevant TFBSs should be enriched for evolutionary conservation. Indeed, conservation of DNA segments through evolution represents a hallmark of functional importance [35]. We considered evolutionary conservation scores in the human and mouse genomes computed by the PhyloP [36] and PhastCons [35] methods from the PHAST package [35]. Specifically, we investigated the average conservation of 2 kilobases (kb) DNA regions centered around the TFBS mid-points. Both scores estimate the probability of each nucleotide to belong to a conserved element [35, 36]. While phyloP scores reflect conservation of each nucleotide, phastCons scores consider flanking nucleotides to measure evolutionary acceleration (negative scores) and conservation (positive scores). For both human and mouse, we noticed that evolutionary conservation gradually increased when the distance to the TFBSs decreased, with sharp peaks of higher conservation at the TFBSs (Fig. 2). Increased evolutionary conservation was similarly observed at CRMs (Supplementary Figure 4). The signal was consistently found when considering multiple alignments of 19 (phyloP20way and phastCons20way, Fig. 2A) or 99 vertebrate genomes (phyloP100way and phastCons100way, Fig. 2A) to the human genome and 59 vertebrate genomes (phastCons60way and phyloP60way, Fig. 2B) to the mouse genome. The evolutionary conservation of TFBSs is not expected by chance as no conservation was observed when randomly shuffling the positions of the TFBSs in the human and mouse genomes (Fig. 2, grey lines). The acute increase of evolutionary conservation scores right at the TFBS locations reinforce the biological relevance of the direct TF-DNA interactions stored in UniBind.
To further investigate evolutionary conservation, we evaluated the conservation of predicted TFBSs at conserved elements between human and mouse. We lifted the mouse robust archetypal TFBSs over to the human genome and assessed their proximity to human TFBSs from the same archetype (see Methods). Next, we evaluated the relative distances between mouse archetypal TFBSs lifted over to the human genome and human TFBSs from the same archetype using bedtools reldist following [37, 38]. Across TF binding archetypes, we observed an enrichment for lifted mouse TFBSs to overlap human TFBSs (see the peak at distance 0, corresponding to an overlap, in Fig. 2C). The relative distances between mouse TFBSs lifted over to the human genome and human TFBSs confirm the enrichment for conservation of TFBSs between human and mouse associated with TFs sharing DBD structural classes.
Finally, we assessed the added value of the ChIP-eat approach to predict TFBSs over raw PWM mapping genome-wide. Specifically, we compared the evolutionary conservation of the TFBSs from the UniBind robust collection to TFBSs solely predicted from raw PWMs from JASPAR (see Methods). Supporting the functional relevance of UniBind TFBSs, we observed that UniBind TFBSs were significantly more evolutionarily conserved than TFBSs predicted from raw PWMs (Fig. 2D).
Altogether, these results pointed to the likely functional role of the TFBSs in transcriptional regulation of gene expression.
UniBind TFBSs are enriched at active promoters and enhancers
Next, we sought support for the biological relevance of the TFBSs by assessing their overlap with cis-regulatory regions that are active in different cell types and tissues. We started by mapping out the distribution of the TFBSs with respect to promoter regions, 5′ and 3′ UTRs, exons, introns, regions downstream of genes, and distal intergenic regions (Fig. 3, top bar for each species). These distributions were compared to random expectations obtained by shuffling the TFBS positions along the corresponding genomes (Fig. 3, bottom bar for each species). By comparing the observed and expected distributions, we noticed that TFBSs were prominently found in promoter regions (< 1 kb upstream of transcription start sites). The enrichment for TFBSs in promoter regions was further confirmed by (i) OLOGRAM [39], which uses a Monte Carlo simulation approach and a negative binomial model to compute the significance of overlap between two sets of genomic regions (Supplementary Figures 5, 6, 7, 8, 9 and 10), and (ii) bedtools reldist [38], which computes the relative distances between the TFBSs and the genomic regions considered [37] (Supplementary Figure 11). Nevertheless, considering the distribution of TFBSs for each TF independently in each species revealed TFs with binding preferences for promoter regions while others prefer intronic or intergenic regions (Supplementary Figures 12, 13, 14, 15 and 16). The TSS-proximal versus TSS-distal preferences could explain the previously reported short- versus long-range regulatory effects of TFs [40].
In the vertebrate species (human, mouse, rat, and zebrafish), the majority of TFBSs lie in introns and distal intergenic regions (Fig. 3), which is expected given the large portion of the corresponding genomes covered by these non-coding regions. To confirm the biological function of the TFBSs stored in UniBind, we examined their overlap with active cis-regulatory regions in the mouse and human genomes. We considered the candidate cis-regulatory elements (cCREs) predicted using epigenetic marks by the ENCODE consortium [41]. Specifically, DNase I hypersensitive open chromatin regions were first identified and then overlapped with H3K4me3 and H3K27ac histone modification marks and CTCF ChIP-seq data to predict five types of cCREs with: (1) a promoter-like signature (PLS), (2) an enhancer-like signature proximal (pELS) or (3) distal (dELS) to TSSs, (4) a H3K4me3 signature (DNase-H3K4me3), or (5) a CTCF-only signature [41]. Consistently, we confirmed that UniBind TFBSs were enriched in PLS and ELS cCREs when considering both the OLOGRAM and bedtools reldist evaluations of overlap (Fig. 4A-B; Supplementary Figures 17, 18). The enrichment at regions of active promoter signature is consistent with the genomic distribution observed above. The enrichment at regions harbouring active enhancer signature suggests that the TFBSs are not randomly spaced in the introns and intergenic regions. Furthermore, we confirmed that CRMs were enriched for cCREs with active promoter- or enhancer-like signatures when considering the 105,104 and 73,917 CRMs predicted in human and mouse, respectively (Supplementary Figures 19, 20 and 21). Figure 4C shows an example of the UCSC Genome Browser [42] at the human LDLR gene locus where we observe the overlap between UniBind TFBSs, CRMs, and cCREs.
Together, these results highlight the biological relevance of the UniBind TFBSs and CRMs for transcriptional regulation via their association with active promoters and enhancers in human and mouse.
Specificity of enhancer activity in cell types and tissues correlates with binding TF composition
We further investigated how the number of TF binding events at enhancers could be related to their regulatory effects. We considered enhancers that were identified through the capture of bidirectional transcription of enhancer RNAs (eRNAs) at their boundaries using Cap Analysis of Gene Expression (CAGE) in 1829 human libraries [43]. Cell type and tissue specificity was assessed by considering the amount of eRNAs captured by CAGE across the libraries [43]. We overlapped the UniBind TFBSs with the CAGE-derived enhancers and assessed the relationship between the expression specificity of the enhancers and the number of TFs with binding sites in these enhancers. We observed that cell type / tissue specific enhancers tend to harbour a lower number binding TFs, while more ubiquitously active enhancers tend to harbour a higher number of binding TFs (Fig. 5; Supplementary Figure 22). The correlation between the number of binding TFs and cell type / tissue expression specificity of enhancers is in line with previous observations showing an association between the number of TFBSs and the combinatorics of TFs at promoters and enhancers with enhancer activity strength and specificity [44,45,46]. Altogether, these observations underline the importance of TF cooperation for cis-regulatory activity.
UniBind TFBSs reveal TF binding combinatorics at cis-regulatory regions
We explored the capacity of UniBind TFBSs to further pinpoint relevant TF binding combinatorics at cis-regulatory regions. As a case study, we examined the direct TF-DNA interactions stored in UniBind and derived from ChIP-seq experiments in the untreated MCF7 cell line. This cell line is representative of estrogen receptor positive (ER+) invasive ductal breast carcinoma, which is known to be mainly driven by the combined activity of the TFs ESR1, GATA3, and FOXA1 [47]. We extended the genomic locations of UniBind TFBSs predicted in MCF7 by 50 bp on each side and intersected these regions between each pair of MCF7 TFBS datasets using the Intervene tool [48]. Next, we computed the fractions of overlap for each pair and calculated the pairwise Pearson correlation coefficients of the fractions of overlap between all pairs of datasets. A high pairwise correlation coefficient between two datasets indicates that the underlying TFBS regions are co-localizing. Hierarchical clustering of the pairwise correlation coefficient revealed 4 main clusters (Fig. 6). As expected, we observed high correlations between datasets for the same TF (e.g. red cluster in Fig. 6 with exclusively CTCF TFBSs). The largest cluster (Fig. 6, green) was mainly composed of TFBSs from ESR1, FOXA1, and GATA3. Co-localization of binding events for these TFs confirm the potential of UniBind TFBSs to highlight TFs known to cooperate at cis-regulatory regions. The second largest cluster (Fig. 6, blue) contained TFBSs for E2F1, NRF1, MAX, MYC, ELK1, ELF1, GABPA, EGR1, and SRF. Among these TFs, MAX and MYC as well as ELK1 and SRF are known to dimerize to bind DNA. Finally, the purple cluster was composed of JUN and FOS TFBSs, known to bind DNA as a dimer to form the AP1 complex. This case study exemplifies how UniBind TFBSs can be used to derive biologically relevant information about TF binding combinatorics.
Altogether, the assessments of the functional and biological relevance to study transcriptional regulation outlined here support, a posteriori, the high-quality of the direct TF-DNA interactions stored in the robust collection of UniBind.
UniBind web-application and web-services
Accessing and exploring UniBind data
All the direct TF-DNA interactions from the permissive and robust collections are freely available through the UniBind web-application at https://unibind.uio.no. The predictions come with metadata about the associated ChIP-seq experiments and external links to useful resources such as ReMap [26], GeneCards [49], and GEO [50]. Users can search and explore the data through the user-friendly web-interface. The web-application provides a search interface for users to filter the datasets using the metadata fields and search results are downloadable as a metadata table as well as FASTA and BED files for the TFBSs. To improve the searchability of the data, the search engine supports gene synonyms when searching for TFs. All data can be downloaded for individual datasets as well as through bulk download links per species or collection. In addition, we developed a RESTful API (https://unibind.uio.no/api/) to allow programmatic access to the stored data from any programming language. Finally, we built genome track hubs that are easily visualized through the UCSC [51] and Ensembl [52] genome browsers. The track hubs can be accessed through the UniBind web-application (https://unibind.uio.no/genome-tracks/) as well as through the public track hubs at UCSC [51] and the track hub registry (https://trackhubregistry.org/).
TFBS sets enrichment application tool
A regular task when studying transcriptional regulation is to find TFs that are the most likely to control the activity of a set of cis-regulatory regions. Classical strategies rely on the prediction of enriched potential TFBSs for a set of TFs derived from either ChIP-seq peaks datasets [53,54,55,56] or PWM predictions [55, 57]. As UniBind stores TFBSs with both ChIP-seq and PWM evidence of direct TF-DNA interactions, one can rely on this resource to infer the TFs likely to bind a set of cis-regulatory regions. The method consists in computing the enrichment for specific TFBS sets in given DNA regions compared to background regions. We provide a web-service (and the underlying source code) to perform this TFBS dataset enrichment analysis to the users at https://unibind.uio.no/enrichment/ (Fig. 7A). The enrichment computation relies on the Locus Overlap Analysis (LOLA) tool [58]. The enrichment tool provides three different types of enrichment analyses: (1) using a provided universe of potentially bound regions; (2) comparing enrichment with another set of genomic regions to perform differential enrichment; or (3) comparing the enrichment to all TFBSs stored in UniBind as a universe (Fig. 7A).
As a case example, we applied the enrichment tool to genomic regions surrounding CpGs found to be demethylated in ER+ breast cancer patients [59]. As a background set, we used all CpG probes from the Illumina Infinium HumanMethylation450 microarray. The demethylated CpG regions in ER+ patients were predicted to be bound by FOXA1, GATA2, GATA3, ESR1 and AR (top 5 TFs, Fig. 7B). The enrichment of these TFs is in line with the known ER+ TF drivers [59]. Further, the enrichment tool allows users to filter the results by restricting the search to TFBS datasets derived from specific cell lines / tissues. In our case study, limiting to breast-related cell types and tissues highlights FOXA1, GATA3, and ESR1 with the most enriched TFBS sets (Fig. 7C-D), which is in agreement with the driving role of these TFs in ER+ carcinogenesis.