Overview
TADKB has five main components: browse, family view, acrossCells, search, and download. Each component is described in detail below.
Browsing component
The browse component allows users to select species, cells or cell lines, reference genomes, chromosomes, resolutions, and domain-caller methods. After a user makes a selection, all TADs that meet the criteria are displayed in a list, as shown in Fig. 1. The TADs are ordered by their starting positions in the chromosome, and the ID, start genomic position, end genomic position, and length of each TAD are displayed. Given two points on a chromosome, TADKB can also check whether the two points are in the same TAD.
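As an illustration of the same-TAD check, the following minimal sketch looks up the TAD that contains each of two genomic positions and compares the results. It is not the TADKB implementation: the function names `find_tad` and `in_same_tad`, the representation of TADs as sorted, non-overlapping `(start, end)` tuples, and the half-open interval convention are all assumptions made for the example.

```python
from bisect import bisect_right


def find_tad(tads, position):
    """Return the index of the TAD containing `position`, or None.

    `tads` is a list of (start, end) tuples sorted by start and assumed
    non-overlapping. Intervals are treated as half-open [start, end),
    which is an assumption of this sketch.
    """
    starts = [start for start, _ in tads]
    # Index of the last TAD whose start is <= position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and tads[i][0] <= position < tads[i][1]:
        return i
    return None  # position falls in a gap between TADs or outside all TADs


def in_same_tad(tads, pos_a, pos_b):
    """Return True only if both positions fall inside the same TAD."""
    index_a = find_tad(tads, pos_a)
    return index_a is not None and index_a == find_tad(tads, pos_b)


# Hypothetical TADs on one chromosome (coordinates in base pairs).
example_tads = [(0, 500_000), (500_000, 1_200_000), (1_500_000, 2_000_000)]
print(in_same_tad(example_tads, 600_000, 1_100_000))  # both in the second TAD
print(in_same_tad(example_tads, 400_000, 600_000))    # different TADs
```

Two positions that both fall in a gap between TADs are reported as not sharing a TAD, because neither lookup returns a TAD index.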
Once the user clicks a TAD, the main information page of that TAD is displayed, as shown in Fig. 2. This information page contains: the Hi-C 2D visualization with TAD annotations and 1D tracks (genes and various histone modifications from the Roadmap Epigenomics project [49]) rendered via Juicebox.js [55]; the reconstructed (MDS-based) 3D structure of the selected TAD; the 3D structure of its chromosome with the selected TAD highlighted (shown after clicking the corresponding tab); the 3D structure of its chromosome in single cells with the selected TAD highlighted (currently available only for GM12878); the number of protein-coding genes and the lncRNAs (from NONCODE, LNCipedia, and lncRNAdb) located in the selected TAD; and the loops or peaks detected in the selected TAD, which usually indicate promoter-enhancer interactions.
When a user clicks the tab of 3D structure of the chromosome, the 3D structure of the chromosome will be displayed with the selected TAD highlighted. Figure 3 shows an example page of single-cell chromosomal 3D structure. This function allows users to know the 3D location of the selected TAD in the chromosome.
When a user clicks the panel for protein-coding gene information, a new page is displayed, as shown in Fig. 4 for the MDS-based 3D structure of the TAD using population Hi-C and in Fig. 5 for the single-cell structure of the TAD using single-cell Hi-C. The user can select the coding gene(s) of interest, which are then highlighted in the 3D structure of the TAD. In this way, the user can see whether two genes are spatially proximate. The annotations of the selected coding gene(s) are automatically listed in the panel on the right, which contains the Ensembl gene ID, all transcript IDs, all protein IDs, the description, the gene start position, the gene end position, and a link to additional information. Clicking the additional information link redirects the user to the annotation page on Ensembl.
Once the user clicks the lncRNAs page, all the lncRNAs defined in NONCODE, LNCipedia, and lncRNAdb are listed. Similarly, when a user selects any lncRNA(s), the annotations are displayed in the panel on the right, as shown in Fig. 6. For each lncRNA, TADKB provides the start and end locations, the predicted functions, the binding proteins and class predicted by lncRNAtor [56], the number of exons, the transcripts, and links to the three major lncRNA databases for more details. An important feature of TADKB is that it combines these three lncRNA databases. Each database has its own scheme for assigning IDs to lncRNAs, which makes it inconvenient for biologists to cross-reference the definitions across them. In TADKB, the definitions or IDs for the same lncRNA are combined, and the IDs from the other lncRNA database(s) are shown in the “Alternative lncRNAs” drop-down list in the panel on the right. Figure 6 shows an example of a lncRNA in NONCODE that also overlaps a lncRNA definition in LNCipedia.
When the user clicks the Loops/Peaks tab, all the peaks are displayed, as shown in Fig. 7. Loops or peaks can indicate enhancer-promoter interactions. The selected peaks are highlighted in the 3D structure of the TAD. If the user has previously highlighted coding gene(s) or lncRNA(s), they can also see whether a peak exists between those genes or lncRNAs.
Under the Fold enrichment of chromatin states tab, users can see the fold enrichment of each chromatin state as shown in Fig. 8. Rows with red color indicate that fold enrichment of that state is larger than one (i.e., enriched for the state), whereas blue color highlights the depleted chromatin states.
TAD family component
As described in the construction and content section, we used the spectral clustering algorithm to cluster the TADs in a cell type based on their structural and chromatin-state similarities. Since spectral clustering needs the number of clusters as input, we predefined three numbers of clusters (i.e., 10, 20, and 30) for chromatin-state clustering, using Pearson’s correlation between two TADs’ fold enrichments of chromatin states as the similarity, and four numbers of clusters (i.e., 2, 3, 5, and 10) for structural clustering, using the TM-score between two TADs’ MDS-inferred 3D structures as the similarity. After obtaining the chromatin-state clusters, we gathered all TADs in the same cluster, computed their fold enrichment for each chromatin state, and found that each cluster has a unique state enrichment pattern. We also found that some clusters are clearly enriched for most of the states (log2 of fold enrichment larger than zero), whereas others are heavily depleted of chromatin states (log2 of fold enrichment less than zero). An example for GM12878 (with the number of chromatin-state clusters equal to 20) can be found in Fig. 9(a), which shows that at least three clusters are clearly depleted of chromatin states (i.e., clusters 2, 12, and 20).
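The chromatin-state similarity described above can be sketched as a pairwise Pearson correlation matrix over the TADs' fold-enrichment profiles. The code below is a minimal pure-Python illustration, not the pipeline used for TADKB: the function names `pearson` and `similarity_matrix` and the toy profiles are hypothetical. The resulting matrix is the kind of precomputed similarity a spectral clustering implementation would take as input, together with the chosen number of clusters; how negative correlations are handled before clustering is not specified here and is left open.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles.

    Raises ZeroDivisionError if either profile has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def similarity_matrix(profiles):
    """Symmetric matrix of pairwise Pearson correlations between profiles.

    Each profile is one TAD's list of fold enrichments, one per chromatin state.
    """
    n = len(profiles)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = pearson(profiles[i], profiles[j])
    return sim


# Hypothetical fold-enrichment profiles for three TADs over four states.
toy_profiles = [
    [2.0, 1.5, 0.5, 0.2],
    [1.8, 1.6, 0.6, 0.3],
    [0.3, 0.4, 1.9, 2.2],
]
for row in similarity_matrix(toy_profiles):
    print([round(value, 2) for value in row])
```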
We compared the clusters of chromatin states and 3D structures and found that there are overlapping TADs, that is, the same TADs were found in both types of clusters. The example shown in Fig. 9(c) has 20 chromatin-state clusters and 5 structural clusters. We then normalized the number of overlapping TADs by the sizes of the two clusters involved to obtain the overlapping TAD enrichment, which is insensitive to cluster size. For example, the number of overlapping TADs between structural cluster 1 and chromatin-state cluster 1 is 18 (see Fig. 9(c)). This value was divided by 167 (the size of chromatin-state cluster 1) and then by 338 (the size of structural cluster 1), giving approximately 0.00032; multiplying by 1000 for better visualization gives the final value of approximately 0.3 (see the bottom-left value in Fig. 9(b)).
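The normalization in this worked example can be reproduced in a few lines. The sketch below assumes each cluster is represented as a set of TAD identifiers; the function name `overlap_enrichment` and the synthetic integer IDs are illustrative only, chosen so that the cluster sizes (167 and 338) and the overlap (18) match the numbers quoted above.

```python
def overlap_enrichment(chromatin_cluster, structural_cluster, scale=1000):
    """Overlap count divided by both cluster sizes, then scaled for display.

    Both arguments are sets of TAD identifiers. `scale` defaults to 1000,
    the factor used for visualization in the text.
    """
    shared = len(chromatin_cluster & structural_cluster)
    return shared / len(chromatin_cluster) / len(structural_cluster) * scale


# Synthetic IDs: 167 TADs in the chromatin-state cluster, 338 in the
# structural cluster, with exactly 18 TADs in common.
chromatin_cluster_1 = set(range(0, 167))
structural_cluster_1 = set(range(149, 149 + 338))

value = overlap_enrichment(chromatin_cluster_1, structural_cluster_1)
print(round(value, 1))  # 0.3
```

Dividing by both sizes keeps large clusters from dominating the heat map simply because they contain more TADs.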
From Fig. 9(a) and (b), we observed that most of the TADs in the chromatin-state clusters that are depleted of chromatin states (e.g., clusters 2, 12, and 20) can be found in the second and third structural clusters, especially the third. We tested all possible combinations of cluster numbers (i.e., one of 10, 20, and 30 as the number of chromatin-state clusters and one of 2, 3, 5, and 10 as the number of structural clusters) for TADs detected by DI at 50 kb resolution in the six human cell types GM12878, HMEC, HUVEC, IMR90, K562, and NHEK (6 × 3 × 4 = 72 heat maps; all can be downloaded from the TADKB website). We observed the same pattern in every case: most of the TADs in the chromatin-state clusters that are depleted of chromatin states can be found in one or two structural clusters, suggesting that this observation is unlikely to be coincidental. This observation may provide a novel way to connect TADs’ 3D structures with DNA functions indicated by chromatin states.
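The 72 configurations are simply the Cartesian product of the six cell types, the three chromatin-state cluster numbers, and the four structural cluster numbers. A short sketch of the enumeration follows; the constant and function names are assumptions made for illustration.

```python
from itertools import product

CELL_TYPES = ["GM12878", "HMEC", "HUVEC", "IMR90", "K562", "NHEK"]
CHROMATIN_STATE_CLUSTER_NUMBERS = [10, 20, 30]
STRUCTURAL_CLUSTER_NUMBERS = [2, 3, 5, 10]


def heatmap_configurations():
    """Every (cell type, chromatin-state k, structural k) combination."""
    return list(
        product(CELL_TYPES, CHROMATIN_STATE_CLUSTER_NUMBERS, STRUCTURAL_CLUSTER_NUMBERS)
    )


configurations = heatmap_configurations()
print(len(configurations))  # 72
print(configurations[0])    # ('GM12878', 10, 2)
```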
We plotted the distributions of the exponent parameters and radii of gyration of the mutual TADs shared by (1) structural cluster 3 and chromatin-state cluster 2 (86 mutual TADs), and (2) structural cluster 3 and chromatin-state cluster 12 (23 mutual TADs) (Fig. 9(d) and (e)). Compared with the other mutual sets, the TADs in these two mutual sets have smaller exponent parameters and larger radii of gyration, which may indicate that these TADs have a less compact 3D structure in addition to their depleted chromatin-state enrichment. We also plotted the gene density distribution (Fig. 9(f)), which shows that the TADs in these two mutual sets have clearly lower gene density.
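For readers unfamiliar with the radius of gyration, the following sketch computes it from the 3D coordinates of a structure as the root-mean-square distance of the points from their centroid. It assumes every bead (Hi-C bin) carries equal weight, which is an assumption of this example; the function name `radius_of_gyration` and the toy coordinates are hypothetical, and the exponent parameter is not covered here.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of 3D points from their centroid.

    `coords` is a list of (x, y, z) tuples, one per bead of the structure.
    All beads are weighted equally (an assumption of this sketch).
    """
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    squared = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    )
    return math.sqrt(squared / n)


# Two toy structures with the same number of beads: one packed, one stretched.
compact = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
extended = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(radius_of_gyration(compact) < radius_of_gyration(extended))  # True
```

A larger value for the same number of beads corresponds to a more spread-out, less compact structure, which is the sense in which the measure is used above.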
We next explored whether our observations result from heterochromatin or gene deserts. First, we downloaded the gap table for hg19 from the UCSC Table Browser, compared the heterochromatin and centromere gaps with the 2773 TADs from GM12878, and found that (1) only 15 TADs (see Additional file 1: Table S2 for details) overlap any heterochromatin or centromere region, and (2) only three of these 15 TADs belong to the two structural clusters (clusters 2 and 3 in Fig. 9) with depleted chromatin-state enrichment. Therefore, we think our observations are not related to heterochromatin or centromeres. Second, from Fig. 9 we can observe that most of the TADs have positive gene densities, indicating that most of the TADs do not lie in gene deserts. Therefore, we think our observations may not be related to gene deserts either.
We listed the chromosomes, coordinates, exponent parameters, and radii of gyration of the TADs in the overlapping sets between (1) chromatin-state cluster 2 and structural cluster 3 (Additional file 1: Table S3), (2) chromatin-state cluster 12 and structural cluster 3 (Additional file 1: Table S4), and (3) chromatin-state cluster 20 and structural cluster 3 (Additional file 1: Table S5). We gathered the coding genes located in the TADs of these three sets and ran a GO enrichment test using AmiGO2 (http://amigo.geneontology.org/rte). The enriched GO terms in the biological process ontology (BPO), cellular component ontology (CCO), and molecular function ontology (MFO) are also listed in the corresponding captions in Additional file 1.
TADKB defines the overlapping (common) TADs between a chromatin-state cluster and a structural cluster as a family. In this way, both structural and chromatin-state features of TADs are considered when grouping TADs into families. Each family is assigned a score. For the families constructed based on chromatin states, the score is the fraction of chromatin states whose log2 fold enrichment is positive. Note that a smaller score (e.g., < 0.5) indicates that, on average, the corresponding chromatin-state cluster or family is depleted of chromatin states. An example of the details of one family is shown in Fig. 10. We next calculated the average Hi-C heat map for each family. Since the sizes of TADs (number of bins) in each family vary, we cannot directly average their Hi-C matrices. Therefore, for each TAD we extracted a 30 × 30 Hi-C submatrix from the matrix of the whole chromosome by evenly extending smaller TADs (< 30 bins) and evenly reducing bigger TADs (> 30 bins). After that, all TADs’ Hi-C matrices had the same size (30 × 30), and we calculated the average Hi-C heat maps. The average heat map (log2 scale) can be found under Table of family members on the Family webpage. An example of the average heat maps of three families can be found in Additional file 1: Figure S1, which shows that the families differ in the patterns of their average Hi-C contact matrices.
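The two calculations in this paragraph can be sketched as follows. The first function computes the family score as the fraction of chromatin states with positive log2 fold enrichment. The second shows one possible reading of "evenly extending" and "evenly reducing": the TAD's bin range is widened or trimmed by the same amount on both sides until it spans exactly 30 bins. That reading, the half-open bin convention, the handling of odd differences, the shifting of windows that would cross a chromosome end, and the names `family_score` and `fixed_window` are all assumptions of this sketch rather than details confirmed by the text.

```python
def family_score(fold_enrichments):
    """Fraction of chromatin states with positive log2 fold enrichment.

    For non-negative fold enrichments, log2(fe) > 0 is equivalent to fe > 1,
    so the comparison is made directly and a value of 0 is handled safely.
    """
    positive = sum(1 for fe in fold_enrichments if fe > 1)
    return positive / len(fold_enrichments)


def fixed_window(start_bin, end_bin, size=30, n_bins=None):
    """Return a half-open bin range (lo, hi) of exactly `size` bins.

    The TAD occupies bins [start_bin, end_bin). Smaller TADs are extended and
    larger TADs are trimmed by (nearly) equal amounts on both sides; when the
    difference is odd, the extra bin goes to the right-hand side.
    If `n_bins` (chromosome length in bins) is given, the window is shifted
    so that it stays inside the chromosome.
    """
    difference = size - (end_bin - start_bin)  # > 0: extend, < 0: trim
    lo = start_bin - difference // 2
    if n_bins is not None:
        lo = min(lo, n_bins - size)
    lo = max(lo, 0)
    return lo, lo + size


print(family_score([2.0, 0.5, 1.5, 0.25]))  # 0.5
print(fixed_window(100, 120))               # (95, 125): extended by 5 bins per side
print(fixed_window(100, 140))               # (105, 135): trimmed by 5 bins per side
```

With every TAD mapped to a window of the same size, the corresponding submatrices can be averaged element by element to give the family heat map.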
acrossCells component
We defined acrossCells as a pair of TADs from the same species that are found in different cell lines but lie on the same chromosome with the same coordinates (start and end positions). We provide acrossCells for the 15 pairwise combinations of the six human cell types with available chromatin-state annotations. For each pair of acrossCells TADs, we computed the Pearson’s correlation coefficient between their fold enrichments of 25 chromatin states and the TM-score between their MDS-inferred 3D structures. The distributions of these two similarity measures can be found in Fig. 11, which shows that the acrossCells found in HMEC and NHEK have very similar enrichment patterns of chromatin states, and that acrossCells always have very high TM-scores. An example of the TADKB webpage showing the acrossCells between GM12878 and HMEC can be found in Fig. 12. Users can also browse the two dynamic Hi-C heat maps side by side for TAD pairs in acrossCells by clicking the chromosome column in the selected acrossCells TADs table. An example can be found in Additional file 1: Figure S2.
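The definition of acrossCells amounts to intersecting the TAD sets of two cell types on the key (chromosome, start, end). The sketch below enumerates every pair of cell types and collects the shared TADs; the function name `across_cells`, the dictionary layout, and the toy coordinates are hypothetical. With six cell types, the enumeration yields the 15 pairs mentioned above.

```python
from itertools import combinations


def across_cells(tads_by_cell):
    """Map each pair of cell types to the TADs they share.

    `tads_by_cell` maps a cell-type name to an iterable of
    (chromosome, start, end) tuples. Two TADs match only if the chromosome
    and both coordinates are identical.
    """
    shared = {}
    for cell_a, cell_b in combinations(sorted(tads_by_cell), 2):
        common = set(tads_by_cell[cell_a]) & set(tads_by_cell[cell_b])
        shared[(cell_a, cell_b)] = sorted(common)
    return shared


# Toy input with made-up coordinates for two of the six cell types.
toy = {
    "GM12878": [("chr1", 0, 500_000), ("chr1", 500_000, 1_200_000)],
    "HMEC": [("chr1", 500_000, 1_200_000), ("chr2", 0, 800_000)],
}
print(across_cells(toy))
```

Each shared TAD can then be scored with the Pearson correlation of its two chromatin-state profiles and the TM-score of its two 3D structures.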
TAD search component
A user can submit gene (protein-coding gene or lncRNA) names, IDs, or query DNA sequences, and TADKB will search them against its in-house sequence sets and return the matched genes and their associated TADs. Figure 13 shows the search web page. If hits are found, TADKB displays the TADs that contain the hit sequences, as shown in Fig. 14. The user can then click one of the TADs to browse its detailed information.
Downloading component
The download component allows users to download the 90 TAD annotation files described in Additional file 1: Table S1 and the 72 heat maps from the overlapping TAD analysis between chromatin-state and structural clusters.