Expressed LncRNAs
To investigate the triplex-forming sites of lncRNAs in a cell line of interest, we only considered the expressed lncRNAs in that cell line (Methods). This yielded 2,072 lncRNAs which were expressed in at least one of the seven human cell lines. We found 970, 853, 199, 773, 760, 322, and 325 lncRNAs which were expressed in cell lines GM12878, H1ESC, HMEC, HUVEC, HeLa, IMR90, and NHEK, respectively. To investigate the expression patterns of these 2,072 lncRNAs, their TPM values across the cell lines were clustered using Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH) algorithm [21] (Fig. 1A). This revealed nine clusters of lncRNAs with distinct expression patterns (Fig. 1A). There were seven clusters, each one exhibited amplified expression in exactly one unique cell line (clusters I, II, III, V, VII, VIII, and IX for IMR90, HUVEC, H1ESC, HeLa, GM12878, HMEC, and NHEK, respectively) (Fig. 1A). There were only two clusters (clusters IV and VI) that showed nonspecific expression patterns (Fig. 1A). These observations resonate with previous reports of high cell and tissue specificity of lncRNAs [22, 23].
Thousands of LncRNA:DNA triplex-forming sites
To determine the lncRNA:DNA triplex-forming sites (TFSs) of expressed lncRNAs for each cell line, we aligned the lncRNA sequences to the hg19 genome using triplexator tool [24] restricting the length of the triplex structures to a minimum length of 20 bp. Interestingly, about 54 % (1110 out of 2072) of lncRNAs did not form lncRNA:DNA TFSs (Fig. 1B). The remaining 962 lncRNAs which formed at least one lncRNA:DNA TFS fell into two main categories: the first group (17 % or 361 lncRNAs) had less than 50 TFSs, and the second group (13 % or 275) had more than 5000 TFSs (Fig. 1B). LncRNAs use short regions within their sequence to form the triplex structures with the double-stranded DNA. We call such regions Triplex forming domains (TFDs). The alignment results by triplexator tool contain information on the portions of lncRNAs that bind to the DNA. We found that even though lncRNAs have the potential to form many triplex sites throughout the genome, they had very few triplexes forming domains (TFDs) within their sequence (Fig. 1C). Out of the 962 lncRNAs which have TFDs, 541, 221, and 82 had 1, 2, and 3 TFDs, respectively (Fig. 1C). The majority of the TFDs have a length ranging between 20 nucleotides and 30 nucleotides (Fig. 1D). These results indicate that lncRNAs may harbor one or two specific short sequences (TFDs) that allow them to anchor to many sites in the DNA via a lncRNA: DNA triplex-forming mechanism.
Next, we checked the relationship between the triplex-forming potential of lncRNAs and clusters identified in Fig. 1A. We found no statistically significant dependence between the number of TFDs and TFSs of lncRNAs, length of the TFDs with their expression pattern identified in the 9 clusters (p-value > 0.08 using ANOVA test) (Fig. 1E and F, and G) suggesting a triplex-forming mechanism as a general mechanism followed by lncRNAs across multiple cell lines.
Triplex-forming sites are enriched within topologically associated domains, their boundaries, and loop anchors more than expected, but they are evenly distributed across TADs
Next, we investigated the positions of TFSs relative to TADs to detect any positional preference. Genomic coordinates for TADs, their boundaries, and loop anchors were acquired from a previous study [1, 2] (Methods). In the seven cell lines, the TAD boundaries and loop anchors constitute a small fraction of the genome (between 1 and 6 %). In the majority of the cell lines close to 50 % of the genome is covered by TADs (Table S1 in Additional file 1). In IMR90 and H1ESC cell lines, about 65 and 83 % of the genome are covered by TADs, respectively (Table S1 in Additional file 1). To assess whether the lncRNA:DNA TFSs are enriched in TADs, we computed the observed coverage (or number of base pair overlaps) of TADs with the TFSs (Fig. 2A). Because of different coverages of the genome by TADs, we performed this separately for the cell lines. An expected coverage was generated by randomly positioning the TFSs within the genome and computing the coverage of this random set with the TADs Fig. 2A. This random shuffling was performed 1000 times, for each shuffled set; an expected coverage was obtained to generate a distribution of expected coverage. These distributions followed a normal distribution for all the seven cell lines (Anderson-Darling normality test: p-value > 0.01, Table S2 in Additional file 1). We found that in all the seven cell lines, the observed coverage of TFSs of lncRNAs with TADs was significantly higher than the expected coverage (p-value < 10− 16) (Fig. 2B and Fig. S1 in Additional file 1). Similarly, the observed coverage of TFSs with boundaries of TADs (Fig. 2C and Fig. S2 in Additional file 1) and loop anchors (Fig. 2D and Fig. S3 in Additional file 1) were significantly higher than the expected coverage in all the seven cell lines.
Next, we checked if there was a positional preference of the TFSs at specific locations across a TAD. This can inform if TSSs prefer regions close to the boundaries or away from them. For this, each TAD was divided into five bins of equal length. The frequencies of TFSs in the bins were computed. The TFSs were positioned randomly within the entire genome and frequencies of randomized regions in the five bins were also computed. We found that the TFSs were roughly evenly distributed across the entire length of a TAD (Fig. 2E, Fig. S4 in Additional file 1) and not significantly different from the random control (p-value > 0.1 using Kolmogorov-Smirnov test). This indicates no significant preference for TSSs for any specific region across a TAD.
Triplex-forming sites occupancy correlates with the size of domains and is positioned distant from CTCF binding sites
Next, we explore the relationship between the number of TFSs and the size of TADs. For this, we first normalized the coverage of TFSs in a TAD by the size of the TAD. Then the normalized coverages were compared to the corresponding sizes of TADs. There was a small negative linear correlation between the normalized coverage of TFSs and the size of TADs (Fig. 3A). When the same analysis was performed with the randomly positioned TFSs, no correlation between the normalized coverages and sizes of TADs was observed (Fig. S5 in Additional file 1). This suggests that TFSs are present in smaller domains at a moderate density compared to larger domains. CTCF is an insulator binding factor and has been linked to different properties of the 3D chromatin organization. To check the relationship between CTCF binding sites and TFSs, we computed a histogram plot of the distances between closest pairs of CTCF binding sites and TFSs using four bins (Fig. 3A). We also positioned the binding sites of CTCF randomly to obtain a set of randomized locations. The same histogram plot was constructed using the closest pairs of TFSs and randomized CTCF sites. We found no statistical difference between the two histogram plots (Chi-Square Test, p-value > 0.1 for all cell lines) (Fig. 3B) most likely because CTCF are preferred near boundaries while TFSs are roughly evenly distributed across TADs. Next, we investigated the densities of TFSs in different functional genomic elements (Methods). We found that the highest density of TFSs was in promoter or intronic regions with 2 TFSs for every 10 kb of a promoter or intronic region. Intergenic regions had a comparable (but slightly lower) density of 1.8 TFSs for every 10 kb interval. (Fig. 3C) compared to TFSs densities of 1.3, 0.5, and 0.2 TFSs for every 10 kb 3’UTR, 5’UTR, and exonic regions, respectively. The enrichment of TFSs in other functional genomic elements such as intergenic, intronic, and 3’UTR regions (not just in promoter regions) indicates a broader role of TFSs beyond direct gene regulation via protein-complex transportation to promoters.
Triplex forming sites within TADs that are shared in many cell types are associated with early development processes
Next, we focused on the TFSs, which occur within domains present in all the 7 sets of human TADs. We required such TFSs to occur within a domain in each of the seven sets of human TADs. Pooling together the TFSs from all the cell lines that overlapped with at least one domain yielded 571,832 unique sites. Out of this, 17,589, 55,851, 7650, 1150, 8551, 7369, 4055 sites were specific to domains belonging to GM12878, H1, HeLa, HMEC, HUVEC, IMR90, and NHEK cell lines, respectively. 81, 864 sites occurred within a domain present in each of the seven sets of human TADs. One should note that the domains within which 81, 864 sites occur might have different boundaries across two different cell lines. Gene ontology was performed on the genes (5,662) nearest to the 81, 864 sites, revealing associations with development terms and immune system-related terms such as somatic stem cell maintenance, aorta development, Fc receptor signaling, blastocyst development, trophectodermal cell differentiation (Table S3 in Additional file 1).
“TAD-lncRNAs”: LncRNAs as predictors for topologically associated domains
If lncRNA:DNA TFSs are important and enriched features in TADs, they can serve as predictors to differentiate between TADs and other regions. For this, a background set of genomic intervals that were similar in size (a number equal to the number of TADs) was generated by randomly selecting from the genome (excluding the original TAD locations) (Fig. 4A). The TFSs of expressed lncRNAs were also identified in this background set separately for each cell line (Fig. 4A). We used four different feature-based machine learning models to predict the class label of a region of interest (“TAD” or “non-TAD”) by using the frequency of TFSs of expressed lncRNAs in the region as features (Fig. 4B). The models were tuned using a 5-fold cross-validation approach while varying the appropriate model parameters on a training set (80 % of the total pool of data) (Fig. 4C and Table S4 in Additional file 1). Using five different evaluation metrics on the test set, the best performing model was selected (Methods) (Fig. 4D). In this approach, we excluded the H1 cell line because about 83 % of the genomic regions are located within TADs.
On average, the Random Forest model performed the best with an average accuracy of 74 % across the cell lines (Fig. 4D and Table S5 in Additional file 1). The best accuracy achieved were 71.58 %, 71.48 %, 71.20 %, 68.09 %, 70.58 %, and 76.70 % for cell lines GM12878, HeLa, HUVEC, HMEC, NHEK, and IMR90, respectively (Table S5 in Additional file 1 and Fig. 4D). While the best Area Under the Curve (AUC) achieved was 0.81, 0.77, 0.80, 0.68, 0.77, and 0.84 for cell lines GM12878, HeLa, HUVEC, HMEC, NHEK, and IMR90, respectively (Table S5 in Additional file 1 and Fig. 4D). These results show that TFSs of lncRNAs are important and enriched features in TADs and can be used as predictors to discriminate TADs from other regions.
Next, we aimed to identify important “TAD-lncRNAs” which were top predictors in the model performance. To do so, we assigned an “importance score” to each lncRNA based on its discriminating power in the Random Forest model using the “target shuffling” method (Methods). The top 10 “TAD-lncRNAs” for each cell line are shown in Table S6 (Additional file 1). We highlight one particular TAD-lncRNA predictor called DANCR or ENSG00000226950.6 (Fig. 4E) in cell line GM12878. The dominant isoform of DANCR with GENCODE id ENST00000444958.1 is 709 bp long and has a single 23 bp long triplex-forming domain at its 3’ end (Fig. 4E). The triplex-forming domain is rich in T bases and has 2,953 TFSs within TADs (Fig. 4E). Most of the triplex sites of DANCR are either in intergenic (44 %) or intronic regions (53 %) (Fig. 4F). Gene ontology analysis of genes closest to the triplex-forming sites of DANCR showed top enrichment in the regulation of GTPase activity (Fig. 4 G) and closely related terms such as JUN kinase activity. Some target genes include Rho GTPase Activating Proteins such as Arhgap36, and Arhgap40; Fibroblast growth factors such as Fgf3, and Fgf9. Enrichment in multiple pathways related to cancer such as Ras, Wnt, ErbB, and MAPK (Fig. 4G) was also observed. Some of the important target genes were Wnt2b, Wnt5b, Wnt5A, Wnt8a from the Wnt pathway, and Fgf9, Mapk1, Pak2, Igf1, Rasa2 from the Ras pathway. If TFSs are used as anchors by TAD-lncRNAs to mediate chromatin organization, their deregulation such as DANCR can disrupt the formation of TFSs and may alter chromatin organization. Consequentially, it may contribute to diseases including cancer.