Exploiting DROSHA knockout cells to analyze intergenic miRNA gene structure
To identify the transcription start sites of intergenic miRNAs, we utilized the DROSHA knockout human colorectal cancer cell lines that we established recently [18]. In contrast to the cells with intact DROSHA, in which the pri-miRNAs cleaved rapidly, the pri-miRNAs in the DROSHA knockout cells showed increased accumulation (Additional file 1), which enabled us to identify the gene structure of intact pri-miRNAs more easily. We extracted total RNA from the DROSHA knockout cells and their corresponding parental wild-type cells. As pri-miRNAs are known to be polyadenylated [19], we enriched the polyadenylated RNAs and used them for next-generation sequencing (Fig. 1a).
From the analysis of the sequencing results, we confirmed that the number of reads from genomic regions encompassing miRNA hairpin sequences was highly increased in DROSHA knockout library compared to wild-type library (Fig. 1b and Additional file 2). Compared to the sequencing reads obtained from the library made using wild-type cells, those from the DROSHA knockout library delineated the transcribed region more reliably (Fig. 1c and Additional file 3). Notably, the graph representing the sequencing reads from the DROSHA knockout library showed a prominent boundary at the 5′ end of the transcribed region (Fig. 1c and Additional file 3). Therefore, it can be expected that pri-miRNA transcription begins at this 5′ end site and it is possible to pinpoint the transcription start site based on the RNA-seq data from DROSHA knockout cells.
Precise mapping of the transcription start sites of miRNAs
To pinpoint the transcription start site of pri-miRNAs, we employed the RACE technique; to select miRNA candidates for RACE, we applied several criteria. First, we selected candidates among 274 miRNAs that are included in the list of authentic miRNAs, which we had reported previously [20]. Briefly, we carefully selected these authentic miRNAs from the whole miRBase entries, by observing their expressions from a large number of sequencing libraries, analyzing the homogeneity at 5′ termini of their sequences, and curating them through manual inspection based on literature; therefore, they are expected to be biologically important. These miRNAs comprise 175 pri-miRNAs, since clustered miRNAs in close genomic proximity are transcribed as a single transcript (Additional file 4). The co-transcription of miRNA clusters that we selected for RACE was confirmed by literature search, expressed sequence tags (ESTs) analysis, or PCR experiments (Additional files 4 and 5). Second, by inspecting the graphs of sequencing reads (Fig. 1c and Additional file 3) we selected only those pri-miRNAs whose expression signals were detected at high levels in the HCT116 cells that we tested (see Methods). In addition, only the pri-miRNAs whose reads were enriched in DROSHA knockout library compared to wild-type library were selected. However, the pri-miRNAs that overlap with protein-coding genes in the genome were excluded from this inspection. After filtering, 34 pri-miRNAs that met our criteria were selected (Additional file 4). When we compared the graphs of sequencing reads of the candidate pri-miRNAs between DROSHA knockout and wild-type libraries, most of the pri-miRNAs showed enrichment of reads in the regions containing pre-miRNA hairpin sequences (Fig. 1c and Additional files 3 and 4). We assumed that these enriched regions were the transcribed regions of pri-miRNAs and designed PCR primers for their 5′ end cloning. By performing 5′ RACE, we could pinpoint the 5′ ends of 29 pri-miRNAs, which account for 60 mature miRNAs. We annotated these 5′ end nucleotides as the transcription start sites of pri-miRNAs (Fig. 2a).
The calculated distances between the identified transcription start sites and the 5′ end nucleotides of pre-miRNAs sequences ranged from less than 1 kilobase (kb) to more than 200 kilobases (kbs). For example, the transcription of miR-200c, which is a member of the pri-miR-200c ~ 141 cluster (tilde indicates transcript spanning the indicated gene cluster), is initiated 472 base pairs (bps) upstream of the hairpin sequence. The genomic distances between the transcription start sites and pre-miRNA hairpins for 16 out of 29 pri-miRNAs are less than 10 kbs (Fig. 2a). In some cases, however, the transcription start site was identified more than 100 kbs upstream of the pre-miRNA sequences (pri-miR-222 ~ 221 and pri-miR-29b-1 ~ 29a). We could not identify any factor that might affect the distances between the transcription start sites and the position of pre-miRNAs.
We analyzed the distribution of RNA polymerase II (Pol II) and TATA-binding protein (TBP) signals near the transcription start sites of pri-miRNAs using the public data from the Encyclopedia of DNA Elements (ENCODE) project [21]. As a control, we compared the distribution to that of Pol II and TBP signals near the transcription start sites annotated for the RefSeq mRNAs. The distribution graphs of Pol II and TBP were similar between pri-miRNAs and RefSeq mRNAs with the signals made a peak near the transcription start site (Fig. 2b and c). This suggests that Pol II and TBP bind near the transcription start sites of pri-miRNAs, similar to the case in mRNAs, and initiate transcription from this region in consistent with our RACE results. Moreover, the transcription start sites of pri-miRNAs clearly associated with CpG islands in a similar pattern with that of RefSeq mRNAs. We also observed the cap analysis gene expression (CAGE) data near transcription start sites of pri-miRNAs and RefSeq mRNAs, and found that the peak of CAGE signals exactly matched with the transcription start sites in both cases (Fig. 2b and c) [21]. The promoter-associated histone modifications including H3K4me3 and H2A.Z were also enriched near the transcription start sites of miRNAs. Overall, these data show that the pri-miRNAs analyzed in this study actually initiate their transcription from the positions obtained from our RACE analysis.
Analysis of transcription factor binding to miRNA promoters
Although the regulatory relationship between miRNAs and their target mRNAs has been studied extensively, the transcriptional regulation of miRNAs themselves by transcription factors is not well understood. To analyze the transcriptional regulation of miRNAs, we utilized the data generated by the ENCODE project, which contains the chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for 161 transcription factors [21]. To identify the transcription factors that may regulate miRNA expression, we selected the genomic region spanning the promoter of the pri-miRNA, from −2000 to +500 (i.e., 2000 nts upstream to 500 nts downstream of the transcription start site) as the binding region of transcription factors. Although some transcription factors regulate transcription from a distance, most proximal binding events of functional transcription factors can be captured in this region. Consistent with this notion, the binding sites of transcription factors have been found to be concentrated in this region [22–25]. After collecting promoter sequences of pri-miRNAs, we looked for ChIP-seq signals for transcription factors within this region. As a control, we collected the sequences in corresponding regions near the transcription start sites of RefSeq mRNAs.
We found the binding sites of diverse transcription factors to be clustered in the promoter region of pri-miRNAs that we selected (Figs. 1c and 3a). Notably, the transcription start sites of all selected pri-miRNAs contained the ChIP-seq signal for Pol II (POLR2A) in this region. In contrast, about a quarter of the promoters of RefSeq mRNAs did not contain the ChIP-seq signal for Pol II (Fig. 3a). In addition to Pol II, other general transcription factors, such as TBP and TAF1, also showed a higher fraction of binding to pri-miRNA promoters than to RefSeq mRNA promoters. It is possible that the RefSeq mRNAs without ChIP-seq signal for general transcription factors at their promoters might be expressed at lower levels in the cells. When we divided the RefSeq mRNAs into four classes based on their expression levels, the group of highly expressed mRNAs also associated with the general transcription factors at a higher degree (Additional file 6).
Upon comparing the binding fractions of transcription factors between the pri-miRNA and RefSeq mRNA promoters, we found that several transcription factors were significantly enriched in pri-mRNA promoters (Fig. 3b). Among the highly enriched factors, there were many that were functionally related; for example, the transcription factor MAFF was found to bind to the promoters of miRNAs to which another transcription factor MAFK also bound (Additional file 7). Based on this data and previous reports that showed MAFF and MAFK to be members of the same protein family, we suggest that they may work together to regulate the expression of common miRNAs [26]. Interestingly, several transcription factors including TFAP2A, EBF1, and STAT3, were also included in the list of enriched transcription factors for intergenic miRNAs from a previous study [9].
It is plausible that a similar set of transcription factors would bind to the promoters of miRNAs that need to be regulated together. To identify such pairs of miRNAs, which might be regulated by common transcription factors, we calculated the correlation value between transcription factor binding to promoters of both the pri-miRNAs of each pair; these correlation values differed considerably among pri-miRNA pairs (Fig. 3c, see Methods). The correlation values between paralogous pri-miRNAs, i.e., between pri-miR-181a-1 ~ 181b-1 and pri-miR-181a-2 ~ 181b-2, or between pri-miR-29b-1 ~ 29a and pri-miR-29b-2 ~ 29c, were high (0.60 and 0.64 respectively), as expected. It is possible that they are under the control of similar sets of transcription factors because they had originated from the same ancestor gene. Interestingly, some miRNA pairs, which did not have any sequence homology, showed higher correlation values than did paralogous pri-miRNAs pairs; for example, pri-let-7a-1 ~ let-7f-1 ~ let-7d and pri-miR-30d ~ 30b showed correlation value of 0.83, which suggests that these pri-miRNAs are under the transcriptional control of a highly overlapping set of transcription factors (Fig. 3c). To test whether the high correlation of transcription factor binding to pri-miRNA promoters results in a high correlation of the expression levels of the corresponding mature miRNAs, we selected the miRNA pairs from the top 20 % and the bottom 20 % of the list of pri-miRNA pairs placed in the order of their correlation values for transcription factor binding (Fig. 3c). We compared the correlation of each pri-miRNA pair with the correlation between the expression of the corresponding mature miRNAs, which was determined using the expression profiles of mature miRNAs from diverse tissues [27]. Interestingly, the pri-miRNA pairs from the top 20 % of the list also showed higher correlation between the expression of their mature miRNAs (Additional file 8), suggesting that they are indeed under similar transcriptional control; as a result, the mature miRNAs tend to be expressed simultaneously from these pri-miRNAs.
Expression analysis of the host genes of intronic miRNAs
In the case of intergenic miRNAs, most of the regions covering pri-miRNA sequences were enriched by DROSHA ablation (Fig. 1c and Additional file 3). Surprisingly, however, analysis of sequencing reads of the host genes of intronic miRNAs showed an increase in the number of sequencing reads of only the introns that harbor miRNA hairpin sequences. For example, analysis of the genomic locus of EGFL7 gene, which is the host gene of a well-known intronic miRNA, miR-126, showed an increase in the number of sequencing reads of only the miRNA-containing intron from DROSHA knockout cells, compared to those from wild-type cells (Fig. 4a and Additional file 9). None of the other introns and exons showed a significant difference between wild-type and DROSHA knockout cell lines. These observations confirm our previous results that the splicing of only the miRNA-containing introns is delayed by DROSHA knockdown, while the production of mature mRNAs is not affected [4].
To determine whether this result applies to other intronic miRNAs, we analyzed the host genes of all intronic miRNAs that show detectable expression in the cells that we tested (Additional file 4). For most intronic miRNAs, we found a significant enrichment of sequencing reads of the miRNA-containing introns, although in some cases, this enrichment was not seen (Fig. 4b and Additional file 10). These results suggest that our previous observation can be generalized for most intronic miRNAs [4]. We also noted the enrichment of sequencing reads of the neighboring upstream intron, for some intronic miRNAs, albeit less significant. We do not understand the underlying mechanism of this phenomenon; however, one possible explanation is that the upper intron is also influenced by the protein complexes bound to the miRNA hairpin-containing intron. Interestingly, we found that the sizes of miRNA-containing introns with enriched reads in DROSHA knockout compared to wild-type, are slightly but significantly shorter than the sizes of miRNA-containing introns without reads enrichment (Additional file 11). In addition, the relative positions of miRNA-containing introns with enriched reads had a tendency to be located more downstream of host genes (close to 3′ end of genes). Further analysis is required to understand the crosstalk between miRNA processing and splicing reactions.