Identification of candidate structured ncRNAs
Promising ncRNA motifs were identified by initially applying our computational pipeline to fungal genome sequences present in NCBI RefSeq Release 29 [35], and later data from RefSeq Release 62 were incorporated into this study. Refinement of the list of representatives, sequence alignments and conserved sequence/structural models were completed with the sequence database as updated in RefSeq Release 75.
Briefly, the workflow for discovering structured ncRNAs (Fig. 1) involved extracting noncoding fungal DNA sequences as informed by pre-existing genome annotations, clustering similar sequence regions by using BLAST [36], and filtering to remove clusters that match known ncRNAs present in Rfam [37] or that have extensive protein-coding potential as judged by RNAcode [38]. This process yielded many “pre-candidate” motifs that required further analysis to assess their relative likelihoods of functioning as common fungal ncRNAs. Each pre-candidate was subjected to analysis by CMfinder, which in part uncovers evidence for nucleotide sequence covariation that is indicative of secondary structure formation. If nucleotides in a predicted base-paired stem frequently co-vary in a manner that suggests conservation of the stem, then the cluster of RNAs was considered a strong candidate for having ncRNA function. Iterative analysis of conserved sequences and substructures, augmented by the discovery of additional representatives, was conducted using Infernal [39] to yield a refined sequence and structure model for the ncRNA candidate.
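The covariation criterion described above can be illustrated with a minimal Python sketch. This is a toy check, not the statistical model that CMfinder actually uses. The function name `covariation_summary`, the ungapped toy alignment, and the rule that both positions of a pair must change are assumptions made for illustration only.

```python
# Watson-Crick and G-U wobble pairs that can form within an RNA stem.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covariation_summary(seqs, i, j):
    """Summarize columns i and j (0-based) of an ungapped toy alignment."""
    observed = [(s[i], s[j]) for s in seqs]
    paired = [p for p in observed if p in PAIRS]
    distinct = set(paired)
    # Two pair types covary only if BOTH positions differ (e.g. C-G vs U-A).
    # A-U vs G-U changes one side only and is merely a compatible change.
    covaries = any(
        a[0] != b[0] and a[1] != b[1] for a in distinct for b in distinct
    )
    return {
        "fraction_paired": len(paired) / len(observed),
        "distinct_pairs": len(distinct),
        "covaries": covaries,
    }

# Hypothetical hairpins: a 3-base-pair stem (columns 0-2 pair with 8-6)
# closing a 3-nucleotide loop.
alignment = [
    "GCAUUCUGC",
    "GUAUUCUAC",
    "GGAUUCUCC",
    "GCGUUCUGC",
]

for i, j in [(0, 8), (1, 7), (2, 6)]:
    print(i, j, covariation_summary(alignment, i, j))
```

In this toy alignment only the middle base-pair (columns 1 and 7) covaries. The outer pair is invariant and the inner pair shows a single-sided A-U to G-U change, which is compatible with the stem but is weaker evidence for it.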
Analysis of candidate structured ncRNAs
Each ncRNA candidate was evaluated based on the complexity of its conserved sequence and structure model, its common genomic locations, and its phylogenetic distribution, among other features. Below are described the most promising candidate motifs that likely serve as functional ncRNAs. The alignments of these high-ranking motifs in Stockholm format are presented as supplemental material (Additional files 2 through 16).
HDV self-cleaving ribozyme variants
The first member of the HDV ribozyme class [40, 41] to be reported was identified in the antigenomic sequence of the Hepatitis Delta Virus [42]. Since this initial HDV ribozyme discovery, numerous self-cleaving RNAs that conform to this same general consensus sequence and secondary structure model have been discovered [43,44,45]. Despite the relatively large size of this ribozyme, there are only a few highly conserved nucleotides interspersed among the four extended base-paired regions that form its unique nested pseudoknot architecture [46] (Fig. 2a). The vast majority of HDV ribozyme representatives have been discovered by using a computational strategy that only seeks RNAs conforming to this unique secondary structure, while largely ignoring the few conserved nucleotides [44, 45]. Thus, bioinformatics search methods that also take into consideration conserved nucleotides have the potential to uncover additional representatives.
Our computational pipeline has uncovered a total of 230 representatives of a motif with considerable similarity in sequence and secondary structure to the known members of the HDV self-cleaving ribozyme class (Fig. 2b). These RNAs are present among 26 fungal species. The most noteworthy differences between these new RNAs and the previously published consensus (Fig. 2a) are (i) a C-G base-pair typically represents the otherwise more general R-Y base-pair at the ribozyme cleavage site, (ii) the P4 stem is partially replaced by an E loop [47,48,49] RNA motif, and (iii) an additional nucleotide (most commonly a C residue) is inserted between the P2 and P3 stems. The latter two differences violate the criteria used previously [41, 44, 45] to find additional members of the HDV ribozyme class, which in part explains why these distinctive fungal examples remained undiscovered. Only 11 fungal HDV ribozyme examples that conform to the published consensus had been discovered previously, suggesting that members of the unusual variant type revealed by our search method predominate in fungi.
Given the distinctive features of the newly-found fungal representatives of this motif, and given their widely variable gene associations, we chose to determine whether members can cleave RNA. Bimolecular substrate-ribozyme complexes were constructed for two examples derived from different organisms. For example, a bimolecular construct based on a representative from Penicillium chrysogenum (Fig. 2c) promotes cleavage of the RNA substrate strand at the base of the P1 stem (Fig. 2d). This cleavage site precisely matches that expected for HDV ribozymes, based on the similarity between the predicted structure of the fungal RNAs and the previously-established architecture of self-cleaving ribozymes belonging to the HDV class [46].
Similar results were observed for another bimolecular RNA construct derived from the fungal species Aspergillus niger (Fig. 3a). Again, the cleavage site for this fungal HDV ribozyme variant corresponds to that expected for more conventional HDV ribozymes as determined by gel mobility (Fig. 3b) and analysis by mass spectrometry (Fig. 3c) of the substrate cleavage products. Moreover, the masses of the products are consistent with a ribozyme mechanism wherein a 2ˊ-oxygen atom serves as a nucleophile to attack the adjacent phosphorus center to yield cleavage products with 2ˊ, 3ˊ-cyclic phosphate and 5ˊ hydroxyl termini on the 5ˊ and 3ˊ cleavage fragments, respectively. This is the same mechanism used by all other small self-cleaving ribozymes discovered to date [50].
SDC motif
The SDC motif, represented by 34 distinct examples from 26 fungal species, typically forms a small hairpin with an 11-base-pair stem (Fig. 4a). However, SDC motifs can exhibit some variation in stem integrity and length (e.g. see Fig. 4b for the representative from N. crassa). Each representative is located immediately upstream of a SAM (S-adenosylmethionine) decarboxylase gene, suggesting that hairpin formation and the associated conserved nucleotides are important for regulation of the SDC gene. In N. crassa, there appear to be two uORFs (Fig. 4c), which are short peptide-coding regions located in some mRNAs a short distance upstream of a main open reading frame. Such uORF regions are commonly involved in controlling translation initiation of the adjoining gene [51, 52]. For example, in N. crassa, high arginine concentrations cause ribosomes to increase stalling within the arg-2 uORF, which reduces translation initiation at the main ORF located immediately downstream [53, 54]. Both arginine and newly synthesized uORF-derived peptides are required for ribosomal stalling.
The SDC enzyme catalyzes the decarboxylation of S-adenosylmethionine to provide propylamine for the production of polyamines, such as spermidine and spermine [55, 56]. The production of polyamines is known to be highly regulated by cells at multiple levels including transcription, translation, enzyme activation and protein degradation. At the translational level, mammalian and plant SDC expression is primarily regulated by the expression of uORFs located in their 5´ UTRs [55, 56], and to a lesser extent by secondary structures formed by the 5ˊ UTR [57]. Indeed, polyamine-responsive gene control very commonly involves uORF elements [58].
The two uORFs associated with the SDC motif of N. crassa are located near two consensus 5ˊ splice sites and one 3ˊ splice site (Fig. 4c). Given that some TPP riboswitches in fungi control both alternative splicing and uORF expression [55], we initially speculated that the SDC motif might be part of a complex set of regulatory elements that similarly controls SDC gene expression. To examine the biological function of the SDC motif, the 5′ UTR of the N. crassa SDC gene including a portion of the SDC ORF was fused in-frame with a luciferase reporter gene. This construct (Pre, Fig. 4c) was transformed into N. crassa cells and the presence of various RNA transcripts was examined by RT-PCR (Fig. 4d). The mRNA precursor was found to be efficiently alternatively spliced into two predominant forms called Sp-I, which contains two predicted uORFs, and Sp-II, which lacks these two uORFs (though the 3ˊ half of uORF 2 is still present) (Fig. 4c, d). Therefore, any effects on expression of the SDC ORF possibly caused by the uORFs cannot occur with the Sp-II form of the processed mRNA.
We then assessed the importance of the SDC motif hairpin to gene expression by examining luciferase activity in fungal preparations carrying the WT and M1 through M5 reporter-fusion constructs. Disruption of the P1 stem by changing one (M1), two (M3), or three (M5) base-pairs to mismatches causes a substantial increase in reporter gene expression relative to that observed with the WT construct (Fig. 4e). When additional mutations that restore base-pairing are introduced (M2, M4), the resulting constructs exhibit gene expression levels that approach that of the WT construct. These findings indicate that the SDC motif behaves as a negative regulatory element to inhibit SDC gene expression.
RT-PCR analysis on the variant constructs revealed that the disruptive SDC motif mutations M1 and M3 do not influence the levels of Sp-I and Sp-II mRNA products (data not shown), suggesting that the SDC motif regulates gene expression in a manner that does not directly involve alternative splicing. Moreover, we did not observe evidence for specific binding of spermine or spermidine to a representative of the RNA motif by using in-line probing [59, 60] assays (data not shown). Therefore, the mechanism of regulation by the SDC motif appears to differ from that observed for some fungal TPP riboswitches that regulate alternative splicing and the translation of uORFs [61]. Additional studies will be necessary to determine (i) if alternative splicing of the SDC precursor RNA is regulated, (ii) if uORF-mediated regulation of SDC gene expression occurs, (iii) how the SDC motif hairpin suppresses gene expression, and (iv) how SDC motif structure can be naturally manipulated to affect expression.
amd motif
A total of 23 unique examples of the amd motif (Fig. 5) have been identified among 20 fungal species. The consensus sequence and secondary structure model based on these sequences reveal that the motif likely adopts an elongated two-stem junction wherein the most-highly conserved nucleotides reside in the internal loop and in sections of P1 and P2 nearest to this loop. Moreover, the nucleotides immediately upstream of P1, along with several others extending into the internal loop, appear to code for a short uORF. This indicates that the ORF and overlapping RNA structure of the amd motif might collaboratively regulate the downstream gene. Of the 23 unique amd motif examples, nine members carry a uORF that codes for eight amino acids, and 14 members carry a uORF that codes for only six amino acids (Fig. 5). The putative peptides are highly conserved, and the last five amino acids of both groups carry the sequence A(V/F)(A/V)EL.
Each amd motif example resides upstream of a protein-coding gene of unknown function, making it difficult to formulate hypotheses regarding the function and mechanism of this putative ncRNA element and associated uORF. In more than nine species, the associated gene is annotated as coding for an amidase enzyme, whose function is to hydrolyze amide functional groups to yield ammonia and a carboxylic acid group. Given the presence of a glutamate codon in the penultimate position of the uORF, the amidase activity might be related to the production of this amino acid. For example, a shortage of glutamate might cause the uORF system to trigger expression of a glutaminase ORF to convert glutamine to glutamate. A previous attempt to create an amd gene (NCU05182) knockout in N. crassa was unsuccessful [62]. The heterokaryotic strain, but not the homokaryotic strain, survived, indicating that the gene associated with the amd motif is critical for cell survival. Given this experimental complication, we did not pursue additional validation studies.
ies6 motif
The ies6 motif (Fig. 6a) includes 34 examples from 20 species of fungi. This RNA motif forms an extended hairpin structure with one small and one large internal loop. The hairpin loop conforms to a GNRA tetraloop sequence [63], which is frequently found in structured RNAs. GNRA tetraloops can form tertiary interactions with tetraloop receptor structures [64, 65]. This fact, coupled with the extensive conservation on each side of the hairpin structure suggests that the consensus motif as presented might represent only a portion of a more complex RNA architecture.
These predicted RNA structures are almost always located adjacent to a gene coding for a protein similar to the chromatin remodeling complex subunit Ies6 (or Ino80 Subunit 6), which is involved in chromatin modification, chromosome segregation and the regulation of telomere length [66,67,68]. However, the RNA motif is depicted (Fig. 6a) with a polarity opposite to that of the ies6 mRNA, because this configuration retains a GNRA tetraloop, which is commonly found in structured RNAs. Therefore, we speculated that this motif is likely present in an antisense RNA produced from the same genomic location as the ies6 gene. To determine if both sense and antisense RNAs corresponding to the ies6 motif were produced by N. crassa cells, we conducted RT-PCR assays. As expected, RT-PCR product bands corresponding to both the sense and antisense RNA transcripts were observed by gel electrophoresis (Fig. 6b).
Antisense transcripts are found in many species, where they often participate in gene regulation. They are able to base-pair to their targets to regulate gene expression through various mechanisms including transcription attenuation, translation inhibition, primer maturation inhibition, splicing regulation, nuclear retention, and mRNA degradation and stabilization [69,70,71,72]. However, additional experiments will be required to determine if the ies6 motif is indeed part of an antisense RNA transcript, and what role this motif might play in regulation of antisense production and action.
hexA motif
The hexA motif (Fig. 7a) includes 19 examples from 10 fungal species. The predicted secondary structure includes a long hairpin interrupted by two small internal loops. In addition, a long region with sequence conservation lacking evidence of structure extends downstream. The motif is found in the putative 5´ UTRs or introns of genes annotated as encoding either hypothetical proteins (40%) or Woronin body [73, 74] protein Hex subunits (60%). Woronin bodies are fungal-specific organelles that plug the septal pores quickly to prevent a cell from losing its contents during physical damage. Hex subunit proteins are derived from differently spliced mRNA forms and are one of the major and essential components of the Woronin body [74, 75].
Intriguingly, a portion of the sequence forming the right shoulder of the hexA motif hairpin closely approximates the consensus for a 3ˊ splice site (Fig. 7a). Moreover, nucleotides within the long 3ˊ tail exhibit evidence that they code for protein. Specifically, the least conserved nucleotides in this region occur at every third position, which might correspond to the wobble position of codons. These characteristics suggest that the hexA motif RNA structure might regulate splicing to create an altered ORF sequence for translation. However, at least seven of the representatives of the hexA motif are in the opposite orientation relative to the genes encoding Woronin body proteins, whereas at least ten are in the same orientation. This variability in orientation weakens the hypothesis that the motif might be involved in gene regulation by controlling alternative splicing of the adjacent coding region.
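The every-third-position pattern noted above can be illustrated with a short Python sketch that averages column conservation separately for each of the three possible codon positions. The toy alignment and the function names are hypothetical. The sketch also assumes an ungapped alignment whose first column is in frame, which need not hold for real data.

```python
from collections import Counter

def column_conservation(seqs):
    """Fraction of sequences sharing the most common nucleotide, per column."""
    scores = []
    for column in zip(*seqs):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

def mean_conservation_by_frame(seqs):
    """Average conservation of columns at positions 0, 1 and 2 modulo 3."""
    scores = column_conservation(seqs)
    means = []
    for frame in range(3):
        values = scores[frame::3]
        means.append(sum(values) / len(values))
    return means

# Hypothetical aligned region in which every third column varies the most.
toy = [
    "GCUGAAUUA",
    "GCCGAGUUG",
    "GCAGAAUUA",
    "GCGGAGCUC",
]

frame_means = mean_conservation_by_frame(toy)
print(frame_means)
```

A markedly lower mean for one of the three positions, as for the third position of this toy alignment, is the kind of signal that is consistent with wobble-position variation in a coding region.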
SART-1 motif
The SART-1 motif consensus model (Fig. 7b) is based on 12 unique examples from 11 fungal species. The secondary structure can potentially include at least two hairpins, although only P1 is supported by extensive evidence of covariation and the frequent presence of a UNCG tetraloop element. RNA hairpin loops that conform to this consensus are known to be structurally stabilizing to adjoining base-paired regions [76, 77]. Nearly all representatives overlap the genomic positions of the 5´ UTRs of genes that produce a protein of unknown function similar to the mammalian SART-1 (squamous cell carcinoma antigen recognized by T cells) protein [78]. SART-1 is similar to the yeast Snu66 spliceosomal protein. One SART-1 RNA representative resides in the location of an annotated intron for the SART-1 gene. In mammals, the SART-1 gene encodes split ORFs that are possibly translated by a mechanism of -1 frameshifting.
Importantly, the SART-1 RNA motif is predicted to be formed by antisense transcripts of the associated gene. This hypothesis is supported by the fact that numerous G-U wobble pairs in the consensus model would otherwise become A-C mismatches if the RNA structure were formed by the sense transcript, which is unlikely. Furthermore, the commonly-occurring UNCG tetraloop structure in P1 would become a CGNA tetraloop in the sense direction, which is not known to confer the same structural benefit as a UNCG tetraloop.
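The strand argument above can be checked with a small Python sketch. The 12-nucleotide hairpin used here is a made-up example rather than a SART-1 sequence, and the helper names are hypothetical. The sketch shows only that G-U wobble pairs become A-C or C-A mismatches, and that a UNCG loop becomes CGNA, when the complementary strand is considered.

```python
# Pairs that can form in an RNA stem: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G", "N": "N"}

def reverse_complement(rna):
    """Return the sequence of the complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def stem_pairs(hairpin, stem_length):
    """Pair the k-th nucleotide from the 5' end with the k-th from the 3' end."""
    return [(hairpin[k], hairpin[-1 - k]) for k in range(stem_length)]

# Hypothetical hairpin: 4-base-pair stem with two wobble pairs, UUCG loop.
antisense = "GGUGUUCGCGUC"
sense = reverse_complement(antisense)

antisense_pairs = stem_pairs(antisense, 4)
sense_pairs = stem_pairs(sense, 4)
mismatches = [pair for pair in sense_pairs if pair not in PAIRS]

print(antisense_pairs)   # all four positions can pair
print(sense, mismatches) # the two wobble positions no longer pair
print(reverse_complement("UNCG"))
```

Watson-Crick pairs survive the strand swap, so only the wobble positions are informative about which strand forms the structure.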
AU-rich hairpin motif
Only 12 examples of the AU-rich hairpin motif (Fig. 7c) were identified from three fungal species, and eight of these representatives were from a single species, Rhizoctonia solani. Interestingly, two additional examples with considerable sequence and structural similarity were found in the bacterial genome of Orientia tsutsugamushi. These two bacterial examples were originally annotated as members of the mraW class of putative ncRNAs as reported previously [31]. However, only a small portion of the consensus sequence for mraW motif RNAs is similar to the consensus sequence derived from the fungal examples of the AU-rich hairpin motif. Therefore it is not certain whether these similarities are biologically relevant or coincidental.
The function of the bacterial mraW motif RNAs remains unknown, and unfortunately the gene associations for the fungal examples are highly variable. As a result there are no compelling clues regarding the possible biological roles of the fungal motif. If additional examples are found in the future, perhaps genomic location data might provide the insight necessary to better formulate testable hypotheses.
Atypical snoRNA motif
The atypical snoRNA motif (Fig. 7d) is represented by 81 examples from 52 fungal species. Most of the representatives are located in introns, with the exception of seven that appear to reside apart from introns. For example, in the yeast Schizosaccharomyces pombe (NC_003424.3), the motif resides in the intron of the cpc2 gene from nucleotides 2440331 to 2440419. In most instances, this RNA motif is associated with genes encoding guanine nucleotide-binding proteins (G proteins), and more rarely it is located near genes for proteins of unknown function or for sterol 24-C-methyltransferase. G proteins are important signal transduction components whose functions are well established in a diversity of signaling pathways [79, 80], although the functions of the specific G proteins associated with this ncRNA motif are unknown.
These fungal ncRNA candidates exhibit considerable similarity to certain snoRNAs, which guide chemical modifications on other RNAs, including ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and small nuclear RNAs (snRNAs) [81]. One type, the box C/D snoRNAs, has a conserved C box (RUGAUGA) and D box (CUGA). Additional conserved regions called the Cˊ box and Dˊ box mimic the sequence of the C box and D box, respectively. The atypical snoRNA motif examples we identified carry two regions that closely approximate the C box (AUGAUGY) and D box (CUGA), although the apparent Cˊ box (AUGAGAC) and Dˊ box (CAGA) consensus sequences correspond to the consensus snoRNA sequences more poorly.
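The box comparisons above can be made concrete with a brief Python sketch that counts incompatible positions between two consensus sequences written with IUPAC nucleotide codes and that scans a sequence for a consensus. The function names and the toy RNA are hypothetical. Treating two symbols as compatible when their nucleotide sets overlap is an assumption chosen for illustration, not a scoring scheme used in this study.

```python
import re

# IUPAC symbols used in the consensus sequences quoted in the text.
IUPAC_SETS = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "U": {"U"},
    "R": {"A", "G"}, "Y": {"C", "U"}, "N": {"A", "C", "G", "U"},
}

def count_mismatches(reference, query):
    """Count positions where two equal-length consensus strings cannot agree."""
    if len(reference) != len(query):
        raise ValueError("consensus sequences must have equal length")
    return sum(
        1 for r, q in zip(reference, query)
        if not (IUPAC_SETS[r] & IUPAC_SETS[q])
    )

def find_box(consensus, rna):
    """Return 0-based start positions of every match, including overlaps."""
    body = "".join(
        "[" + "".join(sorted(IUPAC_SETS[symbol])) + "]" for symbol in consensus
    )
    return [m.start() for m in re.finditer("(?=" + body + ")", rna)]

# Canonical boxes versus the boxes reported for the atypical snoRNA motif.
print(count_mismatches("RUGAUGA", "AUGAUGY"))  # C box
print(count_mismatches("RUGAUGA", "AUGAGAC"))  # C' box
print(count_mismatches("CUGA", "CUGA"))        # D box
print(count_mismatches("CUGA", "CAGA"))        # D' box

# Hypothetical RNA carrying one canonical C box and one D box.
toy_rna = "GCAUGAUGACUUACGAGCUGAGC"
print(find_box("RUGAUGA", toy_rna), find_box("CUGA", toy_rna))
```

Under this simple measure the reported C and D boxes differ from the canonical consensus at one and zero positions, whereas the apparent Cˊ and Dˊ boxes differ at three and one positions, respectively.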
RT-PCR was used to evaluate the production of the A. nidulans representative of this RNA class. Only the RT-PCR product corresponding to the spliced RNA was observed (data not shown), indicating that the intron carrying the atypical snoRNA is efficiently removed from the original transcript. This result might indicate that the intron is always removed, rather than undergoing regulation by the structured element in the intron. If true, then this atypical snoRNA might have a function similar to other snoRNAs. Consistent with this hypothesis is the fact that the highly-conserved nucleotides in the guide sequence region of the atypical snoRNA are complementary to regions within 5.8S rRNA, 18S rRNA, and 28S rRNA, suggesting that members of this ncRNA class direct modifications to these rRNAs.
Group I ribozymes
Bioinformatics searches that are guided by consensus sequences and structure models for known RNA classes are likely to miss many representatives, particularly those that can vary considerably from the consensus model. Searches that can uncover conserved sequences and structures without relying on pre-existing consensus models can reveal distant variants of known ncRNA classes. As noted above, our bioinformatics search strategy already has revealed numerous additional representatives of the HDV class of self-cleaving ribozymes (Fig. 2) and members of an atypical snoRNA class (Fig. 7d). Similarly, we have identified 208 examples of what appear to be previously unannotated group I self-splicing ribozymes [82] that are present in 114 fungal species (Fig. 7e).
Numerous additional examples of this motif were uncovered in bacteria by searching for sequences conforming to the resulting consensus model based on these fungal representatives (unpublished observations). Most of these newly-found representatives carry readily recognizable structural elements of group I ribozymes, including stems P3 through P7, and the conserved guanosine binding site. Since there is considerable sequence and structural variability at both the 5ˊ and 3ˊ termini of group I ribozymes, we did not further examine each representative to determine if they also carry stems P1, P2, P9, and P10 that are typical of this ribozyme class. However, these RNAs are most likely group I ribozymes that have previously escaped annotation in the genomes we analyzed.
rps0 motif
We identified many distinct types of candidate structured RNA domains in close association with fungal genes coding for ribosomal proteins. The rps0 motif is one such example (Fig. 8a), and it includes 41 representatives from 25 fungal species. Ribosomal protein genes commonly use a feedback auto-regulation mechanism to regulate their expression in bacteria [83,84,85]. These systems involve the binding of the ribosomal protein to a special RNA structure commonly located in the 5ˊ UTR of its corresponding mRNA. For example, E. coli uses at least 12 distinct RNA structures to regulate the expression of numerous ribosomal proteins [86]. This general mechanism is also used by some eukaryotic species for regulating pre-mRNA splicing [87] or translation [88]. In S. cerevisiae, ribosomal protein L32 (RPL32) binds to a structured RNA formed by intronic and exonic sequences within its own mRNA to cause alternative splicing. Moreover, the spliced product can also fold into a very similar structure that still binds the RPL32 protein and inhibits translation [89, 90].
The rps0 motif identified in the current study consists of two hairpins, one of which carries two conserved 5´-GGGGAAAG sequence elements partly located on each side of an internal loop. All the representatives of this motif are located in the 3´ UTRs of genes encoding 40S ribosomal protein subunit S0 (RPS0). RPS0 proteins are required for the processing of the precursor of 18S rRNAs and the formation of active 40S ribosomal subunits [91]. Given the apparent symmetry of the rps0 RNA sequence and its location adjacent to the rps0 gene, it seems possible that the motif might bind two or more RPS0 proteins to regulate expression of this ribosomal protein factor.
rps2 motif
The rps2 motif (Fig. 8b) is represented by 13 examples from nine fungal species belonging to four genera: Pichia, Candida, Lodderomyces and Clavispora. All examples are located in the 5′ UTRs of the genes encoding 40S ribosomal protein S2. Although there are few examples, there are four predicted base-pairs distributed between P1 and P2 that covary in a manner consistent with the predicted secondary structure.
rps20 motif
There are 20 representatives of the rps20 motif (Fig. 8c) derived from 13 species distributed among the Saccharomyces, Candida, Lachancea, Kazachstania and Tetrapisispora genera. All the representatives are located in the 3′ UTRs of the genes encoding 40S ribosomal protein S20. Again, given the limited number and distribution of members of this motif, we observe only two base-pairs with evidence of covariation in the proposed P3 stem. The other two predicted base-paired regions are formed by highly-conserved sequences and therefore further evidence of their formation is currently lacking.
rpl7-l8-s3 motif
The rpl7-l8-s3 motif (Fig. 8d) forms two long hairpins with internal loops. A total of 29 representatives from 25 species are located mostly in the introns of genes encoding 60S ribosomal proteins L7 and L8, as well as the 40S ribosomal protein S3. L7 and L8 proteins are required for the processing of 60S pre-rRNA, which includes the peptide bond formation center of 28S rRNA. The motif includes a well-conserved purine-rich region located between two internal loops of P2, which is similar to a previously reported ribosomal protein binding site [92]. However, this previously reported motif in S. cerevisiae, which is located in the pre-mRNA for ribosomal protein L32, is proposed to be bound by the L32 protein as an unpaired bulge. In contrast, the conserved purine-rich region of the rpl7-l8-s3 motif is predicted to be part of an extended base-paired structure that includes evidence for some nucleotide covariation, implying that this region might not exist as unpaired RNA.
rpl7 motif
The rpl7 motif (Fig. 8e) is represented by 22 examples from 17 species. It is a small motif of approximately 30 nucleotides containing some conserved nucleotides in and around a central loop. All examples of this RNA motif are located in the introns of 60S ribosomal protein L7 mRNAs. In the absence of L7, other ribosomal proteins such as L6, L14, L20, and L33 are greatly diminished [93], and so this motif might be important for the coordination of ribosomal protein production.
We also used RT-PCR to test whether splicing occurs. If so, the motif might participate in regulating splicing events, as has been observed for an RNA structure associated with S. cerevisiae ribosomal protein L32 (RPL32) [90]. RT-PCR analysis (Fig. 8e) reveals that the L7 pre-mRNA that contains the rpl7 motif produces at least one major splicing product. This result, along with the location of the rpl7 motif in introns, is consistent with the hypothesis that the rpl7 motif might regulate pre-mRNA splicing, perhaps by directly binding to the L7 ribosomal protein.
rpl30 motif
The rpl30 motif (Fig. 8f) is represented by 47 examples from 26 species. This motif is composed of a long 5ˊ region with little evidence for structure formation, followed by a region that appears to form a large hairpin structure with a well-conserved purine-rich internal loop. Most representatives are located in the 5´ UTRs of genes encoding 60S ribosomal protein subunit L30. Notably, expression of S. cerevisiae ribosomal protein L32 is autogenously regulated by a mechanism wherein the protein binds to a purine-rich internal loop [92]. This precedent strengthens the working hypothesis that the rpl30 motif is a regulatory RNA structure that serves as a binding partner for ribosomal protein L30 for expression autoregulation.