Computational RNomics of Drosophilids
© Rose et al. 2007
Received: 15 January 2007
Accepted: 08 November 2007
Published: 08 November 2007
Skip to main content
© Rose et al. 2007
Received: 15 January 2007
Accepted: 08 November 2007
Published: 08 November 2007
Recent experimental and computational studies have provided overwhelming evidence for a plethora of diverse transcripts that are unrelated to protein-coding genes. One subclass consists of those RNAs that require distinctive secondary structure motifs to exert their biological function and hence exhibit distinctive patterns of sequence conservation characteristic for positive selection on RNA secondary structure.
The deep-sequencing of 12 drosophilid species coordinated by the NHGRI provides an ideal data set of comparative computational approaches to determine those genomic loci that code for evolutionarily conserved RNA motifs. This class of loci includes the majority of the known small ncRNAs as well as structured RNA motifs in mRNAs. We report here on a genome-wide survey using RNAz.
We obtain 16 000 high quality predictions among which we recover the majority of the known ncRNAs. Taking a pessimistically estimated false discovery rate of 40% into account, this implies that at least some ten thousand loci in the Drosophila genome show the hallmarks of stabilizing selection action of RNA structure, and hence are most likely functional at the RNA level. A subset of RNAz predictions overlapping with TRF1 and BRF binding sites [Isogai et al., EMBO J. 26: 79–89 (2007)], which are plausible candidates of Pol III transcripts, have been studied in more detail. Among these sequences we identify several "clusters" of ncRNA candidates with striking structural similarities.
The statistical evaluation of the RNAz predictions in comparison with a similar analysis of vertebrate genomes [Washietl et al., Nat. Biotech. 23: 1383–1390 (2005)] shows that qualitatively similar fractions of structured RNAs are found in introns, UTRs, and intergenic regions. The intergenic RNA structures, however, are concentrated much more closely around known protein-coding loci, suggesting that flies have significantly smaller complement of independent structured ncRNAs compared to mammals.
High-throughput transcriptome data obtained in particular using tiling arrays [1–6] and cDNA sequencing [7–9] in conjunction with detailed functional studies of individual genes have profoundly changed our picture of eukaryotic gene regulation by emphasizing multiple regulatory layers, many of which involve non-protein-coding RNAs (ncRNAs). In contrast to protein-coding genes, however, ncRNAs do not form a homogeneous group of transcripts but rather belong to a diverse array of classes with vastly different structures, functions, and evolutionary patterns [10–16].
Efficient computational methods [17, 18] have recently been developed to determine the genomic inventory of a large subgroup of ncRNAs, namely those that exhibit evolutionarily conserved secondary structures. As stabilizing selection acts to preserve structure in the presence of sequence variation, these transcripts are very likely to have discernible biological function – as opposed to being a mere byproduct of transcriptional noise  or gene regulation by transcriptional interference . The group of structured RNAs that we are considering here consequently includes the classical families of small ncRNAs (tRNAs, rRNAs, miRNAs, snRNAs, snoRNAs, RNAse P RNA, etc) as well as structured, usually regulatory, motifs associated with larger coding or non-coding transcripts, such as internal ribosomal entry sites, IRE, and SECIS signals see e.g. . The RNAz approach  has proven to produce rather high quality predictions. In particular, as part of the detailed analysis of the ENCODE regions , the verification of many unannotated RNAz predictions by means of RT-PCR has been reported, and for a substantial fraction of RNAz predictions corroborating evidence from high-throughput experiments has been obtained.
Computational screens for structured RNAs have been reported so far for mammalian [22, 23], urochordate , nematode , and yeast [26, 27] genomes. However, no comprehensive analysis of structured ncRNAs in insect genomes has been published so far, even though there is statistical evidence for an enrichment of structured RNAs within highly conserved non-coding elements of drosophilids . Drosophilids, which have been deeply sequenced by a consortium coordinated by the NHGRI [29–31], provide an ideal model system for this task, since their evolutionary divergence is comparable to those of mammals. As a consequence, large portions of their genomes are alignable, while at the same time there is substantial sequence variation. Both are necessary prerequisites for currently available ncRNA detection tools.
In addition to the statistical evidence for wide-spread structured RNAs in insects, two recent genome-wide experimental studies provide evidence of a large reservoir of novel ncRNAs in Drosophila melanogaster: Isogai et al.  mapped TRF1 and BRF binding sites in the D. melanogaster genome and showed that, unlike most other eukaryotes, TRF1/BRF binding appears responsible for the initiation of all classes of Polymerase-III (Pol III) transcription. As the known Pol III transcripts are small ncRNAs, their data suggests that drosophilids are likely to have a large set of previously unannotated small ncRNAs. A large-scale tiling array study of transcription in the early development of D. melanogaster  found that about 20% of the observed transcripts in D. melanogaster come from stand-alone intergenic or intronic sources and may constitute new types of RNAs, including a substantial fraction of ncRNAs.
We report here on a computational screen for structured RNA motifs in Drosophilids based on 12-species Pecan alignments provided by the Consortium. The detected RNAz hits are either (parts of) independently transcribed non-coding RNAs with evolutionarily conserved secondary structures, or they are structured elements that are parts of coding transcripts such as SECIS or IRE elements.
Overall statistics of the RNAz screen. Initial filtering of Pecan alignments leaves roughly 50% as input for RNAz respectively to the ncRNA prediction. The distribution of RNAz hits does not show a chromosomal bias. We counted the number of predicted loci and their overall length at two probability thresholds (p > 0.5, p > 0.9) for normal and also randomized alignments. Obtained relative frequencies (given as percentages) can be interpreted as false discovery rates (FDR). As expected, the FDR decreases with a higher RNAz p-value.
aligned DNA [Mb]
screened by RNAz [Mb]
RNAz p > 0.5
RNAz p > 0.9
FDR p > 0.5 hits
FDR p > 0.9
Sensitivity of the RNAz screen on known ncRNAs.
not in input
not in input
U6 not detected
A BLAST search of the drosophilid RNAz hits against the results of prior RNAz surveys of mammals , urochordates , and nematodes  yielded the following pattern of conservation: 167 tRNA hits and 11 snRNAs associated with the major spliceosome. Furthermore, we recover the U6atac snRNA (which was previously unannotated) and 5 microRNAs.
In order to estimate the false discovery rate (FDR), we repeated the screen with shuffled alignments as described in . Alignments are shuffled such that two alignment columns are swapped only if both their gap pattern and their sequence conservation pattern is the same. This amounts to a very "gentle" shuffling that in particular preserves pairwise sequence divergence within any given window. This gentle shuffling procedure may fail to remove the secondary structure signal in some cases because too few pairs of alignment columns satisfy the stringent conditions for shuffling. This is at least one reason why we observe 3 239 (7.6%) hits in which true and shuffled screen intersect, including 53 of the 756 annotated D. melanogaster ncRNAs, almost exclusively tRNAs. The estimated FDR of roughly 50% for p > 0.5 and 40% for the high quality set in Tab. 1 should therefore be regarded as pessimistic estimates.
The results of a second, more "vigorous" shuffling approach lead to much more optimistic estimates: shuffling of the columns without considering their sequence conservation or gap pattern reduces the estimated FDR by factor of 35 to only a 1–2%. One may argue, of course, that shuffling columns independently will change the gap pattern of the alignment (even though it still conserves pairwise sequence identities). Hence, this procedure may well underestimate the FDR. The dramatic difference in the result highlights a general problem that so far has not been solved in a satisfactory way, namely how to systematically construct randomized alignments that preserve all correlation features of the genomic background except the one under consideration. One important feature which must be mentioned at this point is dinucleotide content. Due to the stacking energy contributions in the folding model, dinucleotide content can affect folding energies and thus FDR estimates considerably. Since there is still no way of randomizing alignments preserving dinucleotide content, we cannot control for this effect. However, we found that, in contrast to mammalian genomes , there is no strong dinucleotide bias in the genomic background of D. melanogaster that effects folding energies. Therefore, our estimates from mononucleotide shuffled alignments will not differ dramatically from estimates one would obtain from controls with the same dinucleotide content.
The genomic distribution of structured RNA candidates in D. melanogaster is comparable to the observations in previous RNAz-based screens, see Fig. 1. As in the ENCODE data , the distribution of RNAz hits largely follows the patterns of sequence conservation. In the fly data, only 5'UTRs show a substantial enrichment relative to the input data. In contrast, the largest enrichment in the human ENCODE data was observed from 3'UTRs . The most striking difference between fly and human data is that the relative fraction of both intronic RNAz hits and intronic sequence conservation is twice as large in human.
In a recent article Manak and colleagues  describe widespread transcriptional activity in the D. melanogaster genome during 12 timepoints of early embryonic development detected by genomic tiling arrays. When comparing the RNAz hits to this data, we identify 4 236 (p > 0.5) and 1 713 (p > 0.9) hits that overlap a Transfrag in any of the 12 timepoints. A comparison of the fractions of RNAz hits from normal and control screen which overlap Transfrags in one, several or all timepoints yields, however, no significant enrichments (see Additional file 1 for details).
In contrast, about 40% of the D. melanogaster RNAz hits in intergenic regions are located adjacent to coding sequences. This may indicate that current annotation of the fly genome lists boundaries of protein coding genes that systematically truncate the UTRs. If this is the case, however, then we would have to interpret more than 15% of the total RNAz hits as located in UTRs. Our data could be explained if a situation similar to the minifly gene is prevalent in the fly: For this gene a recent study  described several alternative poly-A sites and multiple small ncRNAs that are processed from the alternative 3'UTRs. At least one of these ncRNAs is structured: the snoRNA H1 was also detected in our screen. In any case, the structured RNAs by RNAz are on average much more closely linked to protein coding genes in flies than in human.
On the other hand, a small fraction (≈ 10%) of the intergenic RNAz hits, i.e., the tail in Fig. 2, is located much further away from CDS than expected for random placement. This suggests the existence of a distinct class of RNAz hits with a propensity for large IGRs. Most likely, these signals correspond to independently transcribed ncRNAs.
About 20% of the unannotated transcripts observed in D. melanogaster early development arise from stand-alone intergenic or intronic sources (relative to FlyBase annotation) . Only a relatively small fraction of the novel independent transcripts (5.1% of the total transcriptional output) had intergenic origin. In comparison, more than 13% [21.9% of 60%] of the transcriptional output recorded by comparable methods from the ENCODE regions has a distal intergenic source (in relation to annotated exons) . This difference is in agreement with closer association of most RNAz hits with protein coding genes in the fly.
In a recent study, Isogai et al.  identified TRF1 and BRF binding sites using high-resolution genome tiling microarrays and provided evidence that in Drosophila the alternative TRF1/BRF complex appears responsible for the initiation of all known classes of Pol III transcription. At the p > 0.9 significance level RNAz hits are about three-fold enriched in these regions. We have therefore analyzed the distribution of RNAz hits within the experimentally determined TRF1 and BRF binding regions. As reported in , most of the sites correspond to tRNAs, 7SL RNAs, and a subset of snoRNAs. In addition to these known ncRNAs, the loci contain 197 unannotated RNAz hits, which are prime candidates for novel Pol III transcripts.
In order to identify putative microRNAs, we screened all RNAz hits with RNAmicro . This results in 607 candidates, of which 541 are unannotated so far. 176 of these signals are located in annotated CDS and are therefore most likely false positives, leaving 365 plausible microRNA candidates. The recent discovery of hundreds of new human microRNAs that are not conserved beyond primates strongly suggests that "evolution of miRNAs is an ongoing process and that along with ancient, highly conserved miRNAs, there are a number of emerging miRNAs" . In the light of these data, a large number of drosophilid-specific microRNAs does not come unexpected.
Using SnoReport , RNAz hits are classified as putative box H/ACA snoRNAs, of which 4 intersect with previously annotated snoRNAs. Taking into account that for only 22 of the 250 annotated snoRNAs the annotation distinguishes between box H/ACA (3), box C/D (18), and scaRNAs (1), the small overlap with the existing annotation is not surprising. Again, recent experimental surveys in other species, including nematodes [37, 38] and mammals  have discovered a substantial number of previously unannotated snoRNAs in these species, suggesting that the current annotation of snoRNAs in D. melanogaster is also far from complete.
Finally, 1 700 RNAz hits have direct evidence for expression through ESTs that are not related to protein coding genes, i.e., through ESTs that do not intersect with the FlyBase, RefSeq, N-SCAN, Genscan, Human Proteins gene prediction and mRNA tracks of the UCSC Table Browser.
Since the 197 RNAz hits that overlap TRF1 or BRF binding regions  are good candidates without annotation for bona fide ncRNAs, we applied structure-based clustering to this small subset of our predictions to identify common secondary structures and, hence, putative novel functional RNAs. The complete clustering tree as well as a table of the most prominent clusters is given in the Additional file 1. Since all clusters have a mean pairwise identity less than 45%, structurally related candidates are typically highly diverged at sequence level.
More than 50% of the RNAz hits are found only within the melanogaster subgroup. To interpret this result we compute the ratio of newly appearing RNAz hits and the branch length for each branch in the tree leading to D. melanogaster. We observe little variation in the data, with the exception of a reduced rate of innovation along the most recent branch. This reduction is, however, most likely a methodological artefact, since the pairwise mutation distances between D. melanogaster, D. simulans, and D. sechellia are only about 0.1 and RNAz is known to be less sensitive for highly similar sequences.
Approximately 12% of the RNAz hits are conserved throughout all drosophilids. In comparison, a screen of vertebrate genomes  found about 3% of mammalian RNAz candidates (1 000 out of 36 000) to be conserved throughout vertebrates.
A comparison of true and shuffled screens furthermore indicates a small but significant decrease of the FDR with phylogenetic age of the RNAz hit.
The present computational survey of drosophilid genomes yields about 16 000 high quality predictions. Taking into account the (very pessimistically estimated) false discovery rate of about 40%, this implies that at least some ten thousand loci in the Drosophila genome show the hallmarks of stabilizing selection acting on RNA structure, and hence are most likely functional at the RNA level. The elucidation of these functions, however, remains elusive in many cases. Here, we have studied a small subset in more detail. Almost 200 RNAz hits overlap with loci that are likely to be transcribed by Pol III, strongly suggesting that these are bona fide ncRNAs. Using structural clustering, we discovered several groups of structural similar ncRNA candidates in these regions.
This number of putative ncRNAs and a regulatory RNA element is not unexpected given that about 36 000 high quality RNAz hits have been found by a similar procedure in a screen of mammalian genomes , which was based on a comparable size of the input set comprising about 103 Mb of the human genome and a similar number of putative ncRNAs was reported using the SCFG-based evofold approach .
A comparison with the results from a similar RNAz screen of the human genome  and with an analysis of ENCODE regions , shows many similarities and several striking differences. We observe a smaller fraction of intronic and larger fraction of protein coding hits (cp. Fig. 1) in flies. A comparison of the distances between RNAz hits and their nearest annotated protein coding sequence shows that structured RNAs are concentrated much more strongly around known genes in flies than in human, even when accounting for the much more compact D. melanogaster genome. This observation agrees with recent tiling array data  which showed that a much smaller fraction of intergenic transcription is truly independent from surrounding protein coding genes in flies compared to human .
The inventory of structurally conserved RNAs is only a very small subset of the total non-coding transcriptional output, which covers most of the non-repetitive genome . The current computational approach relies on substantial sequence conservation. Indeed many of the known ncRNAs that were missed in our survey were not in the input set. In fact, RNAz explicitly requires two independent signals for stabilizing selection: (1) sequence conservation so that a good alignment can be computed as input, and (2) stabilizing selection on RNA secondary structure in the presence of sequence variation. RNAz hits are therefore subject to specific selection pressures that make it highly likely that RNAz predictions have distinctive biological function. In contrast, it has been shown recently, that in some cases, such as the bithoraxoid ncRNAs of the Drosophila bithorax complex, ncRNA transcription itself, acting in cis, represses a target gene (in this case Ubx) . In such a scenario, however, we do not expect to observe high levels of sequence conservation of the non-coding transcripts or the tell-tale substitution patterns of conserved secondary structures.
For our analysis we used the Pecan  alignment of the 12 drosophilid genomes [30, 31] of the Comparable Analysis Freeze 1 (CAF1, Feb. 2006). The alignments were downloaded from [31, 41]. We favored the Pecan alignments over two other sets of drosophilid alignments that are available at [31, 42]. Visual inspection strongly suggested that Mavid alignments [43, 44] are more biased towards protein coding regions. We did not use the Multiz alignments, because they contain three additional genomes (insects), and removing those sequences would effectively require a complete realignment in order to obtain a fair comparison between screens performed on different input alignments. The Pecan alignments comprise the D. melanogaster chromosomes 2L (22.4 Mb), 2R (20.8 Mb), 3L (23.8 Mb), 3R (27.9 Mb), 4 (1.3 Mb), and X (22.2 Mb).
The current implementation of RNAz is restricted to input alignments containing at most 6 sequences and a maximum length of 400 nt due to the training of the underlying SVM . In addition, certain restrictions apply for the fraction of gaps and on the overall base composition as a consequence of the data sets that were used to train the SVM model. The original genomic alignments thus need to be re-processed. The protocol used in this contribution closely follows that of previous RNAz-based studies:
Alignments longer than 120 nt are cut into 120 nt slices in 40 nt steps, so that subsequent slices overlap in 80 nt. This default length is motivated by the fact that many structured RNAs are less than 100 nt long. Such short signals would "drown" in the noise of longer alignments that are then mostly unstructured. On the other hand, alignments that are too short do not yield reliable signals for secondary structure conservation. In a series of filtering steps, sequences were removed from the individual alignments or alignment slices if they are (a) shorter than 50 nt, or (b) contain more than 25% gap characters, or (c) have a base composition outside the definition range of RNAz (e.g. GC content > 0.75 or < 0.25).
Alignments were discarded completely if fewer than 3 sequences were left after the filtering steps, or they did not contain a D. melanogaster sequence, since this species serves as a reference and as the basis for subsequent annotation. All preprocessing steps were performed using the script rnazWindows.pl of the current release of the RNAz package .
For alignment slices with more than 6 sequences, rnazWindows.pl selects a representative subset consisting of the D. melanogaster sequence and five additional sequences in such a way that 6 sequences are as evenly distributed in the dataset as possible and approach an average pairwise sequence identity of 80%, the optimal working range of the RNAz program. In practice that means that only a single representative from nearly identical sequences is chosen, and highly divergent sequences are excluded provided there is sufficient sequence variation in the remaining alignment. For the technical details of the procedure we refer to the documentation of the RNAz package .
Tab. 1 summarizes the initial filtering steps. Roughly 50% of the nucleotides in the Pecan alignments are still contained in the RNAz input data.
RNAz was applied to the filtered input alignment slices in both reading directions. Overlapping slices with a positive ncRNA classification probability of p > 0.5 were combined using rnazCluster.pl to a single annotation element, which we will refer to as "RNAz hit". From these data, we extract a subset of high confidence RNAz hits that contain at least one slice with a prediction confidence of p > 0.9.
Numbers of positive scored RNAz windows of the control screen.
The RNAz hits were annotated using the D. melanogaster sequence as reference. We performed the following annotation steps:
We used the coordinates of a set of D. melanogaster non-coding RNAs, publicly available as gff files at  and [31, 47] to identify already known D. melanogaster ncRNAs among our predictions. Furthermore, we computed the overlap of the RNAz hits with the CDS annotations from [31, 48] (file = dmel-all-r4.3.filtered.gff) and  (file = all_caf1_DGIL_TEX.gff).
We furthermore performed BLAST  searches using rnazBlast.pl against the Rfam (version 7.0) , Noncode (version 1.0) , ncRNAdb , FlyBase (version 2006 00.2 Beta) [54, 55], and miRBase (version 9.0) .
We furthermore used tRNAscan  to annotate tRNAs and RNAmicro  to classify putative microRNAs. RNAmicro is an SVM based classification method that evaluates both thermodynamic stability and evolutionary conservation patterns.
We used SnoReport to recognize putative snoRNAs (for technical details see ). Similar to RNAmicro, SnoReport is composed of a pre-filter, a secondary structure prediction step, and a subsequent SVM-based classificator. In brief, the prefilter searches for consecutive H (pattern: ANANNA) and ACA boxes . In the second step, the constraint folding option of RNAfold  is used to compute the secondary structure subject to the constraint that both boxes remain unpaired. If this results in an snoRNA-like secondary structure, several sequence and structure features are computed and passed to an SVM for classification. The model was trained on the set of snoRNAs that can be downloaded from the snoRNABase . C/D box and scaRNAs represented the negative and H/ACA box snoRNA sequences the positive samples. Estimated positive and negative prediction values for the model used here are 80% and 99.9%, respectively.
Structure-based clustering was performed as described in : The modified Sankoff algorithm implemented in the LocARNA program is used to compute local structural alignments and their consensus structure. The clustering tree is obtained by agglomerative clustering using LocARNA alignment scores as distance measures. To avoid that large scores influence the distance transformation we define distances by d(i, j) = max(0; q - score(i, j)), where q is here the 99% quantil of all pairwise scores. Since the procedure is computationally very demanding we have restricted this type of analysis here to a small subset of RNAz hits that are likely Pol III transcripts.
The phylogenetic relationships within drosophilids are taken from the AAA (Alignment/Analysis/Annotation of 12 related Drosophila species) web site . Branch lengths are genomic mutation distances computed from 4-fold degenerate sites in all coding regions corrected for base composition as in . In order to determine the branch in the phylogenetic tree at which an RNAz hit first appears, we determine the last common ancestor (LCA) of the sequences in the corresponding input alignment and assign the RNAz hit to the branch in the tree leading to this internal node. Due to the fact that RNAz hits are a combination of single windows and each window represents a specific selection of sequences out of an n-way alignment, we only considered those sequences for the LCA analysis which are simultaneously present at all windows.
We thank the "Drosophila 12 Genomes Consortium" for providing sequence, annotation and alignment data prior to publication. Special thanks go to Venky Iyer for assistence with data retrieval. This work was supported in part by the Bioinformatics Initiative of DFG in Germany and Austrian GEN-AU project "non coding RNA".
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.