Bcheck : a wrapper tool for detecting RNase P RNA genes
© Yusuf et al. 2010
Received: 19 February 2010
Accepted: 13 July 2010
Published: 13 July 2010
Skip to main content
© Yusuf et al. 2010
Received: 19 February 2010
Accepted: 13 July 2010
Published: 13 July 2010
Effective bioinformatics solutions are needed to tackle challenges posed by industrial-scale genome annotation. We present Bcheck, a wrapper tool which predicts RNase P RNA genes by combining the speed of pattern matching and sensitivity of covariance models. The core of Bcheck is a library of subfamily specific descriptor models and covariance models.
Scanning all microbial genomes in GenBank identifies RNase P RNA genes in 98% of 1024 microbial chromosomal sequences within just 4 hours on single CPU. Comparing to existing annotations found in 387 of the GenBank files, Bcheck predictions have more intact structure and are automatically classified by subfamily membership. For eukaryotic chromosomes Bcheck could identify the known RNase P RNA genes in 84 out of 85 metazoan genomes and 19 out of 21 fungi genomes. Bcheck predicted 37 novel eukaryotic RNase P RNA genes, 32 of which are from fungi. Gene duplication events are observed in at least 20 metazoan organisms. Scanning of meta-genomic data from the Global Ocean Sampling Expedition, comprising over 10 million sample sequences (18 Gigabases), predicted 2909 unique genes, 98% of which fall into ancestral bacteria A type of RNase P RNA and 66% of which have no close homolog to known prokaryotic RNase P RNA.
The combination of efficient filtering by means of a descriptor-based search and subsequent construction of a high-quality gene model by means of a covariance model provides an efficient method for the detection of RNase P RNA genes in large-scale sequencing data.
Bcheck is implemented as webserver and can also be downloaded for local use from http://rna.tbi.univie.ac.at/bcheck
In recent years, biological sequence databases have grown exponentially. These data include a rapidly increasing number of completely sequenced genomes as well as large-scale metagenomic data sets that await annotation. For instance, the Global Ocean Sampling Expedition (GOS) deposited more than 18G metagenomic sequences already . The analysis of these data calls for new and more efficient methods of data analysis .
Non-protein-coding RNA (ncRNA) genes are abundant in genomic sequences, playing diverse important biological roles . The genomic annotation of ncRNA genes is attracting strong research focus, in particular in the context of genome annotation [4, 5] and metagenomics [6, 7]. Methods for homology-based annotation have dramatically improved over the last years. In particular, Infernal 1.0  outperforms previous methods by orders of magnitude in speed. Nevertheless, such general purpose approaches do not reach the performance levels of customized class-specific tools, in particular tRNAscan-SE  in terms of both speed and quality. Manual strategies in some cases  reach superior results, but are too time-consuming for larger projects and in most cases are hard to generalize.
tRNAscan-SE is not a single algorithm but rather a wrapper tool that combines a series of increasingly complex and expensive filters. Similarly, the major searching strategy of Rfam  is a combination of a blast -based filter followed by Infernal. The pre-filtering at sequence level with blast, however, is not ideal in particular in applications to distant homologs . Another common approach is to apply a descriptor of sequence and structural motif to predict ncRNA homologs. The descriptor construction is a manual process requiring expert knowledge. Several descriptor languages have been developed, e.g. RNAmot , PatScan , HyPaL , RNAMotif , and rnabob , which is also used here.
RNase P RNA, possibly a remnant of the RNA world , is an important ribozyme involved in the processing of pre-tRNAs . Its gene is usually designated as rnpB in bacteria. A variable number of protein components  facilitate substrate binding . RNase P RNA exists in almost all organisms, but is absent in human mitochondrions . So far, there is compelling evidence for the loss of RNase P RNA only in a single organism, the archaeon Nanoarchaeum equitans . It is not unlikely, however, that plants, red algae, and heterokonts , some bacteria (e.g. Aquifex aeolicus [24, 25]) and some additional archaea (e.g. Pyrobaculum aerophilum ) have lost their RNase P RNA. The archaeon Methanothermobacter thermoautotrophicus may be a transition towards the loss of RNase P RNA, which is catalytically inactive in this organism but can be "repaired" by a few substitutions .
The RNase P RNA structures can be broadly assigned to five subfamilies: the bacterial types A and B (bacA and bacB), the archaeal types A and M (arcA and arcM), and a single eukaryote group (nucA) . In addition, two eukaryotic subtypes in fungi (fugA & fugB) can be identified . The types arcA and bacA, which have been identified as ancestral states , cover the majority of microbial rnpB genes, forming diverse sets in terms of both sequence and structure variation. The derived types arcM and bacB, in contrast, have more uniform members. The diversity is largest among eukaryotic RNase P RNA genes. In eukaryotes, RNase P RNA is transcribed by polymerase III . The human promoter elements were described recently  to contain TATA-box, PSE, Oct and SP1/SPH elements within 100 nt upstream of the transcription initiation site. A comparison of all eukaryotic promoter elements showed weak similarities only in the TATA box.
In this contribution we are concerned with the detection of RNase P RNA genes in genomic data from all domains of life. In , a pattern matching based pipeline for efficient rnpB gene prediction has been proposed. It is not applicable to large-scale database searches in practice, however. Here, we present Bcheck, a wrapper, to perform efficient rnpB gene prediction by combining the fast filtering with rnabob  and the sensitive validation by Infernal. The construction of such a method entails two tasks: the design of an efficient yet sensitive descriptor model (DM) that acts as a filter, and the derivation of a sensitive statistics covariance model (CM). Both components are based on a careful analysis of published RNase P RNA sequences and structures. The success of Bcheck depends on the efficiency and predictive power of both models, as well as a sensible wrapping algorithm that optimizes the interplay of DM and CM.
The construction of effective models of RNase P RNA genes is a non-trivial task because of the lack of strong family-specific conservation. Our strategy was to first classify the training sequences into the seven sub-families identified in the literature: arcA, arcM, bacA, bacB, nucA, fugA and fugB. The training set consists of sequences from the RNase P RNA Database  with intact secondary structures annotated, and additional RNase P RNA sequences from the Rfam and from two recent publications [23, 36]. A set of randomized decoys as well as randomized genomic sequences were constructed using ushuffle  in order to determine the false positive rates.
The training of both DM and CM requires structural alignments, whose quality is crucial for both the automatic learning procedures of the Infernal CMs and the manual construction of the DMs. We adopted a multi-step strategy: The RNase P RNA sequences were first divided into structural elements, then base-pairing regions were structure-aligned manually, and loop regions were sequence-aligned by means of MUSCLE . Local alignments were then recombined into a "raw" global alignment for each subfamily. These alignments contained errors, mainly caused by local foldings which were not fitting to the conservation patterns shared by the majority of members. We applied RNAfold [39, 40] to check the thermodynamic plausibility of local structure elements. Construction of DMs started with these alignments. In the course of DM construction, outliers were temporarily removed from the alignments, searched with the preliminary DMs, which provided additional information to guide the re-insertion of the outlier into the alignment.
Since efficiency was the major focus of DM training, we focused on selected features of local regions. To gain consensus sequence information, each alignment column was summarized and assigned with standard, "ambiguous" IUPAC nucleotides (taking into account every nucleotide appearing in the column) or the gap character (whenever the column contained at least one gap). The sequence was edited to take established structural knowledge into account. The resulting consensus sequence was then annotated in the alignment.
For the subfamilies arcA, bacA, and nuc with strong variation, we constructed two variants, "DM selective" and "DM general", with different parameter settings. The selective DMs miss a few aberrant RNase P RNA genes, while the "DM general" models have a larger false positive rate. For each subfamily, only one CM is needed and automatically generated based on global structural alignment using the tools of the Infernal package.
To distinguish functional copy and pseudogene of eukaryotes, we analyzed their promoter regions. For this purpose we aligned 100 nt upstream of Polymerase III transcripts of the same organism and compared the RNase P RNA gene predictions. While the absence of a recognizable promoter signal does not prove that the RNase P RNA gene is not functional, we have adopted a conservative policy and include only bcheck results for which the presence of a promoter signal provides additional confirmation.
Summary of predicted microbial rnpB for GenBank genome data set.
The GenBank files contained annotated rnpB genes for 365 bacteria and 22 archaea, all of which were among the Bcheck predictions. We then compared start-end positions of Bcheck predictions and GenBank annotations. Only 25% of the annotations agree within a discrepancy of 5 nt or less at both ends.
Inspection of sequences and predicted secondary structures shows that the published sequences are in general incomplete, lacking nucleotides at the 5' end of P4 and 3' end of P4: At the 5' end, 66% known annotations miss flanking regions of P4, ranging from 30 to 90 nucleotides. At the 3' end, 56% known annotations miss flanking regions of P4', ranging from 10 to 20 nucleotides. A few of the GenBank annotations, furthermore, have promoter or terminator sequences included. Bcheck thus provides a substantial improvement also of the existing annotations in most cases.
Evaluation of the five major discrepancies between GenBank annotation and Bcheck results.
Rfam CM scores
R. typhi wilmington
P1, P3, P9, P10
The subfamily distribution of microbial rnpB. No hit was obtained with any of the eukaryotic DMs.
The GOS metagenomic sequences were obtained from the CAMERA project . Due to the taxonomic uncertainty of the GOS data set, all models of archaea, bacteria and eukaryotes were applied to search over 10 million sequences comprising about 18 G. No hit was produced by any of the eukaryotic models.
In total, Bcheck predicted 4675 rnpB genes with median E -values of 10-78. In 211 cases two models overlap. In these cases there is a clear difference in E -values, so that the assignment to domains is unambiguous in all positive cases. After duplication removal, 2909 rnpB sequences are unique, 2857 of which belong to bacA, 49 to arcA, 3 to bacB, but none for arcM, see Table 3.
Summary of predicted RNase P RNA genes in eukaryotes. "Known" refers to organisms with annotated RNase P RNA genes, whereas "unknown" refers to organisms with no annotated RNase P RNA genes.
Strong promoter signals were identified for Meloidogyne hapla and Monosiga brevicollis, supporting that these candidates are functional copies. For 36 metazoan genomes, Bcheck made multiple predictions. In at least 16 cases, additional predictions are highly similar to the functional copy. In genome sequence assembly, the reads originating from different copies of a repeat appear identical and cause assembly errors.
Therefore these predictions seem to be due to assembly errors rather than constituting true paralogs. In the other 20 cases differences in the flanking regions and the RNase P RNA gene itself indicate that we see the results of gene duplications. Even though the promoter might be specific for an organism, it may differ from other polymerase III transcripts within one species. In each of these cases, a presumably functional RNase P RNA like promoter structure was found for only one of the copies. Similar duplication patterns are observed in closely related primate, fish and rodent. For instance, both Homo sapiens and Pan troglodytes have functional copies on chromosome 14 and a pseudogene on chromosome 4. Among teleosts, both Danio rerio and Gasterosteus aculeatus have their functional copies and pseudogenes on the chromosome 2. In rodent family, Rattus norvegicus and Mus musculus have the pseudogenes spreading on at least 4 different chromosomes.
Novel RNase P RNA genes were detected by Bcheck in many fungi, which had not been analyzed in much detail so far. Only eight sequences which were reported before were not recognized by Bcheck. In 119 sequences Bcheck found 37 novel RNase P RNA genes. The remaining 82 genomes are either unfinished drafts, so that the RNase P RNA is not contained in the data, or they belong to clades where RNase P RNA may be absent. In plants, red algae, and heterokonts RNase MRP RNA, an ancient paralog of RNase P RNA, is well described [23, 34]. One may speculate that it substitutes for RNase P RNA in these clades, in particular given that multiple copies of RNase MRP RNA are present in plant genomes. A recent, detailed overview of the evolution of RNase P RNA and its associated proteins can be found in [23, 44]. An incomplete genome assembly explains e.g. the deviant RNase P RNA in the genome of the elephant (Loxodonta africanus), which shows a canonical sequence interrupted by a run of Ns in the latest assembly (Loxafr3.0). On the other hand, we suspect that we missed the RNase P RNA in some fungi and in some of the basal eukaryotes because of highly divergent sequence and secondary structure.
Bcheck was written in Python (version 2.5.2). Input consists of DNA or RNA sequences in fasta format, predictions are output with fasta format or with secondary structure annotated. Besides the default search algorithm, Bcheck also gives the option for searching with CM only. However, CM-only search is at least 100 times slower.
We set up a Bcheck webserver to facilitate online RNase P RNA gene prediction. A searchable rnpB database was developed, including genes for 1005 microbial organisms, 147 eukaryote organisms and 4756 GOS sample sequences. The predicted pseudogenes for eukaryote organisms are also included. The "rnpB database" uses a hierarchical tree structure, consisting of 5 tables, implementing preorder tree traversal algorithm to process the query efficiently. blast is also offered in the server for homology search against the database compromising 777 unique rnpB genes. The server can be accessed at http://rna.tbi.univie.ac.at/bcheck. The Bcheck -pipeline can also be downloaded from the same location for local use in a Linux environment.
The rapidly increasing size of sequence databases requires efficient tools for data analysis. In particular, homology annotation of small ncRNAs, with their short and often poorly conserved sequences poses a severe problem for large-scale annotation. Here, we describe Bcheck, an efficient pipeline to determine RNase P RNA genes across all three domains of life. In order to deal with the high variability of the RNase P RNA sequences and structures, we employ descriptor-based models specific for sub-families instead of a single pattern to construct more efficient filters. In the second step, improved covariance models are used to validate the candidates from the DM step and to determine nearly exact gene boundaries.
With Bcheck, we were able to determine the RNase P RNA genes in 59 out of 68 archaea, 946 out of 956 bacteria, and 147 out of 237 eukaryotes. 61% of the prokaryotic sequences and 25% of eukaryotic results were not annotated previously. The quality of the predicted rnpB genes is much better than a large fraction of the - usually blast -based - annotation available through GenBank. The size and diversity of eukaryote genomes brings with it a particular challenge in finding RNase P RNA genes, because this diversity is reflected in many aberrant features of the RNase P RNA itself. Using the fungi-specific DMs, we uncovered 32 previously unannotated sequences, which are downloadable from the Bcheck server. As in previous studies, we did not find RNase P RNA candidates in plants and Heterokonta.
Since Bcheck is more than 100 times faster than the direct application of Infernal (version 1.0), it is suitable in particular as a tool to screen large high-throughput sequencing data. With only a handful of false negatives (10 out of 1005 prokaryotes), Bcheck provides a highly efficient way to annotate newly sequenced genomes. A particular strength of Bcheck is its applicability to metagenomics data.
At present, Bcheck models were built on the conserved sequence and secondary structure features of a large sample of RNase P RNA genes. Conceivably, the predictive power of the pipeline could be improved further by including additional information. For instance, promoter and terminator regions might be utilized. A recent survey for 7SK RNAs capitalized largely on the conserved features of the characteristic pol-III promoter signals of this ncRNA class . A similar strategy might allow a further relaxation of the DM pattern in favor of a second filter utilizing the promoter and terminator motifs.
We are grateful to Prof. Dr. Thomas Rattei (Department of Microbial Ecology, University of Vienna) for making the GOS dataset available to us in convenient form, and Dipl.-Inf. Henry Wirth (Interdisciplinary Centre for Bioinformatics, University of Leipzig) for producing the boxplot figure. This work has been funded by the Austrian GEN-AU projects "bioinformatics integration network III", and in part supported by Landesstipendium Sachsen.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.