RNRdb, a curated database of the universal enzyme family ribonucleotide reductase, reveals a high level of misannotation in sequences deposited to Genbank
© Lundin et al; licensee BioMed Central Ltd. 2009
Received: 14 July 2009
Accepted: 8 December 2009
Published: 8 December 2009
Ribonucleotide reductases (RNRs) catalyse the only known de novo pathway for deoxyribonucleotide synthesis, and are therefore essential to DNA-based life. While ribonucleotide reduction has a single evolutionary origin, significant differences between RNRs nevertheless exist, notably in cofactor requirements, subunit composition and allosteric regulation. These differences result in distinct operational constraints (anaerobicity, iron/oxygen dependence and cobalamin dependence), and form the basis for the classification of RNRs into three classes.
In RNRdb (Ribonucleotide Reductase database), we have collated and curated all known RNR protein sequences with the aim of providing a resource for exploration of RNR diversity and distribution. By comparing expert manual annotations with annotations stored in GenBank, we find that significant inaccuracies exist in larger databases. To our surprise, only 23% of protein sequences included in RNRdb are correctly annotated across the key attributes of class, role and function, with 17% being incorrectly annotated across all three categories. This illustrates the utility of specialist databases for applications where a high degree of annotation accuracy may be important. The database houses information on annotation, distribution and diversity of RNRs, links to solved RNR structures, and can be searched through a BLAST interface. RNRdb is accessible through a public web interface at http://rnrdb.molbio.su.se.
RNRdb is a specialist database that provides a reliable annotation and classification resource for RNR proteins, as well as a tool to explore distribution patterns of RNR classes. The recent expansion in available genome sequence data has provided us with a picture of RNR distribution that is more complex than believed only a few years ago; our database indicates that RNRs of all three classes are found across all three cellular domains. Moreover, we find a number of organisms that encode all three classes.
Ribonucleotide reductases (RNRs) form a universal enzyme family that catalyses the reduction of ribonucleotides to their corresponding deoxyribonucleotides. Ribonucleotide reduction provides the sole biological means for de novo synthesis of the building blocks of DNA, making it an essential cellular function. Sequence and structural data indicate that ribonucleotide reduction has arisen only once during evolution [1, 2] and all RNRs make use of a common thiyl radical-based mechanism for catalysis.
Table 1. General characteristics of RNR classes
Entries recovered from the table: α or α2; Ia: NrdA, NrdB; Ib: NrdE, NrdF; specific activase: NrdG; Tyr122 (in β); Gly580 (in α); Bacteriophage; one eukaryotic virus
Given that the capacity to synthesise deoxyribonucleotides is essential for DNA replication, the operational differences between classes of RNR suggest that the type of RNR any given organism carries will have an impact on the environmental conditions in which that organism can grow and reproduce. The effect of environmental parameters such as iron, cobalt and oxygen availability on the biochemistry of ribonucleotide reduction may thus inform our understanding of the adaptability of microorganisms to a range of environments. An overview of the distribution and diversity of RNR classes, particularly among microbes, is therefore of interest in delimiting environmental range. To facilitate progress in these areas, we have established the Ribonucleotide Reductase database (RNRdb), a manually annotated and curated data source for annotation and comparative investigations centred on RNR biology.
Construction and content of RNRdb
Distribution of RNR proteins
The alignment of candidates to known experimentally annotated RNR sequences also provides an initial indication of the potential presence of self-splicing introns and inteins in the RNRs. Putative intein sequences within candidate RNR sequences are manually curated with the aid of the BLAST function of the InBase database (The Intein Database and Registry, http://www.neb.com/neb/inteins.html). Candidate self-splicing intron sequences are identified within RNR genes by manual secondary structure folding of the presumed intronic RNA according to the conventional folding suggested by Cech et al.
Database content is not versioned by release; instead, the database is continuously updated with new sequences. The database user interface, in contrast, follows a release scheme and is currently at version 1.3. On each page, the date when data were last inserted or corrected is displayed together with the version number of the interface. At the time of writing (July 2009), the database contains over 2000 cellular organisms and viruses and over 9000 protein sequences (Table 2). The main sequence data source is GenBank, but this is augmented at times with other data sources when high-quality sequences are available that have not yet been uploaded to GenBank. At the time of writing, we have downloaded and screened additional sequence data from the Joint Genome Institute (http://www.jgi.doe.gov/), the Broad Institute (http://www.broad.mit.edu/) and the University of Tokyo Cyanidioschyzon merolae Genome Project database (http://merolae.biol.s.u-tokyo.ac.jp/) in addition to data from GenBank.
Structures have been solved for representatives of all RNR proteins except the class III activase, NrdG, and the regulatory protein, NrdR. RNRdb contains annotations and descriptions for all published RNR structures, together with links to the structure files in the Protein Data Bank (http://www.rcsb.org/pdb/).
Each RNRdb entry contains the full amino acid sequence and cross-references to the source databases for sequence and protein structure information, as well as genomic location, when known. In addition to the classification of each sequence (by class and subunit), further attributes are listed, enabling retrieval of proteins with solved structures, experimentally derived mutational data (including the corresponding PubMed references), and intervening self-splicing sequences, i.e. group I and II introns and inteins; these are cross-referenced when applicable. The system for managing attributes is flexibly implemented, allowing new classification attributes to be added during curation.
Each sequence is linked to a source organism or virus record, which in turn is linked to its full NCBI taxonomy hierarchy, allowing filtering of sequences based on taxonomy (see below). Organisms and viruses with fully sequenced genomes are labelled, making it possible to establish whether, for any given organism, the list of annotated RNRs is based on complete or incomplete genome sequence data. RNRdb also contains information about genomes that lack RNRs (determined through candidate screening of complete genome sequences, as described above). As of July 2009 there are only five such cases among cellular organisms: three bacteria (Borrelia burgdorferi [23, 24], Buchnera aphidicola str. Cc and Ureaplasma urealyticum [26, 27]) and two eukaryotes (Entamoeba histolytica [28, 29] and Giardia lamblia [30, 31]). These are all parasites or obligate intracellular endosymbionts, and the absence of RNRs indicates that all must rely on salvage of host-derived deoxyribonucleotides.
Utility of RNRdb
Motivation for building a specialist RNR database
The RNRdb homepage contains tabs linking to a short introduction on RNRs ("About RNRs"), a glossary, and a list of key literature references ("Bibliography"). We have focused the user interface of the database on tools for exploring RNR diversity and distribution, and on serving as an annotation resource for specialists. There are four main access points to RNR sequences in the database:
i) The "RNRs by organism" page presents the entire database in tabular format. By scrolling down the page or using the browser to search for text strings, the user can explore the distribution of RNRs in the three cellular domains as well as among viruses. Proteins with additional attributes (solved structure, mutagenised forms, self-splicing introns and inteins) and fully sequenced genomes are indicated by red superscript abbreviations; clicking on these superscripts links to explanations. Following an organism or protein hyperlink presents the user with all RNR sequences for that organism or the chosen protein respectively, together with classification information and cross-references to other databases.
ii) The "Search" page enables searches of the database at any taxonomic level, ranging from all cellular domains and viruses down to single species. Furthermore, organisms possessing or lacking particular RNR classes and/or the NrdR regulator can be retrieved. It is also possible to retrieve proteins with specific annotated attributes (e.g. inteins) or to restrict the search to completely sequenced genomes. All three aspects of searches can also be combined, allowing searches for, e.g., enterobacterial class I RNRs for which solved structures exist.
iii) The "BLAST" page permits searches of RNRdb using either protein (blastp) or DNA sequence (blastx) data. The BLAST search interface can be used to annotate unknown sequences, or to investigate annotations in other databases.
To facilitate data acquisition for comparative analyses, sequences (including those returned by a specific search) can be downloaded in FASTA or NEXUS format via the protein detail pages. Subsets of sequence data from returned searches or from the "RNRs by organism" page can also be selected manually via checkboxes and downloaded as above.
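The combination of search criteria and sequence download described above can be illustrated with a minimal sketch. It is not the RNRdb implementation: the record layout, the `search` and `to_fasta` functions, the `EXAMPLE_n` identifiers and the placeholder sequences are all assumptions made for illustration, and no real accession or sequence is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RnrRecord:
    accession: str         # hypothetical identifier, not a real accession
    organism: str
    lineage: tuple         # taxonomy path, root first
    rnr_class: str         # "I", "II" or "III"
    attributes: frozenset  # e.g. {"structure", "intein"}
    complete_genome: bool
    sequence: str          # placeholder residues only


def search(records, taxon=None, rnr_class=None, attribute=None, complete_only=False):
    """Return the records that satisfy every criterion that was supplied."""
    hits = []
    for rec in records:
        if taxon is not None and taxon not in rec.lineage:
            continue
        if rnr_class is not None and rec.rnr_class != rnr_class:
            continue
        if attribute is not None and attribute not in rec.attributes:
            continue
        if complete_only and not rec.complete_genome:
            continue
        hits.append(rec)
    return hits


def to_fasta(records, width=60):
    """Serialise records as FASTA text, wrapping sequences at `width` residues."""
    lines = []
    for rec in records:
        lines.append(f">{rec.accession} {rec.organism} class {rec.rnr_class}")
        for start in range(0, len(rec.sequence), width):
            lines.append(rec.sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Toy data: identifiers, attributes and sequences are invented for the example.
RECORDS = [
    RnrRecord("EXAMPLE_1", "Escherichia coli K12",
              ("Bacteria", "Proteobacteria", "Enterobacteriaceae"),
              "I", frozenset({"structure"}), True, "ACDEFGHIKLMNPQRSTVWY"),
    RnrRecord("EXAMPLE_2", "Thermotoga maritima",
              ("Bacteria", "Thermotogae"),
              "II", frozenset({"structure"}), True, "ACDEFGHIKLMNPQRSTVWY"),
    RnrRecord("EXAMPLE_3", "Example phage",
              ("Viruses",),
              "III", frozenset(), False, "ACDEFGHIKLMNPQRSTVWY"),
]

# Enterobacterial class I RNRs for which a solved structure is recorded.
hits = search(RECORDS, taxon="Enterobacteriaceae", rnr_class="I", attribute="structure")
print(to_fasta(hits, width=10))
```

Treating each criterion as optional mirrors the way the three aspects of a search can be used separately or together; criteria that are not supplied simply do not filter.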
Our knowledge of the distribution of ribonucleotide reductases has expanded rapidly over the last few years. Until recently, the distribution of the three classes was considered rather limited, and, as a domain, only bacteria were thought to possess the full gamut of classes. Class I RNRs were thought to be absent from archaea, and no sequences for classes II and III were known from eukaryotes. While whole genome data have expanded this picture, annotations in other public databases (mainly GenBank) are often uninformative as regards RNR class and subunit type, meaning that this has to be checked manually (Table 3). Moreover, a number of genomes carry clear misannotations, and protein family databases do not always correctly categorise RNRs. Searching Pfam, for instance, returns adequate descriptions of two of the structural domains ("ATP cone" and "Glycine radical") of the Escherichia coli K12 catalytic subunit class III protein, but no family or structural domain with clear reference to RNRs. Searching with the E. coli K12 class Ia catalytic subunit protein sequence and the Thermotoga maritima class II protein sequence returns in both cases the "Ribonucleotide reductase, all-alpha domain" and the "Ribonucleotide reductase, barrel domain" family. It is thus difficult to tell from Pfam searches that the class III sequence is an RNR and that the class I and class II sequences are from different classes. Although other protein family databases have broader coverage (e.g. Pfam, InterPro and PhyloFacts), our approach with HMM profiles followed by manual curation yields more accurate descriptions (unpublished observations).
RNRdb thus offers a first clear overview of the distribution of the three classes of ribonucleotide reductase. The data curated in RNRdb make it clear that all three classes of ribonucleotide reductase are found in all three organismal domains. Around half of all sequenced species carry genes for only one class of RNR, but among those with more than one RNR class, two eukaryotes, one archaeon, and 54 (7.5%) of the fully sequenced bacterial genomes harbour genes for all three RNR classes (Table 4). The varying environmental and biochemical conditions under which each class of RNR can synthesise deoxyribonucleotides (Table 1), the complex distribution of the three classes across genomes, and the frequent presence of more than one complete set of RNR genes per genome suggest a role for horizontal gene transfer in forming this distribution. Evolutionary genomic analyses support this view (Lundin et al., in prep.).
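The comparison between curated and deposited annotations can be pictured as a per-sequence tally of how many of the three key attributes (class, role and function) agree. The sketch below is an illustration only, not the procedure used to produce the reported figures: the dictionary layout, the function names and the toy annotation pairs are assumptions, and an uninformative (missing) deposited value is assumed to count as not correct.

```python
from collections import Counter

KEYS = ("class", "role", "function")


def count_matches(curated, deposited):
    """Number of key attributes on which the deposited annotation agrees."""
    # A missing deposited value never equals a curated value, so it counts
    # as a mismatch (an assumption of this sketch).
    return sum(1 for key in KEYS if deposited.get(key) == curated[key])


def agreement_summary(pairs):
    """Fraction of sequences with 0, 1, 2 or 3 correctly annotated attributes."""
    tally = Counter(count_matches(curated, deposited) for curated, deposited in pairs)
    total = sum(tally.values())
    if total == 0:
        raise ValueError("no annotation pairs supplied")
    return {n: tally.get(n, 0) / total for n in range(len(KEYS) + 1)}


# Toy pairs of (curated, deposited) annotations, invented for the example.
CURATED = {"class": "I", "role": "catalytic subunit",
           "function": "ribonucleotide reductase"}
PAIRS = [
    (CURATED, dict(CURATED)),                                  # all three agree
    (CURATED, {"class": "I", "function": "ribonucleotide reductase"}),
    (CURATED, {"function": "ribonucleotide reductase"}),
    (CURATED, {"function": "hypothetical protein"}),           # none agree
]

summary = agreement_summary(PAIRS)
print(summary[3], summary[0])  # fully correct and fully incorrect fractions
```

In this framing, `summary[3]` and `summary[0]` correspond to the "correct across all three attributes" and "incorrect across all three" categories respectively.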
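Counts of this kind amount to grouping genomes by the set of RNR classes they encode. The following sketch shows one way to perform such a tally; the genome names and class assignments are invented, and the function name is an assumption. It also works through the arithmetic implied by the text: if 54 genomes make up 7.5% of the fully sequenced bacterial genomes, the total is roughly 720, bearing in mind that the percentage is rounded.

```python
from collections import Counter


def class_combinations(classes_by_genome):
    """Count genomes for each distinct combination of RNR classes."""
    return Counter(frozenset(classes) for classes in classes_by_genome.values())


# Toy data: genome names and class content are hypothetical.
TOY_GENOMES = {
    "genome_A": {"I"},
    "genome_B": {"I", "III"},
    "genome_C": {"I", "II", "III"},
    "genome_D": {"II"},
    "genome_E": set(),  # a genome lacking RNRs
}

combos = class_combinations(TOY_GENOMES)
all_three = combos[frozenset({"I", "II", "III"})]
single_class = sum(n for combo, n in combos.items() if len(combo) == 1)

# Arithmetic from the text: 54 genomes reported as 7.5% of the total.
implied_total = round(54 / 0.075)

print(all_three, single_class, implied_total)
```

Using a `frozenset` as the key makes the combination independent of the order in which classes are listed for a genome.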
Conclusion and future directions
The diverse distribution of ribonucleotide reductases was poorly appreciated before the genomic era in biological research. Before the establishment of RNRdb, this information was difficult to navigate because database annotations of class membership and subunit type were incomplete and misleading. We demonstrate that manual curation of protein sequences leads to significant improvements over existing annotations, and that there is therefore value in generating such annotation sets. Indeed, there are ongoing efforts to integrate such approaches into large-scale annotation.
Our plans for the next major release of the database, RNRdb 2.0, include tools to enable users to explore sequence diversity within the components of different RNR classes. Specifically, we are developing tools to complement the current BLAST search feature with a service that matches user submitted sequences to our set of HMM profiles, allowing a more precise and fine-grained annotation.
Interestingly, ribonucleotide reductases are the most abundant enzyme family identified in metagenomic sequencing projects, and the potential utility of relating the biochemical attributes of RNRs to environmental parameters such as oxygen levels or iron availability is clear. RNRdb 2.0 will therefore also include sequences from environmental samples and other sources where the identity of the organism cannot be established. Our vision for RNRdb 2.0 is a database where the user can explore sequence space to analyse not only which classes exist in different taxa, but also in which organisms and environments subtypes of RNR genes occur. We will continue to expand the content and scope of RNRdb, in order to further deepen our understanding of this fascinating enzyme, and to explore its utility in the metagenomic analyses of diverse microbial environments.
Availability and requirements
RNRdb is freely available at http://rnrdb.molbio.su.se.
The authors wish to thank Pernilla Larsson Birgander, Ernst Furrer and Margareta Sahlin for their valuable help in the initial curation process, and David Nord for help with structural identification of group I introns. AMP and BMS gratefully acknowledge funding from the Swedish Research Council. AMP is a Royal Swedish Academy of Sciences Research Fellow supported by a grant from the Knut & Alice Wallenberg Foundation.
- Torrents E, Aloy P, Gibert I, Rodríguez-Trelles F: Ribonucleotide reductases: divergent evolution of an ancient enzyme. J Mol Evol. 2002, 55: 138-52. 10.1007/s00239-002-2311-7.
- Poole AM, Logan DT, Sjöberg B-M: The evolution of the ribonucleotide reductases: much ado about oxygen. J Mol Evol. 2002, 55: 180-96. 10.1007/s00239-002-2315-3.
- Stubbe J, van der Donk W: Protein radicals in enzyme catalysis. Chem Rev. 1998, 98: 705-762. 10.1021/cr9400875.
- Jordan A, Reichard P: Ribonucleotide reductases. Annu Rev Biochem. 1998, 67: 71-98. 10.1146/annurev.biochem.67.1.71.
- Nordlund P, Reichard P: Ribonucleotide reductases. Annu Rev Biochem. 2006, 75: 681-706. 10.1146/annurev.biochem.75.103004.142443.
- Torrents E, Sahlin M, Sjöberg B-M: The ribonucleotide reductase family - genetics and genomics. In Ribonucleotide reductase. 2008, edited by Andersson KK, Nova Science Publishers, 17-78.
- Jiang W, Yun D, Saleh L, et al: A manganese(IV)/iron(III) cofactor in Chlamydia trachomatis ribonucleotide reductase. Science. 2007, 316: 1188-1191. 10.1126/science.1141179.
- Voevodskaya N, Lendzian F, Ehrenberg A, Gräslund A: High catalytic activity achieved with a mixed manganese-iron site in protein R2 of Chlamydia ribonucleotide reductase. FEBS Lett. 2007, 581: 3351-3355. 10.1016/j.febslet.2007.06.023.
- Jordan A, Åslund F, Pontis E, Reichard P, Holmgren A: Characterization of Escherichia coli NrdH. A glutaredoxin-like protein with a thioredoxin-like activity profile. J Biol Chem. 1997, 272: 18044-50. 10.1074/jbc.272.29.18044.
- Stehr M, Schneider G, Åslund F, Holmgren A, Lindqvist Y: Structural basis for the thioredoxin-like activity profile of the glutaredoxin-like NrdH-redoxin from Escherichia coli. J Biol Chem. 2001, 276: 35836-41. 10.1074/jbc.M105094200.
- Roca I, Torrents E, Sahlin M, Gibert I, Sjöberg B-M: NrdI essentiality for class Ib ribonucleotide reduction in Streptococcus pyogenes. J Bacteriol. 2008, 190: 4849-4858. 10.1128/JB.00185-08.
- Cotruvo JA, Stubbe J: NrdI, a flavodoxin involved in maintenance of the diferric-tyrosyl radical cofactor in Escherichia coli class Ib ribonucleotide reductase. Proc Natl Acad Sci USA. 2008, 105: 14383-14388. 10.1073/pnas.0807348105.
- Tamarit J, Mulliez E, Meier C, Trautwein A, Fontecave M: The anaerobic ribonucleotide reductase from Escherichia coli. The small protein is an activating enzyme containing a [4Fe-4S](2+) center. J Biol Chem. 1999, 274: 31291-31296. 10.1074/jbc.274.44.31291.
- Sofia HJ, Chen G, Hetzler BG, Reyes-Spindola JF, Miller NE: Radical SAM, a novel protein superfamily linking unresolved steps in familiar biosynthetic pathways with radical mechanisms: functional characterization using new analysis and information visualization methods. Nucleic Acids Res. 2001, 29: 1097-106. 10.1093/nar/29.5.1097.
- Rodionov DA, Gelfand MS: Identification of a bacterial regulatory system for ribonucleotide reductases by phylogenetic profiling. Trends Genet. 2005, 21: 385-9. 10.1016/j.tig.2005.05.011.
- Grinberg I, Shteinberg T, Gorovitz B, et al: The Streptomyces NrdR transcriptional regulator is a Zn ribbon/ATP cone protein that binds to the promoter regions of class Ia and class II ribonucleotide reductase operons. J Bacteriol. 2006, 188: 7635-44. 10.1128/JB.00903-06.
- Torrents E, Grinberg I, Gorovitz-Harris B, et al: NrdR controls differential expression of the Escherichia coli ribonucleotide reductase genes. J Bacteriol. 2007, 189: 5012-21. 10.1128/JB.00440-07.
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.
- HMMER. [http://hmmer.janelia.org/]
- Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA: Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics. 2002, 18: 1641-9. 10.1093/bioinformatics/18.12.1641.
- Perler FB: InBase: the Intein Database. Nucleic Acids Res. 2002, 30: 383-4. 10.1093/nar/30.1.383.
- Cech TR, Damberger SH, Gutell RR: Representation of the secondary and tertiary structure of group I introns. Nat Struct Biol. 1994, 1: 273-80. 10.1038/nsb0594-273.
- Boursaux-Eude C, Margarita D, Gilles AM, Barzu O, Saint Girons I: Borrelia burgdorferi uridine kinase: an enzyme of the pyrimidine salvage pathway for endogenous use of nucleotides. FEMS Microbiol Lett. 1997, 151: 257-261. 10.1111/j.1574-6968.1997.tb12579.x.
- Fraser CM, Casjens S, Huang WM, et al: Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature. 1997, 390: 580-586. 10.1038/37551.
- Perez-Brocal V, Gil R, Ramos S, et al: A small microbial genome: the end of a long symbiotic relationship?. Science. 2006, 314: 312-313. 10.1126/science.1130441.
- Kosinska U, Carnrot C, Eriksson S, Wang L, Eklund H: Structure of the substrate complex of thymidine kinase from Ureaplasma urealyticum and investigations of possible drug targets for the enzyme. FEBS J. 2005, 272: 6365-6372. 10.1111/j.1742-4658.2005.05030.x.
- Glass JI, Lefkowitz EJ, Glass JS, et al: The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature. 2000, 407: 757-762. 10.1038/35037619.
- Hassan HF, Coombs GH: Purine-metabolising enzymes in Entamoeba histolytica. Mol Biochem Parasitol. 1986, 19: 19-26. 10.1016/0166-6851(86)90061-7.
- Loftus B, Anderson I, Davies R, et al: The genome of the protist parasite Entamoeba histolytica. Nature. 2005, 433: 865-868. 10.1038/nature03291.
- Baum KF, Berens RL, Marr JJ, Harrington JA, Spector T: Purine deoxynucleoside salvage in Giardia lamblia. J Biol Chem. 1989, 264: 21087-21090.
- Adam RD: The biology of Giardia spp. Microbiol Rev. 1991, 55: 706-732.
- Finn RD, Tate J, Mistry J, et al: The Pfam protein families database. Nucleic Acids Res. 2008, 36: D281-8. 10.1093/nar/gkm960.
- Mulder NJ, Apweiler R, Attwood TK, et al: New developments in the InterPro database. Nucleic Acids Res. 2007, 35: D224-8. 10.1093/nar/gkl841.
- Krishnamurthy N, Brown DP, Kirshner D, Sjölander K: PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biol. 2006, 7: R83. 10.1186/gb-2006-7-9-r83.
- Aravind L, Wolf YI, Koonin EV: The ATP-cone: an evolutionarily mobile, ATP-binding regulatory domain. J Mol Microbiol Biotechnol. 2000, 2: 191-194.
- Bork P, Koonin EV: Predicting functions from protein sequences--where are the bottlenecks?. Nat Genet. 1998, 18: 313-8. 10.1038/ng0498-313.
- Altschul SF, Madden TL, Schaffer AA, et al: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Overbeek R, Begley T, Butler RM, et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005, 33: 5691-702. 10.1093/nar/gki866.
- Gilbert JA, Field D, Huang Y, et al: Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS ONE. 2008, 3: e3042. 10.1371/journal.pone.0003042.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.