Retrieving sequences of enzymes experimentally characterized but erroneously annotated: the case of the putrescine carbamoyltransferase
© Naumoff et al; licensee BioMed Central Ltd. 2004
Received: 08 April 2004
Accepted: 02 August 2004
Published: 02 August 2004
Annotating genomes remains a hazardous task. Mistakes or gaps in such a complex process may occur when relevant knowledge is ignored, whether lost, forgotten or overlooked. This paper exemplifies an approach which could help to resuscitate such meaningful data.
We show that a set of closely related sequences which have been annotated as ornithine carbamoyltransferases are actually putrescine carbamoyltransferases. This demonstration is based on the following points: (i) use of enzymatic data which had been overlooked, (ii) rediscovery of a short NH2-terminal sequence which allowed us to reannotate a wrongly annotated ornithine carbamoyltransferase as a putrescine carbamoyltransferase, (iii) identification of conserved motifs which distinguish unambiguously between the two kinds of carbamoyltransferases, and (iv) comparative study of the gene context of these different sequences.
We explain why this specific case of misannotation had not yet been described and draw attention to the fact that analogous instances must be rather frequent. We urge special caution when high sequence similarity is coupled with an apparent lack of biochemical information. Moreover, from the point of view of genome annotation, proteins which have been studied experimentally but are not correlated with sequence data in current databases qualify as "orphans", just as unassigned genomic open reading frames do. The strategy we used in this paper to bridge such gaps in knowledge could work whenever it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate information about gene context.
As a consequence of the deluge of completely sequenced genomes belonging to a large array of species, one can expect to identify many homologues of enzymes which have previously been well studied at the experimental level. This seems to be the general rule, and the public sequence databanks (DDBJ/EMBL/GenBank) are now inundated with putative amino acid sequences which have been annotated solely by the widely used two-step process: (1) detection of a homologous relationship by a pairwise sequence similarity search at the level of primary structure and (2) inference of functional similarity from this detected homology.
However, the opposite might be true. For various reasons one can either miss or misinterpret the actual function of a putative protein when annotating by homology, resulting in a wrong function transfer. Several studies have already emphasized this point (see, for example, [1–3]). On the other hand, besides these now well-identified errors, which are often due to automatic processes, more subtle mistakes may occur when some of the numerous effects of divergent evolution are overlooked. In particular, one of the insufficiently appreciated problems of functional assignment is that homologous proteins might catalyse different biochemical reactions. Here, we discuss an instance of erroneous annotation (misannotation) in genes of nitrogen metabolism which, to our knowledge, has not yet been brought up. We explain why this is so and draw attention to the fact that similar cases must actually be rather frequent.
Results and Discussion
Annotating distant carbamoyltransferases
Our group ([4, 5]) is presently involved in deciphering the evolutionary relationships between two ubiquitous and essential proteins, aspartate carbamoyltransferase (ATCase, EC 2.1.3.2), which catalyses the first committed step of de novo pyrimidine biosynthesis, and ornithine carbamoyltransferase (OTCase, EC 2.1.3.3), which plays a crucial role in both anabolism and catabolism of arginine.
The YgeW protein encoded by Escherichia coli and its close homologue from Clostridium botulinum are both located on a long branch emerging at the base of this OTCase tree. YgeW is annotated as belonging to the ATCase/OTCase family (see, for example, the SwissProt knowledgebase). On the branch which is next to the root we find the sequence of a protein which has been reported to be essential for arginine biosynthesis in the anaerobic bacterium Bacteroides fragilis. This protein has been crystallized and characterized as a carbamoyltransferase-like protein since it does not display OTCase activity in vitro. Indeed, several of its residues have been substituted in sites which are viewed as crucial for OTCase activity. Moreover, Dashuang et al. indicated that a similar protein has been found in Xylella fastidiosa. Our phylogenetic data are in agreement with this observation, since the proteins annotated as OTCases in two strains of X. fastidiosa and their close relatives present in two species of the Xanthomonas genus are found to branch close to that of B. fragilis. Therefore, the functional identification of these different UTC is certainly not straightforward and requires further investigation.
Furthermore, it occurred to us that, more than thirty years ago, another carbamoyltransferase was discovered by Roon and Barker. A putrescine carbamoyltransferase (PTCase, EC 2.1.3.6) was found to be synthesized by the Gram-positive bacterium Streptococcus (now Enterococcus) faecalis when it was grown on agmatine, but not on arginine, as primary energy source. This PTCase was easily separated from the OTCase synthesized by the same organism grown on arginine. This putrescine carbamoyltransferase was further studied by V. Stalon's group ([9–12]). Two features of these studies, which had apparently been overlooked in recent genome annotations, now appear to be crucial for the interpretation of the data shown in Fig. 1. (1) Wargnies et al. showed that the putrescine carbamoyltransferase of E. faecalis had a weak but unambiguous OTCase activity (7.4% in terms of Vmax, with KM for putrescine and L-ornithine of 0.029 mM and 13.0 mM, respectively). (2) A short NH2-terminal sequence (29 residues) was published ten years later. Since the complete genome of E. faecalis has now been sequenced, we could establish that one of the three putative ornithine carbamoyltransferases encoded by this genome, the open reading frame EF0732 annotated as ArgF-2, is actually the putrescine carbamoyltransferase previously studied by the group of Stalon.
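The kinetic constants quoted above can be turned into a rough substrate-preference figure. The following is only a sketch: it assumes simple Michaelis-Menten behaviour and uses Vmax/KM as a proxy for catalytic efficiency, which the original studies are not claimed to have done. The function and variable names are illustrative.

```python
def relative_efficiency(vmax_rel: float, km_mM: float) -> float:
    """Return Vmax/KM, used here as a proxy for catalytic efficiency."""
    if km_mM <= 0:
        raise ValueError("KM must be positive")
    return vmax_rel / km_mM


# Values quoted in the text for the E. faecalis enzyme.
# Vmax is expressed relative to the putrescine reaction (set to 1.0).
putrescine = relative_efficiency(vmax_rel=1.0, km_mM=0.029)
ornithine = relative_efficiency(vmax_rel=0.074, km_mM=13.0)

# Ratio of the two efficiencies: how strongly putrescine is preferred.
specificity_ratio = putrescine / ornithine
print(f"putrescine/ornithine efficiency ratio: {specificity_ratio:.0f}")
```

Under these assumptions the enzyme prefers putrescine over L-ornithine by a factor of roughly six thousand, which fits the description of the OTCase activity as weak but unambiguous.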
A family of putrescine carbamoyltransferases
The seven ornithine carbamoyltransferase sequences which have been reannotated as putrescine carbamoyltransferases.
Pediococcus pentosaceus a: Scaffold 18, Gene 459
Lactobacillus brevis a: Scaffold 15, Gene 476
Lactobacillus sakei b
Gene context, another tool for gene reannotation
Thus, these four clusters of genes appear to encode the full set of enzymes which are expected to form the catabolic agmatine deiminase pathway. Agmatine deiminase, PTCase and carbamate kinase were already known to be coinduced by agmatine in E. faecalis when it is used as sole energy source, strongly suggesting that these gene clusters are functional operons. In Pseudomonas aeruginosa PAO1, the homologous agmatine deiminase is encoded by the gene aguA, belonging to an operon aguBA induced by agmatine and N-carbamoylputrescine ([16, 17]), but in this species N-carbamoylputrescine is converted by an N-carbamoylputrescine amidohydrolase (EC 3.5.1.53, the aguB product) into putrescine, CO2 and ammonium rather than into putrescine and carbamoylphosphate. More recently, a similar pathway for polyamine biosynthesis has been identified by homology in higher plants. In the alternative pathway corresponding to the analogous sets of genes shown in Fig. 3, it is thus a PTCase which catalyzes the second step and catabolically converts N-carbamoylputrescine to putrescine and carbamoylphosphate, which can further be used to synthesize ATP via carbamate kinase.
Furthermore, when we compare the clusters of genes shown in Fig. 3 to those surrounding gene argF or arcB, encoding the catabolic OTCase functioning in the arginine deiminase (ADI) pathway present in many microbial genomes (see [20–22]), we observe a very similar distribution, namely a transcriptional regulator, the arginine deiminase (EC 3.5.3.6), an arginine/ornithine antiporter and one (sometimes two) carbamate kinase. There is thus a very close analogy between the set of genes encoding the enzymes catalyzing the different steps of the agmatine deiminase pathway found in these different Firmicutes and that encoding the enzymes catalyzing the different steps of the arginine deiminase pathway.
Genome annotation requires both reliable tools for identifying gene function and manual expertise. The frustration due to the high percentage of orphan genes found in all genomes is often compounded by another, more insidious, problem which may occur when a very strong sequence similarity obscures the actual functional identity of another kind of orphan. The analysis described in this paper illustrates the difficulty in identifying such a potential source of misannotation and delineates at least two fundamental parameters which must be considered, especially when the results appear to be straightforward. First, one must keep in mind that proteins sharing a high level of identical residues may have different functions. A routine step for challenging the functional annotation of any putative coding sequence (CDS) should be a phylogenetic analysis. Any CDS found to branch far from its homologues in an evolutionary tree, as observed in the case of the carbamoyltransferases (Fig. 1), should be examined with caution before assigning it a putative function. Another example of the usefulness of the phylogenetic approach to correct misannotations can be found in a comparative analysis of ureohydrolases.
The second parameter which must be considered is the striking lack of information in the various public databases. For example, in the case studied here (the putrescine carbamoyltransferase, EC 2.1.3.6), no sequence is reported to be available in various first-rate databases specialized in enzymatic and/or metabolic data, such as ENZYME, BRENDA, KEGG and BIOCYC, or in the Gene Ontology Consortium database (GO:0050231). A significant part of this deficit of information appears to be due to the fact that the biochemical data previously published [8–11], and well recorded in BRENDA for example, were not correlated with the incomplete amino acid sequence, which was not taken into account although it had been published by the same group who studied this enzyme.
The specific point we would like to stress in this paper reflects a more general and widely ignored gap between the enormous quantity of information buried in the sequence data and the refined knowledge built up over several decades of studies on gene regulation and protein biochemistry (recorded in the databases cited above). In this respect, experimentally studied proteins not correlated with sequence data also qualify as "orphans". In the present case, the resulting gap in knowledge could be bridged only because we used the approach detailed in this paper. After being alerted by the unusual topology of the phylogenetic tree (Fig. 1) and the rediscovery of the partial sequence, we obtained a confirmation of the reannotation as PTCases by considering their signature (Fig. 2) and the gene context (Fig. 3), which differentiate them clearly from their OTCase homologues.
The strategy we used to identify such orphan sequences could work in any other case where it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate information about gene context. Note that we incidentally used such a strategy to annotate the genes encoding a putative agmatine deiminase in the genomes listed in Fig. 3. It is highly probable that this approach can be applied to many similar cases. Therefore, our community should feel encouraged to dig into old lab books, unpublished data buried in doctoral theses and similar documents, in order to retrieve information crucial for correct genome annotation. Moreover, it is becoming urgent to design new approaches in order to efficiently explore what has been called the "bibliome". This could help to bridge important gaps in knowledge, such as the one exemplified in this paper, which lead to numerous errors in genome annotation. Accordingly, it would become possible to (re)annotate conserved hypothetical proteins for which there is an apparent lack of information in the various public databases.
Nearly 450 carbamoyltransferase (ATCase and OTCase) sequences were collected from the public databases SwissProt, TREMBL and TREMBLNEW. To facilitate the management of these data, which grow continuously with the release of newly sequenced complete genomes, we assembled them in a relational database (available on request). Moreover, in the case of unpublished but completely sequenced genomes, it was often possible to recover bona fide sequences from the sites of specific sequencing groups (Joint Genome Institute and Sanger Institute) using either BlastP or tBlastN queries. We retained only unpublished sequences aligning along their whole length with bona fide carbamoyltransferases and sharing no less than 30% identity with them, using at least two distantly related seeds.
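The retention rule described above can be sketched as a small filter. This is an illustration rather than the procedure actually used: the text does not quantify "along their whole length", so the 0.9 coverage cutoff is an assumption, as are the function names and the toy sequences, which are hypothetical and not real carbamoyltransferases.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def coverage(aligned_candidate: str, aligned_seed: str) -> float:
    """Fraction of seed residues that are aligned to a candidate residue."""
    seed_len = sum(c != "-" for c in aligned_seed)
    aligned = sum(q != "-" and s != "-" for q, s in zip(aligned_candidate, aligned_seed))
    return aligned / seed_len if seed_len else 0.0


def keep_candidate(alignments, min_identity=30.0, min_coverage=0.9, min_seeds=2):
    """Keep a candidate if enough seeds pass both thresholds.

    alignments: list of (aligned_candidate, aligned_seed) pairs, one per seed.
    min_coverage is an assumed stand-in for "aligning along the whole length".
    """
    passing = sum(
        1
        for cand, seed in alignments
        if percent_identity(cand, seed) >= min_identity
        and coverage(cand, seed) >= min_coverage
    )
    return passing >= min_seeds


# Hypothetical toy alignments against two seeds.
toy = [("MKTAYIAKQR", "MKTAYLAKQR"), ("MKTAYIAKQR", "MRTAYIGKQR")]
print(keep_candidate(toy))
```

Requiring two distantly related seeds, as the text does, reduces the chance that a candidate is retained only because it resembles one atypical reference sequence.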
Reconstructing phylogenetic trees
Rooted phylogenetic trees were derived from multiple alignments of ATCases and OTCases using two different approaches. (1) New sequences were manually added and aligned to the previously published multiple alignment using the BioEdit sequence alignment editor. These additions were made easier by introducing each new sequence near its closest partner (the first hit in a routine BlastP check). This progressive approach minimized the risk of introducing any bias when adding numerous new sequences. Nevertheless, the soundness of this manual alignment was routinely checked using automatic programmes (both ClustalX and DARWIN, see below) to verify that we did not miss any conserved motifs. We further verified this multiple alignment (especially the introduction of gaps) by using the information available from the known 3D structures of ATCases and OTCases. Maximum parsimony and distance trees were derived from this alignment using the PROTPARS and NEIGHBOR programmes of the PHYLIP package, respectively. The PHYLIP package was further used to derive confidence limits for each node of either parsimony or distance trees using a bootstrap approach (programmes SEQBOOT and CONSENSE). (2) The PhyloTree programme of the DARWIN package allows one to build a multiple alignment and to derive a distance tree which is an approximation to a maximum likelihood tree, since the deduced evolutionary distances are weighted by computing their variance when reconstructing the tree.
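The bootstrap approach mentioned in (1) rests on resampling alignment columns with replacement. The sketch below illustrates that idea only. It is not PHYLIP's SEQBOOT code, and the function names and toy alignment are hypothetical. Each replicate would then be passed to a tree-building step and the resulting trees summarized into a consensus.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping rows in register."""
    names = list(alignment)
    if not names:
        raise ValueError("alignment is empty")
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of pseudo-replicate alignments (reproducible via seed)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment of three short sequences.
toy_alignment = {
    "seqA": "MKT-AYIA",
    "seqB": "MRTLAYIG",
    "seqC": "MKTLSY-A",
}
replicates = bootstrap_replicates(toy_alignment, n_replicates=3, seed=42)
for replicate in replicates:
    print(replicate)
```

Because whole columns are sampled, every sequence in a replicate is affected in the same way, so the replicate remains a valid alignment of the same size as the original.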
We thank the DOE (Department of Energy, USA) and the Wellcome Trust (United Kingdom) for making available unpublished sequences from genomic projects produced by different Sequencing Groups at either the JGI or the Sanger Institute.
This work was supported by the Flanders Foundation for Joint and Fundamental Research and by the CNRS (UMR 8621). Daniil Naumoff was supported by a postdoctoral grant from the French Ministère de la Recherche.
- Brenner SE: Errors in genome annotation. Trends Genet. 1999, 15: 132-133. 10.1016/S0168-9525(99)01706-0.
- Gerlt JA, Babbitt PC: Can sequence determine function? Genome Biol. 2000, 1: REVIEWS0005.10-10.1186/gb-2000-1-5-reviews0005.
- Babbitt PC: Definitions of enzyme function for the structural genomics era. Curr Opin Chem Biol. 2003, 2: 230-237. 10.1016/S1367-5931(03)00028-0.
- Labedan B, Boyen A, Baetens M, Charlier D, Chen P, Cunin R, Durbecq V, Glansdorff N, Hervé G, Legrain C, Liang Z, Purcarea C, Roovers M, Sanchez R, Toong TL, Van de Casteele M, van Vliet F, Xu Y, Zhang YF: The evolutionary history of carbamoyltransferases: a complex set of paralogous genes was already present in the last universal common ancestor. J Mol Evol. 1999, 49: 461-473.
- Labedan B, Xu Y, Naumoff DG, Glansdorff N: Using quaternary structures to assess the evolutionary history of proteins: the case of the Aspartate Carbamoyltransferase. Mol Biol Evol. 2004, 21: 364-73. 10.1093/molbev/msh024.
- The SwissProt page for the YgeW protein: [http://www.expasy.org/cgi-bin/niceprot.pl?%20Q46803]
- Dashuang S, Gallegos R, De Ponte IIIJ, Morizono H, Yu X, Allewell NM, Malamy M, Tuchman M: Crystal structure of a transcarbamylase-like protein from the anaerobic bacterium Bacteroides fragilis at 2.0 A resolution. J Mol Biol. 2002, 320: 899-908. 10.1016/S0022-2836(02)00539-9.
- Roon RJ, Barker HA: Fermentation of agmatine in Streptococcus faecalis: occurrence of putrescine transcarbamoylase. J Bacteriol. 1972, 109: 44-50.
- Wargnies B, Lauwers N, Stalon V: Structure and properties of the putrescine carbamoyltransferase of Streptococcus faecalis. Eur J Biochem. 1979, 101: 143-52.
- Simon JP, Stalon V: Enzymes of agmatine degradation and the control of their synthesis in Streptococcus faecalis. J Bacteriol. 1982, 152: 676-81.
- Stalon V: Putrescine carbamoyltransferase (Streptococcus faecalis). Methods Enzymol. 1983, 94: 339-43. 10.1016/S0076-6879(83)94061-2.
- Vander Wauven C, Simon JP, Slos P, Stalon V: Control of enzyme synthesis in the oxalurate catabolic pathway of Streptococcus faecalis ATCC 11700: evidence for the existence of a third carbamate kinase. Arch Microbiol. 1986, 145: 386-90.
- Tricot C, De Coen JL, Momin P, Falmagne P, Stalon V: Evolutionary relationships among bacterial carbamoyltransferases. J Gen Microbiol. 1989, 135: 2453-64.
- Paulsen I, Banerjei L, Myers GSA, Nelson KE, Seshadri R, Read TD, Fouts DE, Eisen JA, Gill SR, Heidelberg JF, Tettelin H, Dodson RJ, Umayam L, Brinkac L, Beanan M, Daugherty S, DeBoy RT, Durkin S, Kolonay J, Madupu R, Nelson W, Vamathevan J, Tran B, Upton J, Hansen T, Shetty J, Khouri H, Utterback T, Radune D, Ketchum KA, Dougherty BA, Fraser CM: Role of Mobile DNA in the Evolution of Vancomycin-Resistant Enterococcus faecalis. Science. 2003, 299: 2071-2074. 10.1126/science.1080613.
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein Knowledgebase. Nucleic Acids Res. 2004, 32: D115-D119. 10.1093/nar/gkh131. [http://www.expasy.uniprot.org/index.shtml]
- Nakada Y, Jiang Y, Nishijyo T, Itoh Y, Lu CD: Molecular characterization and regulation of the aguBA operon, responsible for agmatine utilization in Pseudomonas aeruginosa PAO1. J Bacteriol. 2001, 183: 6517-24. 10.1128/JB.183.22.6517-6524.2001.
- Nakada Y, Itoh Y: Identification of the putrescine biosynthetic genes in Pseudomonas aeruginosa and characterization of agmatine deiminase, N-carbamoylputrescine amidohydrolase of the arginine decarboxylase pathway. Microbiology. 2003, 149: 707-14. 10.1099/mic.0.26009-0.
- Janowitz T, Kneifel H, Piotrowski M: Identification and characterization of plant agmatine iminohydrolase, the last missing link in polyamine biosynthesis of plants. FEBS Lett. 2003, 544: 258-61. 10.1016/S0014-5793(03)00515-5.
- Cunin R, Glansdorff N, Piérard A, Stalon V: Biosynthesis and metabolism of arginine in bacteria. Microbiol Rev. 1986, 50: 314-352.
- Sekowska A, Danchin A, Risler JL: Phylogeny of related functions: the case of polyamine biosynthetic enzymes. Microbiology. 2000, 146: 1815-28.
- Barcelona-Andres B, Marina A, Rubio V: Gene structure, organization, expression and potential regulatory mechanisms of arginine catabolism in Enterococcus faecalis. J Bacteriol. 2002, 184: 6289-300. 10.1128/JB.184.22.6289-6300.2002.
- Zuniga M, Perez G, Gonzalez-Candelas F: Evolution of arginine deiminase (ADI) pathway genes. Mol Phylogenet Evol. 2002, 25: 429-44. 10.1016/S1055-7903(02)00277-4.
- Bairoch A: The ENZYME database in 2000. Nucleic Acids Res. 2000, 28: 304-305. 10.1093/nar/28.1.304. [http://www.expasy.org/enzyme/]
- Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D: BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res. 2004, 32: D431-D433. 10.1093/nar/gkh081. [http://www.brenda.uni-koeln.de/]
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resources for deciphering the genome. Nucleic Acids Res. 2004, 32: D277-D280. 10.1093/nar/gkh063. [http://www.genome.ad.jp/kegg]
- Karp PD, Arnaud M, Collado-Vides J, Ingraham J, Paulsen IT, Saier MH: The E. coli EcoCyc Database: No Longer Just a Metabolic Pathway Database. ASM News. 2004, 70: 25-30. [http://www.biocyc.org/]
- The Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32: D258-D261. 10.1093/nar/gkh036. [http://www.geneontology.org/]
- Grivell L: Mining the bibliome: searching for a needle in a haystack? EMBO Reports. 2002, 3: 200-203. 10.1093/embo-reports/kvf059.
- Joint Genome Institute (Department of Energy, USA): [http://www.jgi.doe.gov/JGI_microbial/html/index.html]
- Sanger Institute (Wellcome Trust, United Kingdom): [http://www.sanger.ac.uk/Projects]
- Hall TA: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser. 1999, 41: 95-98. [http://www.mbio.ncsu.edu/BioEdit/bioedit.html]
- Felsenstein J: Inferring phylogenies from protein sequences by parsimony, distance and likelihood methods. Methods Enzymol. 1996, 266: 418-27. 10.1016/S0076-6879(96)66026-1. [http://evolution.gs.washington.edu/phylip.html]
- Gonnet GH, Hallett MT, Korostensky C, Bernardin L: Darwin v. 2.0: an interpreted computer language for the biosciences. Bioinformatics. 2000, 16: 101-103. 10.1093/bioinformatics/16.2.101. [http://cbrg.inf.ethz.ch/welcome.html]
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.