DIGAP - a Database of Improved Gene Annotation for Phytopathogens
© Gao et al; licensee BioMed Central Ltd. 2010
Received: 15 July 2009
Accepted: 21 January 2010
Published: 21 January 2010
Bacterial plant pathogens are very harmful to their host plants and can cause devastating agricultural losses worldwide. With the development of microbial genome sequencing, many phytopathogen strains have been sequenced. However, these phytopathogen genomes contain some misannotations. Our objective is to improve these annotations and store them in a central database, DIGAP.
DIGAP includes the following improved information on phytopathogen genomes. (i) All the 'hypothetical proteins' were checked, and non-coding ORFs recognized by the Z curve method were removed. (ii) The translation initiation sites (TISs) of 20% ~ 25% of all the protein-coding genes have been corrected based on NCBI RefSeq, the ProTISA database and an ab initio program, GS-Finder. (iii) Potential functions of about 10% of the 'hypothetical proteins' have been predicted using sequence alignment tools. (iv) Two theoretical gene expression indices, the codon adaptation index (CAI) and the E(g) index, were calculated to predict gene expression levels. (v) Potential agricultural bactericide targets and their homology-modeled 3D structures are provided in the database, which is of significance for agricultural antibiotic discovery.
The results in DIGAP provide useful information for understanding the pathogenic mechanisms of phytopathogens and for finding agricultural bactericides. DIGAP is freely available at http://ibi.hzau.edu.cn/digap/.
Table 1 General annotation information of the 28 plant pathogens
Columns: Genome; Genomic Length (bp); G+C content (%); Annotated ORFs in RefSeq
Acidovorax avenae subsp. citrulli AAC00-1
Agrobacterium tumefaciens str. C58
Agrobacterium vitis S4
Aster yellows witches'-broom phytoplasma strain AY-WB
Clavibacter michiganensis subsp. michiganensis NCPPB 382
Clavibacter michiganensis subsp. sepedonicus ATCC 33113
Candidatus Phytoplasma australiense
Candidatus Phytoplasma mali
Erwinia carotovora subsp. atroseptica SCRI1043
Leifsonia xyli subsp. xyli str. CTCB07
Mesoplasma florum L1
Onion yellows phytoplasma OY-M
Pseudomonas syringae pv. phaseolicola 1448A
Pseudomonas syringae pv. syringae B728a
Pseudomonas syringae pv. tomato str. DC3000
Ralstonia solanacearum GMI1000
Xanthomonas axonopodis pv. citri str. 306
Xanthomonas campestris pv. campestris str. 8004
Xanthomonas campestris pv. campestris str. ATCC 33913
Xanthomonas campestris pv. campestris str. B100
Xanthomonas campestris pv. vesicatoria str. 85-10
Xanthomonas oryzae pv. oryzae MAFF 311018
Xanthomonas oryzae pv. oryzae KACC10331
Xanthomonas oryzae pv. oryzae PXO99A
Xylella fastidiosa M12
Xylella fastidiosa M23
Xylella fastidiosa 9a5c
Xylella fastidiosa Temecula1
However, due to the absence of abundant experimental information, many misannotations still exist in the sequenced bacterial genomes, especially in GC-rich genomes [1–6]. Firstly, many bacterial genomes contain false-positive gene identifications, i.e., some open reading frames (ORFs) are incorrectly predicted as protein-coding genes; most of them are short ORFs (<150 bp) without functional information [1–3]. Secondly, many annotated genes have wrong translation initiation sites (TISs). It has been reported that up to 60% of the annotated genes in 143 prokaryotic genomes have wrong TISs in GenBank [7] or RefSeq [8], especially in GC-rich genomes. Thirdly, a large number of 'hypothetical proteins' of unknown function are annotated in public databases, accounting for 30% ~ 50% of the genes in different genomes [5, 6]. These problems are even more serious in phytopathogen genomes because most of them are GC-rich (>50%). Here, we have constructed DIGAP to correct some of these mistakes and provide improved annotations for these plant pathogens.
Construction and content
Finding non-coding ORFs from annotated 'hypothetical ORFs'
The method adopted here was based on the Z curve of DNA sequences [9], which has been successfully applied to find genes in prokaryotic and some eukaryotic genomes [3, 10–12]. In the present analysis, 21 variables are adopted: 9 phase-dependent single-nucleotide variables and 12 phase-independent di-nucleotide variables. For details see [Additional file 1].
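As an illustration of how a 21-variable feature vector of this kind can be built, the following is a minimal sketch that assumes the standard Z-curve transform (purine/pyrimidine, amino/keto and weak/strong hydrogen-bond components). The function name zcurve_21 is hypothetical, and the exact variable definitions and the classifier applied to them are those given in Additional file 1, not this sketch.

```python
from itertools import product

BASES = "ACGT"


def zcurve_21(orf: str) -> list[float]:
    """Return 9 phase-dependent mononucleotide and 12 phase-independent
    dinucleotide Z-curve variables for one ORF (illustrative sketch)."""
    seq = orf.upper()
    features = []

    # 9 phase-dependent variables: base frequencies at codon positions 1, 2, 3.
    for phase in range(3):
        column = seq[phase::3]
        n = len(column) or 1
        a, c, g, t = (column.count(b) / n for b in BASES)
        features += [
            (a + g) - (c + t),  # x: purine vs pyrimidine
            (a + c) - (g + t),  # y: amino vs keto
            (a + t) - (g + c),  # z: weak vs strong hydrogen bonding
        ]

    # 12 phase-independent variables: dinucleotide frequencies over all positions,
    # transformed separately for each choice of first base.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    m = len(pairs) or 1
    freq = {x + y: pairs.count(x + y) / m for x, y in product(BASES, repeat=2)}
    for first in BASES:
        pa, pc, pg, pt = (freq[first + second] for second in BASES)
        features += [
            (pa + pg) - (pc + pt),
            (pa + pc) - (pg + pt),
            (pa + pt) - (pg + pc),
        ]
    return features
```

Each ORF then becomes a point in a 21-dimensional space, in which coding and non-coding sequences can be separated by a discriminant function trained on reliable genes.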
Relocating translation initiation sites
ProTISA is a recently constructed database that provides experimentally confirmed and theoretically refined TISs for hundreds of prokaryotic genomes [13]. In addition, an ab initio TIS identification program, GS-Finder [14], was employed to refine TISs in these plant pathogens. A joint-jury method was used to make the final decision: if two of the three systems (RefSeq, ProTISA and GS-Finder) gave the same TIS, that site was predicted to be the true TIS. Because ProTISA is a comprehensive resource that contains conserved domain confirmed (CDC) and high similarity confirmed (HSC) information for TISs [13], the site provided by ProTISA was adopted if the three systems predicted different TISs. Five phytopathogen genomes (Av 4, Cms, Cpa, Xcc 100 and Xoo 99A) were not contained in ProTISA, so only GS-Finder was used to relocate TISs for these five genomes.
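The decision rule described above can be sketched as follows. This is only an illustration: the function name choose_tis and the representation of a TIS as an integer coordinate (with None meaning that a source has no prediction for the gene) are assumptions, not part of the DIGAP pipeline.

```python
from collections import Counter
from typing import Optional


def choose_tis(refseq: Optional[int],
               protisa: Optional[int],
               gs_finder: Optional[int]) -> Optional[int]:
    """Pick a TIS coordinate by the joint-jury rule (illustrative sketch)."""
    # Genomes not contained in ProTISA: only GS-Finder is used.
    if protisa is None:
        return gs_finder

    # Majority vote among the sources that made a prediction.
    votes = Counter(p for p in (refseq, protisa, gs_finder) if p is not None)
    position, count = votes.most_common(1)[0]
    if count >= 2:
        return position

    # All sources disagree: fall back to the ProTISA site.
    return protisa
```

For example, choose_tis(100, 130, 130) returns 130 (two sources agree), whereas choose_tis(100, 115, 130) returns the ProTISA site, 115.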
Predicting hypothetical protein functions with sequence alignment
After removing the non-coding ORFs and correcting many TISs, the third step was to predict functions for the 'hypothetical proteins'. The sequence alignment tool BLAST [15] was used to search public non-redundant databases. A function was assigned to a 'hypothetical protein' if a definite function occurred more than five times among its aligned homologs, with sequence alignment coverage >60%, sequence identity ≥40% and E value <1e-10. The predicted functions were then searched in NCBI PubMed [16], Swiss-Prot [17] and PDB [18] to find experimentally characterized homologs. If a 'hypothetical protein' had a PDB homolog (or a Swiss-Prot homolog with evidence at protein level) with the same function as that predicted by sequence alignment, then the function of the 'hypothetical protein' and its homolog were listed in DIGAP.
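The assignment criteria above can be expressed as a simple filter-and-count step. The sketch below assumes that BLAST hits have already been parsed into records; the Hit class, the predict_function name and the list of uninformative keywords used to decide what counts as a 'definite' function are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    function: str    # annotated function of the homolog
    coverage: float  # sequence alignment coverage, in percent
    identity: float  # sequence identity, in percent
    evalue: float    # BLAST E value


# Assumed keywords marking a homolog annotation as not a definite function.
UNINFORMATIVE = ("hypothetical", "unknown", "uncharacterized")


def predict_function(hits: list[Hit]) -> Optional[str]:
    """Return a function supported by more than five qualifying homologs."""
    supported = Counter(
        h.function
        for h in hits
        if h.coverage > 60 and h.identity >= 40 and h.evalue < 1e-10
        and not any(word in h.function.lower() for word in UNINFORMATIVE)
    )
    if not supported:
        return None
    function, count = supported.most_common(1)[0]
    return function if count > 5 else None
```

A protein with six qualifying homologs annotated with the same function would be assigned that function, whereas one with only five would remain a 'hypothetical protein'.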
Predicting gene expression levels
The codon adaptation index (CAI) and E(g) are theoretical indices that have been used to predict gene expression levels in prokaryotic genomes [19, 20]. To some extent, the expression level of a gene can indicate the importance of its function, and some highly expressed genes are potential antibiotic targets in plant protection. Detailed methods to calculate CAI and E(g) values are listed in [Additional file 2]. The predicted highly expressed genes were marked with '*' in DIGAP.
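For orientation, a minimal sketch of a CAI calculation in the commonly used form is given below: the geometric mean of the relative adaptiveness of a gene's codons, where relative adaptiveness is measured against a reference set of highly expressed genes. The function names, the choice of reference set and the use of 0.5 as a pseudo-count for codons absent from the reference set are assumptions of this sketch; the procedure actually used for DIGAP is the one in Additional file 2.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str) -> list[str]:
    """Split a coding sequence into complete in-frame codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def relative_adaptiveness(reference_genes: list[str]) -> dict[str, float]:
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(c for g in reference_genes for c in codons(g)
                     if c in CODON_TABLE)
    weights = {}
    for codon, aa in CODON_TABLE.items():
        family_max = max(counts[c] for c, a in CODON_TABLE.items() if a == aa)
        if family_max:
            # Assumed convention: unseen codons count as 0.5 to avoid log(0).
            weights[codon] = max(counts[codon], 0.5) / family_max
    return weights


def cai(gene: str, weights: dict[str, float]) -> float:
    """Geometric mean of codon weights, skipping Met, Trp and stop codons."""
    logs = [math.log(weights[c]) for c in codons(gene)
            if c in weights and CODON_TABLE[c] not in "MW*"]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

A gene that uses only the preferred codon of each amino acid family scores 1.0, and lower values indicate weaker adaptation to the codon usage of the reference set.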
Predicting potential bactericide targets and modeling their 3D structures
So far, hundreds of proteins and nucleic acids have been explored as therapeutic antibacterial targets in humans and animals. Some databases, such as TTD [21] and DrugBank [22], have been constructed to provide information on the known targets in human and animal species. However, no such information has been available for bacterial plant pathogens. We therefore searched for orthologs of the antibacterial targets in TTD and DrugBank, and listed all the potential bactericide targets in DIGAP. For each potential target, the protein sequence from a representative phytopathogen was selected, and homology modeling was employed to construct its 3D structure. First, a similarity search was performed using BLAST against PDB to acquire the template. If there were multiple structural candidates in PDB for a certain protein, the one with an inhibitor and the highest resolution was selected. Then, the 3D structure was constructed with the homology modeling module of the Insight II software. Subsequently, molecular dynamics equilibration was performed to refine the obtained 3D structures with the consistent-valence force field (CVFF) on an SGI Origin 350 server. The models were minimized by 1000 conjugate gradient steps for equilibration and heated from 2 K to 300 K over 35 psec with a temperature increment of 50 K per 5 psec; the constant temperature and pressure algorithm was then applied at 300 K for 200 psec. The velocity Verlet integrator was used with an integration step of 2 fsec. Finally, the feasibility of the modeled structures was evaluated by Verify3D to ensure that all the predicted structures had an acceptable 3D-1D self-compatibility score.
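The timing of the equilibration protocol can be checked with a few lines of arithmetic. The sketch below assumes one possible reading of the heating phase, namely 5-psec stages starting at 2 K and rising by 50 K per stage, with the final stage capped at 300 K; the function names are hypothetical and nothing here reproduces the Insight II setup itself.

```python
def heating_schedule(t_start=2.0, t_end=300.0, increment=50.0, stage_ps=5.0):
    """Return (temperature in K, duration in psec) stages, capped at t_end."""
    stages, temperature = [], t_start
    while temperature < t_end:
        stages.append((temperature, stage_ps))
        temperature += increment
    stages.append((t_end, stage_ps))  # final stage at the target temperature
    return stages


def n_steps(duration_ps, timestep_fs=2.0):
    """Number of integration steps needed to cover duration_ps."""
    return round(duration_ps * 1000.0 / timestep_fs)


stages = heating_schedule()
heating_ps = sum(duration for _, duration in stages)
```

Under this reading there are seven stages (2, 52, 102, 152, 202, 252 and 300 K), which together account for the stated 35 psec. With a 2-fsec step, the heating phase corresponds to 17,500 integration steps and the 200-psec run at 300 K to 100,000 steps.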
Utility and discussion
Table 2 Refined information of the 28 plant pathogens
Columns: Genome; Number of non-coding ORFs; Number (percentage) of refined TISs; Number (percentage) of hypothetical proteins (HPs) assigned with functions; Number (percentage) of predicted highly expressed (PHX) genes; Number of potential drug targets
Thirdly, using the sequence alignment tool BLAST [15], 1.4% ~ 35.3% of the 'hypothetical proteins' were assigned functions in the different phytopathogen genomes (fourth column of Table 2). All the 'hypothetical proteins' assigned functions are marked in red in DIGAP. Most of these proteins have high sequence identity and sequence alignment coverage with their homologs of known function. To further confirm the reliability of the predicted functions, experimentally characterized homologs were searched in Swiss-Prot and PDB. Many PDB homologs were identified that possess the same functions as those predicted for the 'hypothetical proteins'. Some predicted functions have experimentally characterized Swiss-Prot homologs, which are listed in Table VI of the DIGAP 'statistics' section (http://ibi.hzau.edu.cn/digap/statistics.php#6). In total, predicted functions have been assigned to 3683 'hypothetical proteins' in these plant pathogens, and 296 of them have PDB homologs. In addition, more than 600 PubMed references related to the homologs that support the predicted functions are listed in DIGAP.
Finally, 54 potential bactericide targets were identified in these phytopathogens (http://ibi.hzau.edu.cn/digap/targets.php). Of these, 44 occur in more than half of the plant pathogens with relatively high sequence identity (>30%), and might serve as promising broad-spectrum bactericide targets in plant protection. The other 10 potential targets exist only in a few genomes with low sequence similarity, and might be used as species-specific bactericide targets. 3D structures of 45 potential targets were modeled, most of which have high sequence identity with their templates in PDB. Furthermore, 25 template enzymes provide information on active sites and inhibitors, which is highly valuable for new bactericide discovery.
DIGAP is designed to provide improved annotations for the sequenced bacterial phytopathogen genomes, and contains 28 genomes in the current version. With the development of next-generation high-throughput genome sequencing, more bacterial plant pathogen genomes will soon be sequenced, and their improved annotations will be added to DIGAP. The improved annotations will enable a better understanding of the lifestyle, metabolism and pathogenicity of these bacterial plant pathogens at the molecular level, and will provide valuable resources for controlling phytopathogenic diseases.
Availability and requirements
The DIGAP database is freely available through the URL: http://ibi.hzau.edu.cn/digap.
All the refined information can be accessed by manual download.
We thank F. Li and B.-G. Ma for their help in constructing the database, and D.-D. Zhao, W.-H. Zhang, S.-Y. Wang and Y.-X. Wang in preparing the data. The present study was supported by the National Basic Research Program of China (2010CB126100) and the National High Technology Research and Development Program of China (2008AA09Z411).
1. Nielsen P, Krogh A: Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 2005, 21: 4322-4329. 10.1093/bioinformatics/bti701.
2. Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 2001, 17: 425-428. 10.1016/S0168-9525(01)02372-1.
3. Guo FB, Ou HY, Zhang CT: ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res. 2003, 31: 1780-1789. 10.1093/nar/gkg254.
4. Rudd KE: EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 2000, 28: 60-64. 10.1093/nar/28.1.60.
5. Bork P: Powers and pitfalls in sequence analysis: the 70% hurdle. Genome Res. 2000, 10: 398-400. 10.1101/gr.10.4.398.
6. Kolker E, Makarova KS, Shabalina S, Picone AF, Purvine S, Holzman T, Cherny T, Armbruster D, Munson RS, Kolesov G, Frishman D, Galperin MY: Identification and functional analysis of 'hypothetical' genes expressed in Haemophilus influenzae. Nucleic Acids Res. 2004, 32: 2353-2361. 10.1093/nar/gkh555.
7. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. 2008, 36: D25-30. 10.1093/nar/gkm929.
8. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35: D61-65. 10.1093/nar/gkl842.
9. Zhang CT, Zhang R: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res. 1991, 19: 6313-6317. 10.1093/nar/19.22.6313.
10. Chen LL, Zhang CT: Gene recognition from questionable ORFs in bacterial and archaeal genomes. J Biomol Struct Dyn. 2003, 21: 99-110.
11. Zhang CT, Wang J: Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve. Nucleic Acids Res. 2000, 28: 2804-2814. 10.1093/nar/28.14.2804.
12. Gao F, Zhang CT: Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics. 2004, 20: 673-681. 10.1093/bioinformatics/btg467.
13. Hu GQ, Zheng X, Yang YF, Ortet P, She ZS, Zhu H: ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes. Nucleic Acids Res. 2008, 36: D114-119. 10.1093/nar/gkm799.
14. Ou HY, Guo FB, Zhang CT: GS-Finder: a program to find bacterial gene start sites with a self-training method. Int J Biochem Cell Biol. 2004, 36: 535-544. 10.1016/j.biocel.2003.08.013.
15. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
16. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008, 36: D13-21. 10.1093/nar/gkm1000.
17. Bairoch A, Boeckmann B, Ferro S, Gasteiger E: Swiss-Prot: Juggling between evolution and stability. Brief Bioinform. 2004, 5: 39-55. 10.1093/bib/5.1.39.
18. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
19. Sharp PM, Li WH: The Codon Adaptation Index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15: 1281-1295. 10.1093/nar/15.3.1281.
20. Karlin S, Mrázek J, Campbell AM: Codon usages in different gene classes of the Escherichia coli genome. Mol Microbiol. 1998, 29: 1341-1355. 10.1046/j.1365-2958.1998.01008.x.
21. Chen X, Ji ZL, Chen YZ: TTD: Therapeutic Target Database. Nucleic Acids Res. 2002, 30: 412-415. 10.1093/nar/30.1.412.
22. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36: D901-906. 10.1093/nar/gkm958.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.