Database | Open | Published:
DIGAP - a Database of Improved Gene Annotation for Phytopathogens
BMC Genomicsvolume 11, Article number: 54 (2010)
Bacterial plant pathogens are very harmful to their host plants, which can cause devastating agricultural losses in the world. With the development of microbial genome sequencing, many strains of phytopathogens have been sequenced. However, some misannotations exist in these phytopathogen genomes. Our objective is to improve these annotations and store them in a central database DIGAP.
DIGAP includes the following improved information on phytopathogen genomes. (i) All the 'hypothetical proteins' were checked, and non-coding ORFs recognized by the Z curve method were removed. (ii) The translation initiation sites (TISs) of 20% ~ 25% of all the protein-coding genes have been corrected based on the NCBI RefSeq, ProTISA database and an ab initio program, GS-Finder. (iii) Potential functions of about 10% 'hypothetical proteins' have been predicted using sequence alignment tools. (iv) Two theoretical gene expression indices, the codon adaptation index (CAI) and the E(g) index, were calculated to predict the gene expression levels. (v) Potential agricultural bactericide targets and their homology-modeled 3D structures are provided in the database, which is of significance for agricultural antibiotic discovery.
The results in DIGAP provide useful information for understanding the pathogenetic mechanisms of phytopathogens and for finding agricultural bactericides. DIGAP is freely available at http://ibi.hzau.edu.cn/digap/.
Plant pathogenic bacteria are very harmful to their host plants, which can cause devastating agricultural losses in the world. The progress in bacterial genome sequencing project has enabled a better understanding of plant pathogens at the molecular level. Up to the middle of 2009, 28 strains of bacterial phytopathogen genomes have been sequenced, whose names and general annotation information are listed in Table 1. The availability of these phytopathogen genomes provides an unprecedented opportunity for the research of lifestyle and pathogenicity of plant pathogens as well as agricultural bactericide discovery.
However, due to the absence of abundant experimental information, many misannotations still exist in the sequenced bacterial genomes, especially in GC-rich genomes [1–6]. Firstly, many bacterial genomes have false-positive gene identification, i.e., some open-reading frames (ORFs) are incorrectly predicted as protein-coding genes; most of them are short ORFs (<150 bp) without functional information [1–3]. Secondly, many annotated genes have wrong translation initiation sites (TISs). It is indicated that up to 60% of the annotated genes in 143 prokaryotic genomes have wrong TISs in GenBank  or RefSeq , especially in GC-rich genomes . Thirdly, a large number of function-unknown 'hypothetical proteins' are annotated in public databases, which account for 30% ~ 50% in different genomes [5, 6]. These problems are even more serious in phytopathogen genomes because most of them are GC-rich (>50%). Here, we have constructed DIGAP to correct some mistakes and provide improved annotations for these plant pathogens.
Construction and content
The construction of DIGAP was based on the LAMP platform, i.e., an open source operation system Linux http://www.linux.org/, a stable web sever Apache http://www.apache.org, a fast database management system MySQL http://www.mysql.com and a powerful web scripting language PHP/Perl http://www.php.net, http://www.perl.org/. All the phytopathogen genomes were downloaded from NCBI RefSeq , release 33. The flowchart of the database construction is illustrated in Figure 1. Briefly, it contains the following steps.
Finding non-coding ORFs from annotated 'hypothetical ORFs'
The method adopted here was based on the Z curve of DNA sequence , which had been successfully applied to find genes in prokaryotic and some eukaryotic genomes [3, 10–12]. In the present analysis, 21 variables are adopted, which include 9 phase-dependent single nucleotides and 12 phase-independent di-nucleotides. For details see [Additional file 1].
Relocating translation initiation sites
ProTISA is a recently constructed database, which provides experimentally confirmed and theoretically refined TISs for hundreds of prokaryotic genomes . In addition, an ab initio TIS identification program GS-Finder  was employed to refine TISs in these plant pathogens. Joint-jury method was used to make the final decision. If two of the three systems (RefSeq, ProTISA and GS-Finder) had the same TIS, then it was predicted to be the true TIS. ProTISA is a comprehensive resource, which contained conserved domain confirmed (CDC) and high similarity confirmed (HSC) information for TISs . Therefore, if the three systems predicted different TISs, the site provided by ProTISA was adopted. Five phytopathogen genomes Av 4, Cms, Cpa, Xcc 100 and Xoo 99A were not contained in ProTISA, therefore only GS-Finder was used to relocate TISs for the five genomes.
Predicting hypothetical protein functions with sequence alignment
After removing the non-coding ORFs and correcting many TISs, the third step was to predict functions for the 'hypothetical proteins'. The sequence alignment tool BLAST  was used to search public non-redundant databases. Function was predicted to a 'hypothetical protein' if the aligned homologs had definite function which occurred more than five times with sequence alignment coverage >60%, sequence identity ≥40% and E value <1e-10. Then the predicted functions were searched in NCBI PubMed , Swiss-Prot  and PDB  to find experimentally characterized homologs. If a 'hypothetical protein' had PDB (or Swiss-Prot) homologs with the same function as predicted by sequence alignment, then the function of the 'hypothetical protein' and its PDB (or Swiss-Prot entry with evidence at protein level) homolog was listed in DIGAP.
Predicting gene expression levels
Codon adaptation index (CAI) and E(g) are theoretical indices which were used to predict gene expression levels in prokaryotic genomes [19, 20]. To some extent the expression level of a gene can indicate the importance of its function. Some highly expressed genes are potential antibiotic targets in plant protection. Detailed methods to calculate CAI and E(g) values are listed in [Additional file 2]. The predicted highly expressed genes were marked with '*' in DIGAP.
Predicting potential bactericide targets and modeling their 3D structures
So far, hundreds of proteins and nucleic acids have been explored as therapeutic antibacterial targets in human and animals. Some databases, such as TTD  and DrugBank , have been constructed to provide information for the known targets in human and animal species. However, no such information is available for bacterial plant pathogens up to now. So we searched the orthologs of antibacterial targets in TTD and DrugBank, and listed all the potential bactericide targets in DIGAP. For each potential target, the protein sequence from a representative phytopathogen was selected, and homology modeling was employed to construct its 3D structure. First, similarity search was performed using BLAST against PDB to acquire the template. If there were multiple structural candidates in PDB for a certain protein, the one with inhibitor and the highest resolution was selected. Then, the 3D structure was constructed by employing the homology modeling module of Insight II software. Subsequently, molecular dynamics equilibration was performed to refine the obtained 3D structures with the consistent-valence force field (CVFF) on a SGI Origin 350 server. The models were minimized by 1000 conjugate gradient steps for equilibration, heated from 2 K to 300 K during 35 psec at temperature increment of 50 K per 5 psec, then the constant temperature and pressure algorithm was applied at 300 K for 200 psec. The velocity verlet integrator was used with an integration step of 2 fsec. Finally, the feasibility of modeled structures was evaluated by Verify3D to ensure that all the predicted structures had an acceptable 3D-1D self-compatibility score.
Utility and discussion
General results of the improved annotations are listed in Table 2. Firstly, all the "hypothetical proteins" in the original RefSeq annotation are re-analyzed by using the Z curve method . About 1% ~ 3% of the 'hypothetical proteins' were recognized as non-coding ORFs in each phytopathogen genome, and are listed in the second column of Table 2. Differences between coding and non-coding sequences (positive and negative samples) can be intuitively viewed from principle component analysis (PCA). Figure 2 shows the distribution of points on the principal plane spanned by the first two principal components for At 58. The red circles denote the function-known genes, and the blue triangles denote the corresponding shuffled sequences. The recognized non-coding ORFs are represented by black stars, which distribute far from the core of the function-known genes, and close to random sequences. Figures for other plant pathogens are in the 'documents' section of the website http://ibi.hzau.edu.cn/digap/document.php?page=3. The average length of recognized non-coding ORFs is much shorter than that of the function-known genes (Table V in 'statistics' section of the website, http://ibi.hzau.edu.cn/digap/statistics.php#5). All the evidence supports that the recognized non-coding ORFs are very unlikely to encode proteins. Protein identification (PID) numbers for these non-coding ORFs are listed in Table IV in the 'statistics' section of the website http://ibi.hzau.edu.cn/digap/statistics.php#4.
Secondly, a large number of TISs were relocated, and the number and percentage for each genome is listed in the third column of Table 2. The relocated TISs are provided in the 'shift' column of the 'basic information' in DIGAP. Positive and negative numbers indicate the 3'-downstream and 5'-upstream shift of the original TISs, respectively. Most corrected TISs are both predicted by ProTISA and GS-Finder, and many of them have 5' conserved domain confirmed (CDC) and high similarity confirmed (HSC) information . In total, 0.3% ~ 49.3% TISs were relocated in different phytopathogen genomes. As an example, Figure 3 (a) and 3 (b) show the statistical caky chart and histogram of relocated TISs in At 58. It can be observed that 11.6% (11.9%) of TISs are relocated to the 5'-upstream (3'-downstream) region. Furthermore, the distribution pattern of shifted distances is similar to a normal distribution. The statistical caky charts and histograms for other plant pathogens are shown in the 'documents' section of the website http://ibi.hzau.edu.cn/digap/document.php.
Thirdly, using sequence alignment tools BLAST , 1.4% ~ 35.3% of the 'hypothetical proteins' were assigned with functions in different phytopathogen genomes (fourth column of Table 2). All the 'hypothetical proteins' assigned with functions are marked in red in the DIGAP. Most of these proteins have high sequence identity and sequence alignment coverage to their homologs with known functions. To further confirm the reliability of the predicted functions, experimentally characterized homologs were searched in Swiss-Prot and PDB. Many PDB homologs have been identified, which possess the same functions as the predicted functions for 'hypothetical proteins'. Furthermore, PubMed references for the predicted functions of hundreds of homologs of 'hypothetical proteins' are listed in DIGAP. Some predicted functions have experimentally characterized Swiss-Prot homologs, which are listed in Table VI of DIGAP 'statistics' section http://ibi.hzau.edu.cn/digap/statistics.php#6. In total, predicted functions have been assigned to 3683 'hypothetical proteins' in these plant pathogens, and 296 of them have PDB homologs. In addition, more than 600 related references of homologs for the predicted functions are listed in DIGAP.
Finally, 54 potential bactericide targets were identified in these phytopathogens, http://ibi.hzau.edu.cn/digap/targets.php, of which 44 potential targets exist commonly in more than half of the plant pathogens with relatively high sequence identity (>30%), and might serve as promising broad-spectrum bactericide targets in plant protection. The other 10 potential targets exist only in a few genomes with low sequence similarity, which might be used as species-specific bactericide targets. 3D structures of 45 potential targets were modeled, most of which have high sequence identity with their templates in PDB. Furthermore, 25 template enzymes can provide the information of active sites and inhibitors, which are highly valuable for new bactericide discovery.
DIGAP is supported with a user-friendly designed web interface, so that users can easily get the desired information at any time. Figure 4(a) ~ (d) show some frequently used webpage. As shown in Figure 4(a), users can make a quick search by using gene name, DIGAP_ID, PID and gene function. Figure 4(b) illustrates an example of a phytopathogen annotation, the 'hypothetical proteins' assigned with functions are marked in red in the database. Users can click DIGAP_ID to obtain the detailed annotation information. Figure 4(c) shows the BLAST search webpage. Users can query nucleotide or protein sequences, and the BLAST generates a list of hits which are organized according to the sequence identity between query and object sequences. Figure 4(d) exhibits the potential bactericide targets, which includes the information of PDB template, inhibitor and modeled structure.
DIGAP is designed to provide improved annotations for the sequenced bacterial phytopathogen genomes, and contains 28 genomes in the current version. With the development of next-generation high-throughput genome sequencing, more bacterial plant pathogen genomes will soon be sequenced, and their improved annotations will be added to DIGAP. The improved annotations will enable a better understanding of lifestyle, metabolism and pathogenicity of these bacterial plant pathogens at molecular level, and will provide valuable resources for controlling phytopathogenic diseases.
Availability and requirements
The DIGAP database is freely available through the URL: http://ibi.hzau.edu.cn/digap.
All the refined information can be accessed by manual download.
Nielsen P, Krogh A: Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 2005, 21: 4322-4329. 10.1093/bioinformatics/bti701.
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 2001, 17: 425-428. 10.1016/S0168-9525(01)02372-1.
Guo FB, Ou HY, Zhang CT: ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res. 2003, 31: 1780-1789. 10.1093/nar/gkg254.
Rudd KE: EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 2000, 28: 60-64. 10.1093/nar/28.1.60.
Bork P: Powers and pitfalls in sequence analysis: the 70% hurdle. Genome Res. 2000, 10: 398-400. 10.1101/gr.10.4.398.
Kolker E, Makarova KS, Shabalina S, Picone AF, Purvine S, Holzman T, Cherny T, Armbruster D, Munson RS, Kolesov G, Frishman D, Galperin MY: Identification and functional analysis of 'hypothetical' genes expressed in Haemophilus influenzae. Nucleic Acids Res. 2004, 32: 2353-2361. 10.1093/nar/gkh555.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. 2008, 36: D25-30. 10.1093/nar/gkm929.
Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2008, 35: D61-65. 10.1093/nar/gkl842.
Zhang CT, Zhang R: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res. 1991, 19: 6313-6317. 10.1093/nar/19.22.6313.
Chen LL, Zhang CT: Gene recognition from questionable ORFs in bacterial and archaeal genomes. J Biomol Struct Dyn. 2003, 21: 99-110.
Zhang CT, Wang J: Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve. Nucleic Acids Res. 2000, 28: 2804-2814. 10.1093/nar/28.14.2804.
Gao F, Zhang CT: Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics. 2004, 20: 673-681. 10.1093/bioinformatics/btg467.
Hu GQ, Zheng X, Yang YF, Ortet P, She ZS, Zhu H: ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes. Nucleic Acids Res. 2008, 36: D114-119. 10.1093/nar/gkm799.
Ou HY, Guo FB, Zhang CT: GS-Finder: a program to find bacterial gene start sites with a self-training method. Int J Biochem Cell Biol. 2004, 36: 535-544. 10.1016/j.biocel.2003.08.013.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008, 36: D13-21. 10.1093/nar/gkm1000.
Bairoch A, Boeckmann B, Ferro S, Gasteiger E: Swiss-Prot: Juggling between evolution and stability. Brief Bioinform. 2004, 5: 39-55. 10.1093/bib/5.1.39.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 1235-1242. 10.1093/nar/28.1.235.
Sharp PM, Li WH: The Codon Adaptation Index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15: 1281-1295. 10.1093/nar/15.3.1281.
Karlin S, Mrázek J, Campbell AM: Codon usages in different gene classes of the Escherichia coli genome. Mol Microbiol. 1998, 29: 1341-1355. 10.1046/j.1365-2958.1998.01008.x.
Chen X, Ji ZL, Chen YZ: TTD: Therapeutic Target Database. Nucleic Acids Res. 2002, 30: 412-415. 10.1093/nar/30.1.412.
Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36: D901-906. 10.1093/nar/gkm958.
We thank F. Li and B.-G. Ma for their help in constructing the database, and D.-D. Zhao, W.-H. Zhang, S.-Y. Wang and Y.-X. Wang in preparing the data. The present study was supported by the National Basic Research Program of China (2010CB126100) and the National High Technology Research and Development Program of China (2008AA09Z411).
L-LC designed the database, NG and WW established the database. NG, H-FJ, J-WC, BG and LZ collected the data and performed the calculation. All authors analyzed the data. L-LC, S-CZ. and H-YZ wrote the paper. All authors read and approved the final manuscript.