Construction
DIGAP was built on the LAMP platform, i.e., the open-source operating system Linux http://www.linux.org/, the stable web server Apache http://www.apache.org, the fast database management system MySQL http://www.mysql.com and the web scripting languages PHP and Perl http://www.php.net, http://www.perl.org/. All phytopathogen genomes were downloaded from NCBI RefSeq [8], release 33. The flowchart of the database construction is illustrated in Figure 1. Briefly, construction comprised the following steps.
Content
Finding non-coding ORFs from annotated 'hypothetical ORFs'
The method adopted here is based on the Z curve representation of DNA sequences [9], which has been successfully applied to gene finding in prokaryotic and some eukaryotic genomes [3, 10–12]. In the present analysis, 21 variables were used: 9 phase-dependent single-nucleotide variables and 12 phase-independent dinucleotide variables. For details see [Additional file 1].
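As an illustration of how such a 21-variable feature vector can be computed for one ORF, a minimal Python sketch follows. It assumes the commonly used Z-curve transform (purine/pyrimidine, amino/keto and weak/strong hydrogen bonding components) and assumes that dinucleotide frequencies are normalized by the total number of overlapping dinucleotides; the function names are hypothetical, the authoritative definitions are those in [9] and [Additional file 1], and the step that discriminates coding from non-coding ORFs is not shown.

```python
BASES = "ACGT"


def _xyz(freq):
    """Map four base frequencies to the three Z-curve components."""
    a, c, g, t = (freq[b] for b in BASES)
    return [
        (a + g) - (c + t),  # x: purine vs. pyrimidine
        (a + c) - (g + t),  # y: amino vs. keto
        (a + t) - (g + c),  # z: weak vs. strong hydrogen bonding
    ]


def zcurve_features(orf):
    """Return 21 Z-curve variables for an ORF given as a DNA string."""
    orf = orf.upper()
    features = []

    # 9 phase-dependent variables: base frequencies at codon positions 1, 2, 3.
    for phase in range(3):
        bases = orf[phase::3]
        n = len(bases) or 1  # avoid division by zero on empty input
        freq = {b: bases.count(b) / n for b in BASES}
        features.extend(_xyz(freq))

    # 12 phase-independent variables: for each first base X, transform the
    # frequencies of the dinucleotides XA, XC, XG and XT (assumed normalization:
    # counts divided by the total number of overlapping dinucleotides).
    dinucs = [orf[i:i + 2] for i in range(len(orf) - 1)]
    total = len(dinucs) or 1
    for first in BASES:
        freq = {second: dinucs.count(first + second) / total for second in BASES}
        features.extend(_xyz(freq))

    return features
```

Calling `zcurve_features` on any ORF returns a list of 21 floats, 9 from the codon-position frequencies followed by 12 from the dinucleotide frequencies, which could then be passed to a classifier.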
Relocating translation initiation sites
ProTISA is a recently constructed database that provides experimentally confirmed and theoretically refined TISs for hundreds of prokaryotic genomes [13]. In addition, the ab initio TIS identification program GS-Finder [14] was employed to refine TISs in these plant pathogens. A joint-jury method was used to make the final decision. If at least two of the three systems (RefSeq, ProTISA and GS-Finder) gave the same TIS, it was predicted to be the true TIS. ProTISA is a comprehensive resource that contains conserved domain confirmed (CDC) and high similarity confirmed (HSC) information for TISs [13]. Therefore, if the three systems predicted three different TISs, the site provided by ProTISA was adopted. Five phytopathogen genomes (Av 4, Cms, Cpa, Xcc 100 and Xoo 99A) were not contained in ProTISA; for these five genomes, only GS-Finder was used to relocate TISs.
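The joint-jury decision rule described above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the code used to build DIGAP; the function name is hypothetical, and representing a TIS as an integer coordinate and a missing ProTISA entry as `None` are assumptions.

```python
from collections import Counter


def joint_jury_tis(refseq, protisa, gs_finder):
    """Choose a TIS coordinate from three predictions for the same gene.

    `protisa` is None for genomes not covered by ProTISA (assumed encoding);
    in that case the GS-Finder prediction is used on its own.
    """
    if protisa is None:
        return gs_finder

    # Majority vote: a site supported by at least two systems wins.
    position, votes = Counter([refseq, protisa, gs_finder]).most_common(1)[0]
    if votes >= 2:
        return position

    # All three systems disagree: fall back to the ProTISA site.
    return protisa
```

For example, `joint_jury_tis(100, 100, 130)` returns 100 (two systems agree), while `joint_jury_tis(100, 115, 130)` returns 115 (no agreement, so the ProTISA site is kept).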
Predicting hypothetical protein functions with sequence alignment
After removing the non-coding ORFs and correcting many TISs, the third step was to predict functions for the 'hypothetical proteins'. The sequence alignment tool BLAST [15] was used to search public non-redundant databases. A function was assigned to a 'hypothetical protein' if its aligned homologs had a definite function that occurred more than five times among hits with sequence alignment coverage >60%, sequence identity ≥40% and E value <1e-10. The predicted functions were then searched in NCBI PubMed [16], Swiss-Prot [17] and PDB [18] to find experimentally characterized homologs. If a 'hypothetical protein' had a PDB (or Swiss-Prot) homolog with the same function as that predicted by sequence alignment, then the predicted function and the corresponding PDB homolog (or Swiss-Prot entry with evidence at protein level) were listed in DIGAP.
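The thresholds above translate directly into a filter over parsed BLAST hits. The following Python sketch is illustrative only: the `Hit` record, the function name and the keyword list used to recognize hits without a definite function are assumptions, and parsing of the BLAST output itself is not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    function: str    # annotated function of the homolog
    coverage: float  # alignment coverage, in percent
    identity: float  # sequence identity, in percent
    evalue: float


# Assumed keywords marking a hit as having no definite function.
UNINFORMATIVE = ("hypothetical", "unknown", "uncharacterized")


def predict_function(hits):
    """Return the predicted function, or None if no function qualifies."""
    counts = Counter(
        h.function
        for h in hits
        if h.coverage > 60
        and h.identity >= 40
        and h.evalue < 1e-10
        and not any(word in h.function.lower() for word in UNINFORMATIVE)
    )
    if not counts:
        return None
    function, n = counts.most_common(1)[0]
    # The function must occur more than five times among the qualifying hits.
    return function if n > 5 else None
```

With six qualifying hits annotated with the same function the sketch returns that function; with five, or with hits below any threshold, it returns `None`.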
Predicting gene expression levels
The codon adaptation index (CAI) and E(g) are theoretical indices that have been used to predict gene expression levels in prokaryotic genomes [19, 20]. To some extent, the expression level of a gene can indicate the importance of its function, and some highly expressed genes are potential antibiotic targets in plant protection. Detailed methods for calculating CAI and E(g) values are given in [Additional file 2]. The predicted highly expressed genes are marked with '*' in DIGAP.
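To make the CAI concrete, the sketch below computes it in its widely used form: each codon receives a relative adaptiveness weight (its count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of these weights. The choice of reference set, the pseudocount for unobserved codons and the function names are assumptions; the procedure actually used for DIGAP is the one in [Additional file 2], and E(g) is not covered here.

```python
import math
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
_B = "TCAG"
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AA[16 * i + 4 * j + k]
    for i, a in enumerate(_B)
    for j, b in enumerate(_B)
    for k, c in enumerate(_B)
}


def _codons(gene):
    """Split an in-frame coding sequence into complete codons."""
    gene = gene.upper()
    return [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Compute codon weights from a reference set of highly expressed genes."""
    counts = Counter()
    for gene in reference_genes:
        counts.update(_codons(gene))

    weights = {}
    for aa in set(CODON_TABLE.values()) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        if len(family) == 1:
            continue  # Met and Trp carry no codon-choice information
        # Assumed handling of unobserved codons: replace zero by a pseudocount.
        adjusted = {c: counts[c] or pseudocount for c in family}
        best = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / best
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of the scored codons in `gene`."""
    logs = [math.log(weights[c]) for c in _codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

A gene that uses only the preferred codon of each amino acid scores 1.0, and the score falls toward 0 as rarer synonymous codons are used.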
Predicting potential bactericide targets and modeling their 3D structures
So far, hundreds of proteins and nucleic acids have been explored as therapeutic antibacterial targets in humans and animals. Some databases, such as TTD [21] and DrugBank [22], have been constructed to provide information on the known targets in human and animal species. However, no such information has been available for bacterial plant pathogens. We therefore searched for orthologs of the antibacterial targets in TTD and DrugBank, and listed all the potential bactericide targets in DIGAP. For each potential target, the protein sequence from a representative phytopathogen was selected, and homology modeling was employed to construct its 3D structure. First, a similarity search was performed using BLAST against PDB to acquire the template. If there were multiple structural candidates in PDB for a given protein, the one with a bound inhibitor and the highest resolution was selected. Then, the 3D structure was constructed with the homology modeling module of the Insight II software. Subsequently, molecular dynamics equilibration was performed to refine the obtained 3D structures with the consistent-valence force field (CVFF) on an SGI Origin 350 server. The models were minimized by 1000 conjugate gradient steps, heated from 2 K to 300 K over 35 psec at a temperature increment of 50 K per 5 psec, and then simulated with the constant temperature and pressure algorithm at 300 K for 200 psec. The velocity Verlet integrator was used with an integration step of 2 fsec. Finally, the feasibility of the modeled structures was evaluated by Verify3D to ensure that all the predicted structures had an acceptable 3D-1D self-compatibility score.
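The template selection rule (an inhibitor-bound structure with the highest resolution) can be expressed as a simple ordering over candidate structures. The Python sketch below is one possible reading: the `Template` record and function name are hypothetical, resolution is given in angstroms so that a smaller value means a higher resolution, and the fallback to the best-resolved structure when no candidate carries an inhibitor is an assumption not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Template:
    pdb_id: str
    resolution: float    # in angstroms; a lower value is a higher resolution
    has_inhibitor: bool  # True if the structure contains a bound inhibitor


def select_template(candidates):
    """Pick the preferred template from BLAST hits against PDB.

    Inhibitor-bound structures are ranked first; ties are broken by the
    smallest resolution value. If no candidate has an inhibitor, the
    best-resolved structure is returned (assumed fallback).
    """
    if not candidates:
        return None
    return min(candidates, key=lambda t: (not t.has_inhibitor, t.resolution))
```

Sorting on the pair `(not has_inhibitor, resolution)` keeps both criteria in a single key, so an inhibitor-bound structure at 2.1 angstroms is preferred over an unliganded one at 1.5 angstroms.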