Consistency of gene starts among Burkholderia genomes
© Dunbar et al; licensee BioMed Central Ltd. 2011
Received: 27 August 2010
Accepted: 22 February 2011
Published: 22 February 2011
Evolutionary divergence in the position of the translational start site among orthologous genes can have significant functional impacts. Divergence can alter the translation rate, degradation rate, subcellular location, and function of the encoded proteins.
Existing Genbank gene maps for Burkholderia genomes suggest that extensive divergence has occurred--53% of ortholog sets based on Genbank gene maps had inconsistent gene start sites. However, most of these inconsistencies appear to be gene-calling errors. Evolutionary divergence was the most plausible explanation for only 17% of the ortholog sets. Correcting probable errors in the Genbank gene maps decreased the percentage of ortholog sets with inconsistent starts by 68%, increased the percentage of ortholog sets with extractable upstream intergenic regions by 32%, increased the sequence similarity of intergenic regions and predicted proteins, and increased the number of proteins with identifiable signal peptides.
Our findings highlight an emerging problem in comparative genomics: single-digit percent errors in gene predictions can lead to double-digit percentages of inconsistent ortholog sets. The work demonstrates a simple approach to evaluate and improve the quality of gene maps.
Identification of gene boundaries--the first step in genome annotation--provides the foundation for subsequent comparative genomics. Unfortunately, errors occur. When gene-coding regions are identified, one of a multitude of possible translational start sites must be selected. Gene-finding algorithms such as Glimmer , Genemark  and Prodigal  score each possible start site based on multiple features (e.g. start codon identity and upstream ribosome binding site), but the highest scoring site is not always the true site used in vivo. For example Glimmer 3.02 , Genemark 2.6  and Prodigal 1.20  predict incorrect start sites for 9%, 5.5%, and 3.5% of 884 Escherichia coli genes with experimentally validated gene starts . Genome-specific features such as %GC content can substantially reduce the performance of gene prediction algorithms . Even when the accuracy per genome is high, the aggregation of errors among groups of genomes can produce a large fraction of flawed results and significantly undermine comparative analyses.
Erroneously chosen start sites have a dual impact. The N-terminus of the encoded peptide sequence and the length of the upstream intergenic region are both altered. These errors undermine subsequent informatics such as the similarity of orthologous genes and regulatory regions, predicted operon structure, and the prediction of regulatory motifs. In extreme cases, incorrect gene start sites abolish intergenic spaces, sometimes resulting in spurious gene overlaps. Nearly a thousand examples of spurious gene overlaps were documented in 338 bacterial genomes, of which Burkholderia thailandensis E264 was the worst case . It is likely that a much greater number of less conspicuous errors exist. For example in a preliminary study, we noted that 47% of 116 pairs of orthologous transcription factors from B. thailandensis and B. pseudomallei had inconsistent start sites, despite the high average amino acid identity (93%) between the pairs. We also noted numerous inconsistencies in gene start sites for orthologous genes between three strains of Burkholderia pseudomallei despite the fact that the genome sequences were annotated by the same genome center and released within an 18-month period between 2005 and 2007. Thus, errors in gene start sites might be more common than suggested by spurious overlaps.
To assess the potential extent and impact of gene-calling errors, we performed a genome-wide analysis of orthologs across the Burkholderia genus. We focused on five species representing major clades within the genus. The most distantly related species have 97% 16S rRNA sequence similarity and a median amino acid identity of 71.4% between orthologous proteins. Potential gene-calling errors were identified by inconsistency in gene start positions among orthologs. Inconsistencies will, of course, represent a mixture of true biological variation and gene-calling errors. The two possibilities can be conclusively distinguished by genome-wide experimental validation [5–7]. However, since experimental validation is not currently practical for most genome sequences, conservative heuristics are needed that can parse inconsistencies into probable errors versus plausible evolutionary divergence. Toward this end, we examined the nature of inconsistencies in gene start sites among Burkholderia orthologs. Our findings suggest that probable errors can be distinguished. This distinction enables substantial improvements in gene maps, and identification of plausible biological variation with possible functional consequences.
Consistency among orthologs
In the original Genbank records, the average number of genes per genome for the five species used in this study was 6999. From these, 2681 ortholog sets were identified containing a gene from each genome. DNA sequence alignments showed that only 47% of the sets had consistent (i.e. aligned) start sites. Given that this level of inconsistency might arise from gene-calling errors, we implemented a comparative genomics approach to assess whether consistency could be achieved. For this approach, we required a list of all possible start sites for each gene, generated in a consistent manner for the five target genomes. We also needed quality scores for each possible start site. The Prodigal gene-calling algorithm conveniently provided these data and had the added advantages of being self-tuning, easy to implement, and specifically designed for improved performance with GC-rich genomes such as the 68% GC Burkholderia genomes .
Prodigal predicted an average of 7026 genes per genome. A total of 2801 Prodigal ortholog sets were identified; 65% had consistent start sites. The same percentage was obtained when restricting the analysis to the 2659 ortholog sets containing equivalent Genbank and Prodigal genes. The substantial improvement in ortholog consistency arising simply from use of a different gene-calling algorithm illustrates the imperfection of gene calling and motivates the notion that inconsistent gene starts among closely related orthologs may often represent gene-calling errors.
Corrected gene starts by genome and number of corrections per ortholog set
Revisions per ortholog set
To gain further insight, we revised the gene start sites for groups 1 to 5 to impose consistency, and then examined how the revisions altered the lengths, similarities, and subcellular locations of encoded proteins, as well as changes in the number and similarity of upstream intergenic regions as described below.
Protein sequence identity for ortholog sets before and after modification of gene start sites
Revisions per ortholog set
Eliminating probable gene-calling errors in the Genbank gene maps increased the number of ortholog sets for which a complete set of upstream IGRs could be obtained. A complete set was defined as a set containing five IGR sequences, one from every genome. Inconsistent gene start sites sometimes abolish an intergenic region in one or more genomes, yielding incomplete sets. For the Genbank ortholog sets, 70.0% had complete sets of upstream IGRs, whereas 83.9% (2351 of 2801) of the Prodigal ortholog sets had extractable IGRs. Additional revisions for the 945 Prodigal orthologs with inconsistent start sites added only 19 complete IGR sets (adopting corrections in groups 1 to 4), boosting the total to 84.6%. If group 5 revisions were included, 106 additional IGR sets would occur. In total, correcting probable gene start-site errors in the Genbank gene maps yielded a 21% improvement in the fraction of ortholog sets with extractable upstream IGRs. Thus, refining gene maps can have a dramatic impact on regulatory genomics, which depends on the capacity to extract and compare IGRs upstream of orthologous genes.
If inconsistent gene start sites reflect true divergence rather than gene calling error, evolutionary divergence might occur more broadly in the local gene neighborhood. To explore this, we determined if the adjacent gene upstream of ortholog sets in groups 1 to 5 was consistent (i.e. orthologous) across genomes. Consistency indicates conserved gene blocks. Inconsistencies imply recombination events that alter the gene neighborhood. This analysis did not reveal any significant trend. About 29% of the 1807 Prodigal ortholog sets with aligned starts had inconsistent upstream genes. For groups 1 to 5, the percentage of ortholog sets with a non-conserved gene neighborhood ranged from 21 to 44%. The only notable finding was that 71% of the 49 ortholog sets that had no common start sites showed divergence in the upstream gene neighborhood, substantially higher than the percentage for all other groups of ortholog sets. A qualitatively similar outcome was obtained when using upstream and downstream genes jointly to define conserved gene blocks, instead of only the upstream gene.
For twenty ortholog sets sampled from groups 1 to 5, we manually assessed whether the revised gene start sites were compatible with the beginning of a protein-coding sequence as indicated by GC skew profiles [8, 9]. In organisms with high GC content, the wobble position in coding sequence tends to exhibit an elevated GC bias . This bias can reveal potential coding sequences and correct reading frames . In ideal cases, the GC skew clearly increases as coding sequence begins, which can indicate the general area where a gene start site is likely to occur. Owing to lack of automation, we examined only the first twenty ortholog sets in [Additional file 1]. The sets included the first twenty orthologous genes from B. thailandensis chromosome I (Bth_I0003 to Bth_I0088). For the eight sets representing groups 1 to 4, the revised gene start sites were as good or better than the original gene start sites in terms of compatibility with the GC skew profile (data not shown). In contrast, revised start sites for the twelve ortholog sets from group 5 appeared to interrupt protein-coding sequence (data not shown). These results provide further evidence that the ortholog sets in group 5 represent true evolutionary divergence in the location of gene start sites.
Small error rates in gene calling can have a large aggregate effect on comparative genomics. For five species spanning the genus Burkholderia, we found Genbank records contained substantial inconsistencies in gene start sites for predicted orthologs, most of which appeared to represent gene-calling error. Burkholderia thailandensis, B. pseudomallei, B. ambifaria, B. vietnamiensis, and B. xenovorans had 2681 genes in common. Of these ortholog sets, about 53% had inconsistent gene start sites (Figure 1). The percentage was reduced to 35% simply by using Prodigal, a different gene-finding algorithm designed for better performance with GC-rich genomes like the 68% GC Burkholderia genomes . Using comparative genomics, the inconsistency of gene starts among orthologs could be credibly reduced to 17 to 25%. Our findings illustrate the aggregate impact of probable gene-calling errors and demonstrate a facile approach to distinguish probable error from evolutionary divergence.
To correct probable errors in the Genbank gene maps, we combined the use of a different gene-calling algorithm and a comparative genomics approach. Sole use of comparative genomics is possible. However, a list of alternative start sites and their quality scores provides a quantitative foundation for choosing between possible common start sites. The Prodigal software conveniently provides a list of alternative starts and their quality scores. Fortuitously, Prodigal also made substantial improvements to the Burkholderia gene maps, reducing the number of revisions that would otherwise have been made by comparative genomics alone. Uniform application of any gene finding algorithm to a set of genome sequences seems likely to improve consistency. However, substantial differences in the performance of gene-calling algorithms do exist. For example, the average error rates (i.e. incorrect gene start site predictions) for Glimmer3 and Prodigal with GC-rich genomes are 9.3 and 3.9%, respectively . Our findings are consistent with the reported error rates.
The 53% inconsistency rate we observed for Burkholderia ortholog sets from the original Genbank gene maps implies an intrinsic gene-calling error rate of about 14% (= 1-(1-.53)1/5)) per Burkholderia genome. This rate is close to the 13.1% Glimmer3 error rate reported for the 68% GC-rich Halobacterium salinarum genome . A similar calculation using the 35% inconsistency among ortholog sets from Prodigal gene maps for the five Burkholderia species yields a Prodigal error rate of 8.3% per Burkholderia genome. The reported error rate for Prodigal with GC-rich genomes (greater than 65% GC) ranges from 3.1 to 6.4%. Thus, a substantial fraction of the 945 Prodigal ortholog sets with inconsistent gene starts is likely to represent gene-calling error. It is not surprising that gene finding algorithms make errors. At present, identification of gene start sites includes a probabilistic component. Because gene start sites do not have a definitive sequence signature, true start sites will sometimes be obscured by noise. This underscores the value of a comparative genomics approach for post-processing predicted start sites. A noisy signal in one genome can be counterbalanced by clearer signals in other genomes.
Revision of probable gene start site errors tended to shorten the genes. Sequence alignments (e.g., Figure 2) routinely showed that the truncated "leader" sequence of a revised gene was not a unique insert, but instead corresponded to the upstream intergenic region of the other orthologs. If the original inconsistent start sites were correct, it would imply a complicated evolutionary scenario: conversion of upstream intergenic sequence into coding sequence, in many cases subsuming promoters and transcription factor binding sites and possibly requiring evolution of new regulatory elements further upstream. The most appealing null model for closely related species is the simplest one--divergence of orthologous genes is minimal, unless compelling evidence suggests otherwise. Improving the consistency of gene start sites satisfies this null model. Our revisions also improved the calculated similarity of the upstream intergenic regions, encoded proteins sequences, and the detection of signal peptides. As these observations are not truly orthogonal, experimental validation is ultimately needed to confirm that the revised gene start sites are the correct sites in vivo.
Real evolutionary divergence of gene starts can have important functional consequences. Gene expression may change if altered gene start sites require evolution of new upstream transcriptional regulatory motifs. Protein translation rates may change if the -4 to +37 region around the new start site has a different mRNA folding propensity . Changing the N-terminus of the encoded protein can also alter the rate of protein degradation  or the protein's subcellular location . It is reasonable to expect that evolutionary divergence of gene start sites would occur in some metabolic functional categories more than others. Thus, for ortholog sets in group 5--the group most likely representing evolutionary divergence--one might expect enrichment of specific COG categories. However, we found no statistically significant enrichment of COG categories in group 5 or in groups 1 to 4. The lack of enrichment of COG categories showed that neither gene-calling error nor evolutionary divergence appeared biased toward a particular type of gene or annotation.
It is important to distinguish evolutionary divergence from gene-calling errors. Gene-calling errors substantially degrade the integrity of comparative genomics, which increasingly serves as a foundation for biology and medicine. Our results demonstrate that plausible cases of evolutionary divergence can be distinguished from probable gene calling errors. This simple approach facilitates analyses of functional consequences and changes in cellular behavior that may arise from true divergence. Expansion of this comparative genomics approach can significantly improve gene maps.
Completed genomes were used for the following five species spanning the genus Burkholderia: B. thailandensis E264, B. pseudomallei 1710b, B. ambifaria MC40-6, B. vietnamiensis G4, and B. xenovorans LB400. For each organism, the default gene maps from NCBI RefSeq were obtained in October 2008 from the following files: NC_007650.ptt, NC_007651.ptt, NC_007434.ptt, NC_007435.ptt, NC_010551.ptt, NC_010552.ptt, NC_010557.ptt, NC_009254.ptt, NC_009255.ptt, NC_009256.ptt, NC_007951.ptt, NC_007952.ptt, NC_007953.ptt.
An all-versus-all BLASTP identity matrix was constructed using the following normalized sequence identity metric: (BLAST percent identity)*(BLAST alignment length)/max(Query_length, Target_length). An ortholog set was defined as the set of pan-reciprocal best BLASTP matches containing a single gene from each of the five genomes. Thus, ortholog sets containing paralogs from a genome with a tie for best match were excluded from further analysis. The standard default E-value cutoff of 10 was used for BLASTP matches. The worst BLASTP E-values (ranging from 8.5 to 10-5) occurred with 32 orthologs sets representing very short proteins.
Prodigal gene predictions
New gene predictions for each genome were obtained using the Prodigal bacterial gene prediction software , version 1.10. In addition to the selected start and stop sites for each gene, we obtained from Prodigal the list of all potential start positions and their respective quality scores. Most genes were identical to the pre-computed Prodigal gene predictions available through Genbank; however, 5% were different. We attribute the differences to two factors: (1) the version of Prodigal used by Genbank (1.20 rather than 1.10), and (2) the way Prodigal was run. In our case, for each strain we ran a separate Prodigal "training" step using a concatenation of the fasta files for all chromosomes to ensure consistent gene calling across the entire genome.
The default genes predicted for each genome in Genbank and Prodigal-predicted genes were deemed equivalent if they shared the same stop position.
For each ortholog set, a multiple sequence alignment was performed using MUSCLE  version 3.7 on extended DNA sequences, beginning 250 bases upstream of the first potential start position and finishing at the end of the coding sequence.
Selection of new gene starts
Positions within the alignment where all genomes shared a possible start position were identified. If all five Prodigal-selected start sites were aligned, no change was made to any of the gene start sites. Otherwise, shared start positions within the alignment were ranked by the average (across genomes) Prodigal quality score for each start site. The highest-ranked aligned start site was then chosen as the new start site for the set of orthologs.
Comparison of Intergenic Regions (IGR)
Intergenic sequences were extracted from genomic sequence based on either the original Genbank gene maps, the Prodigal gene maps, or the Prodigal gene maps amended with our revised start sites. Sets of synonymous IGRs were obtained as follows: for each set of orthologous genes, the upstream IGR from each genome was extracted if the length of the IGR in each of the five genomes was greater than 10 nucleotides. Pairwise comparisons of IGRs in each set were performed using BLASTN with a default cutoff of E = 10. Results were used to obtain the normalized sequence identity measure (described above) for the IGRs. If BLASTN did not return a score for a pair of IGR sequences, the normalized sequence identity for the pair was set to 0. We calculated the median of the normalized sequence identity scores for all pairs in an ortholog set as a single metric of IGR sequence identity within the set. To distinguish IGR variation arising from sequence divergence versus recombination events, we assessed the conservation of the upstream flanking gene. If the upstream gene in all five genomes belonged to the same ortholog set, the local gene context was recorded as "conserved".
Signal Peptide Detection and subcellular localization
The presence or absence of signal peptides and subcellular localization was determined using PSORTb v3.0.2 .
Data Management and Codebase
All data was managed and stored within an Oracle 11 g database. The data management and analysis pipeline were written in a combination of Java, PL/SQL and C-shell scripts.
A Laboratory Directed Research and Development grant, 20080138DR, from Los Alamos National Laboratory, supported this work. We thank two anonymous reviewers for their suggestions that improved the manuscript.
- Delcher A, et al: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007, 23: 673-679. 10.1093/bioinformatics/btm009.PubMed CentralPubMedView Article
- Lukashin A, Borodovsky M: GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998, 26: 1107-1115. 10.1093/nar/26.4.1107.PubMed CentralPubMedView Article
- Hyatt D, et al: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010, 11: 119-10.1186/1471-2105-11-119.PubMed CentralPubMedView Article
- Palleja A, Harrington ED, Bork P: Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?. BMC Genomics. 2008, 9: 335-10.1186/1471-2164-9-335.PubMed CentralPubMedView Article
- De Souza GA, Wiker HG: The Impact of Proteomic Advances on Bacterial Gene Annotation. Current Proteomics. 2009, 6 (2): 84-92.View Article
- Meinnel T, Giglione C: Tools for analyzing and predicting N-terminal protein modifications. Proteomics. 2008, 8 (4): 626-649. 10.1002/pmic.200700592.PubMedView Article
- Smollett KL, et al: Experimental determination of translational start sites resolves uncertainties in genomic open reading frame predictions - application to Mycobacterium tuberculosis. Microbiology (Reading). 2009, 155 (1): 186-197. 10.1099/mic.0.022889-0.View Article
- Bibb MJ, Findlay PR, Johnson MW: The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. Gene (Amsterdam). 1984, 30: 157-166.View Article
- Ishikawa J, Hotta K: FramePlot: a new implementation of the Frame analysis for predicting protein-coding regions in bacterial DNA with a high G+C content. FEMS Microbiol. Lett. 1991, 174: 251-253. 10.1111/j.1574-6968.1999.tb13576.x.View Article
- Kudla G: Coding-Sequence Determinants of Gene Expression in Escherichia coli. Science. 2009, Washington DC, 324 (5924): 255-258. 10.1126/science.1170160.
- Dougan DA, Truscott KN, Zeth K: The bacterial N-end rule pathway: expect the unexpected. Molecular Microbiology. 2010, 76 (3): 545-558. 10.1111/j.1365-2958.2010.07120.x.PubMedView Article
- Hegde RS, Bernstein HD: The surprising complexity of signal sequences. Trends in Biochemical Sciences. 2006, 31 (10): 563-571. 10.1016/j.tibs.2006.08.004.PubMedView Article
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.PubMed CentralPubMedView Article
- Yu NY, et al: PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics. 2010, 26: 1608-1615. 10.1093/bioinformatics/btq249.PubMed CentralPubMedView Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.