Ambushing the ambush hypothesis: predicting and evaluating off-frame codon frequencies in Prokaryotic Genomes
© Morgens et al.; licensee BioMed Central Ltd. 2013
Received: 23 January 2013
Accepted: 19 June 2013
Published: 22 June 2013
In this paper, we address the evidence for the Ambush Hypothesis. Proposed by Seligmann and Pollock, this hypothesis posits that there exists a selection for off-frame stop codons (OSCs) to counteract the possible deleterious effects of translational frameshifts, including the waste of resources and potential cytotoxicity. Two main types of study have been used to support the hypothesis. Some studies analyzed codon usage and showed that codons with more potential to create OSCs seem to be favored over codons with lower potential; they used this finding to support the Ambush Hypothesis. Another study used 342 bacterial genomes to evaluate the hypothesis directly, finding significant excesses of OSCs in these genomes.
We repeated both analyses with newer datasets and searched for other factors that could explain the observed trends. In the first case, the relative frequency of codons with the potential to create OSCs is directly correlated with the GC content of organisms, as stop codons are GC-poor. When evaluating the frequency of OSCs directly in 1,976 bacterial genomes we also detected a significant excess. However, when comparing the excess of OSCs with similarly obtained results for the frequency of out-of-frame sense codons, some sense codons have a more significant excess than stop codons.
Two avenues of study have been used to support the Ambush Hypothesis. Using the same methods as these previous studies, we demonstrate that the evidence in support of the Ambush Hypothesis does not hold up against more rigorous testing.
Keywords: Ambush hypothesis; Hidden stop codons; Codon bias; Markov models; Translational frameshifts
If a frameshift occurs during translation, the ribosome will continue to translate the mRNA in the wrong reading frame and to extend the protein by incorporating incorrect amino acids until a stop codon is reached in the new reading frame. Such events are, at best, a waste of the cell’s resources and, at worst, could result in a cytotoxic product. Given the potential deleterious effect of translational frameshifts, Seligmann and Pollock proposed a mechanism by which these effects can be minimized. The Ambush Hypothesis posits that having numerous OSCs would both minimize the waste of resources and reduce the chances of any cytotoxic products from a translational frameshift, and that, as such, genomes with a large number of OSCs would be selectively favored. OSCs have also been reported to affect developmental stability in vertebrates and protein expression levels.
It has been suggested that a similar selection to minimize the effects of translational frameshifts might have affected genetic code evolution and that the amino acid assignments in the code were optimized to generate a high number of OSCs . The Ambush Hypothesis, however, predicts a continued selective force in favor of OSCs . Thus, if this hypothesis were correct we would expect to find more OSCs than expected by chance in coding regions of genomes.
Support for the Ambush Hypothesis has relied on two methods to evaluate the excesses of OSCs in contemporary genomes. The first methodology relies on the fact that some codons are more likely to generate OSCs than others, and that these codons should be favored, leading to a higher number of OSCs. This prediction was originally tested in 100 organisms  and a similar analysis expanded the dataset to over 14,000 organisms .
These studies showed mixed results. In particular, the most recent analysis  found that only six percent of organisms – 901 out of 14,468 – showed a significant positive correlation between the usage of a codon and its potential to form OSCs. However, in both studies the authors concluded that the existence of any significant positive correlations indicated selection for OSCs [1, 3].
We believe that the data used in these studies are not appropriate to test the Ambush Hypothesis. Codon usage bias is known to vary between organisms, and several hypotheses have been proposed to explain these differences. The selection for OSCs predicted by the Ambush Hypothesis should be a very weak selective force, as translational frameshifts affect a single protein product at a time, and regardless of the hypothesis, some level of OSCs is expected by chance. The number and strength of other selective pressures influencing codon bias make it unlikely that codon usage can be used to evaluate the Ambush Hypothesis. In fact, as we will show, the usage of codons with high potential to form OSCs can be almost completely explained by the GC content of genomes.
The second methodology used to test the Ambush Hypothesis is more direct and compares the actual frequency of OSCs in a genome with the frequency expected by chance based on the coding properties of that genome . In a 2010 study , the authors used second and fifth-order three-periodic Markov models to calculate the expected number of hidden stops in 342 phylogenetically representative prokaryotes. Comparing these to the actual counts of OSCs in these genomes, they found a statistically significant excess of OSCs in 93% of the organisms examined. In addition, the authors tested another prediction of the Ambush Hypothesis.
Specifically, as stop codons are AT-rich, GC-rich genomes should have fewer OSCs simply by chance. If the Ambush Hypothesis were correct, the pressure to have OSCs would be stronger in GC-rich organisms; in comparison, AT-rich organisms would inherently have a higher occurrence of OSCs and therefore experience reduced selective pressure for having extra OSCs. Thus, the proportional excesses of OSCs should be larger in GC-rich organisms than in AT-rich ones. The authors tested this prediction and found a correlation between the observed excess of OSCs and GC content, as expected if the hypothesis were correct. They interpreted their results as strong evidence in favor of the Ambush Hypothesis .
However, one has to be careful when interpreting statistical significance in biological systems. While 93% is an impressive and statistically significant majority, this number is given without context. With no controls or theoretical framework to compare the 93% to, it is difficult to judge the biological significance of the results.
In this paper, we used 1,976 complete bacterial transcriptomes to reassess the evidence for the Ambush Hypothesis using both previously described methodologies. We first establish that codon usage is an irrelevant measure of the effects of the Ambush Hypothesis, as the usage of codons with high potential to form OSCs can be almost completely explained by the GC content of genomes. We then evaluate the excess of OSCs in the transcriptomes directly by comparing the observed frequency of OSCs with the expected frequency determined using Markov models. Although our results show the same highly significant statistical excess of OSCs in the organisms studied, they also show that several sense codons are present out-of-frame with similar or even greater excesses than the stop codons are. These results show that neither of the previously used approaches can be used as evidence in favor of the Ambush Hypothesis.
Results and discussion
Codon usage analysis
A previous study correlated codon usage with the potential to form OSCs for 14,468 organisms. The study used codon usage data from the CUTG (Codon Usage Tabulated from GenBank) database. This database contains codon usage information for all full-length protein gene entries present in GenBank, tabulated by organism, regardless of the number of sequences for the organism. For some organisms, those with fully sequenced genomes, the database contains an accurate estimate of codon bias; for others, those with just a few sampled genes, the real codon usage might be very different from that reported in the database. This is by design; the CUTG database tries to be comprehensive and to summarize all the data available in GenBank. In the current study we chose to be conservative and included only those prokaryotic organisms that have a full transcriptome sequence available on GenBank, which limits us to 1,976 organisms.
The previous study found that 901 of 14,468 (6.2%) organisms possessed significant (p < 0.05) positive correlations between codon bias and potential to form OSCs. Neither this study nor the earlier analysis of 100 organisms presents convincing evidence for its central conclusion: that there exists a positive correlation between potential to create OSCs and codon bias.
When we tested the validity of this statement in the dataset of fully sequenced bacterial genomes, we found statistically significant (p < 0.05) positive correlations between codon usage and the propensity of that codon to form OSCs in 668 of the 1,976 genomes analyzed (34%). Assuming, as previous papers have, that a selection for OSCs influences codon usage, this could be interpreted as evidence for the Ambush Hypothesis [1, 3].
The correlation between codon usage, potential to form OSCs, and GC content can be explained by the fact that stop codons are AT-rich. Thus, codons that have the potential to form OSCs tend to be AT-rich and consequently are disproportionately prevalent in organisms with AT-rich genomes. Given that other selective forces or mutational biases determine the GC content of an organism [8, 10–12], this implies that the relationship found between a codon’s usage and its propensity to form OSCs is a poor test of the predictions of the Ambush Hypothesis, as it is easily explained by the high AT content of stop codons.
Excesses of OSCs in fully sequenced bacterial genomes
Even though selection for OSCs does not appear to affect codon usage, it could still affect the frequency with which one codon follows another in the genome. This could lead to an excess of OSCs over the expected value for a genome with the same coding properties. The authors in tested this prediction of the model by analyzing the genomes of 342 prokaryotes.
For each organism, they counted the number of OSCs in all transcripts and compared it to the frequency of OSCs expected from randomized genomes, which were constructed to maintain the same coding properties as the original genome. To generate the randomized genomes they used both second-order and fifth-order, 3-periodic Markov models implemented in the MARKOV package of GenRGenS [4, 13]. These models generate random genomes that preserve, for each codon position, the frequency with which each nucleotide follows a given dinucleotide or pentanucleotide in the original genome, respectively.
To consider the magnitude of the observed effect, Tse et al. calculated that, based on their results, Yersinia enterocolitica has about 800 OSCs more than expected by chance. This number sounds high until one considers that Y. enterocolitica has over 4,000 coding sequences, so the excess comes to about 1/5 of an OSC per gene. In our analysis, Y. enterocolitica has 168,841 OSCs in its transcriptome, 112 more than expected under our 2-3 Markov model, which amounts to an extra 0.028 OSCs per gene. Even though these numbers are low, both results are highly statistically significant. Considering that the expected selection for OSCs should be an extremely weak force, these results seem to validate the Ambush Hypothesis.
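The per-gene figures quoted above follow from simple division. The short sketch below reproduces that arithmetic; the gene count of 4,000 is a rounded assumption (the text says only "over 4,000"), and the helper name is ours, used purely for illustration.

```python
def excess_per_gene(excess_oscs: float, n_genes: int) -> float:
    """Average number of surplus OSCs carried by each coding sequence."""
    return excess_oscs / n_genes


# Assumption: "over 4,000 coding sequences" is rounded down to 4,000.
N_GENES = 4000

tse_per_gene = excess_per_gene(800, N_GENES)   # about 1/5 of an OSC per gene
ours_per_gene = excess_per_gene(112, N_GENES)  # 0.028 OSCs per gene

# The same excess expressed relative to the expected count:
observed = 168_841
expected = observed - 112
relative_excess = (observed - expected) / expected  # well under 0.1%
```

Seen this way, the excess is a small fraction of a percent of the expected count, which is why a reference point for biological significance is needed.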
Yet without a theoretical framework how can we determine whether a statistically significant excess is biologically meaningful? Nature provides such a framework in the form of 61 sense codons. They provide a context that can be used to evaluate the biological significance of the observed excesses of OSCs, assuming that there is no selection for off-frame sense codons.
Table 1 Summary of the results of the Markov models for selected codons. Columns, for each of the 2-3 and 5-3 Markov models: % organisms with excess; average log ratio. Rows: all stop codons and selected individual codons (table body not reproduced).
Thus, interpreting the statistically significant excesses of OSCs as evidence of selection in favor of OSCs would impel us to interpret our data as evidence for the selection of other off-frame sense codons as well. For example, under this framework, our data provide compelling evidence of selection pressure for off-frame TGN codons, and that this selective force is stronger than that operating on OSCs (Table 1).
Perhaps the most convincing piece of evidence previously brought forth to support the Ambush Hypothesis is the positive relationship between GC content of a genome and its excess of OSCs . This is a validation of a specific prediction of the Ambush Hypothesis: In organisms with high GC content, fewer OSCs are expected by chance, thus a greater excess is needed to counterbalance the effects of translational frameshifts.
We can replicate this result with our own data, but again we must interpret this in context. Do other codons show similar relationships to GC content? In fact, they do: using both Markov models, 10 individual sense codons show a stronger (steeper slope and larger R² values) positive relationship between GC content and off-frame occurrence than OSCs.
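A comparison of slopes and R² values of this kind can be obtained from an ordinary least-squares fit of each codon's log ratio against genome GC content. The sketch below is a minimal illustration under that assumption; the text does not state which regression routine was used, and the function name is hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared). Assumes x and y have the same
    length and that neither is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

Here `x` would hold the GC content of each genome and `y` the log ratio of observed to expected off-frame counts for one codon (or for the three stop codons combined); fitting each codon separately allows the slopes and R² values to be ranked against those of the OSCs.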
Taken together, our results suggest that the previous results in support of the Ambush Hypothesis were simply detecting excesses of TGA codons. While this could still be interpreted as evidence for the Hypothesis, it seems to stem from an unrelated statistically significant overrepresentation of TGN codons off-frame. We will not speculate here if this statistically significant excess of out-of-frame TGN codons is biologically relevant, but it is not consistent with the Ambush Hypothesis.
Previous papers employed the correlation between a codon’s usage and its propensity to form OSCs as evidence for the Ambush Hypothesis [1, 3]. We have shown that this data reflects only the GC content of the organism and thus cannot be used to evaluate the hypothesis.
Tse et al. found an excess of OSCs in 93% of 342 bacterial genomes using Markov models and also detected a significant positive correlation between the GC content of a genome and the OSC excess of the genome, consistent with the predictions of the Ambush Hypothesis. In an expanded dataset comprising 1,976 bacterial genomes, we observed an excess of OSCs in 83% of the organisms and demonstrated a strongly positive relationship between genome GC content and OSC excess. However, we also found that excesses of three sense codons are present in more organisms and at a higher frequency than those of stop codons, and are even more strongly correlated with GC content (Table 1). Given that the Ambush Hypothesis does not predict any selection for off-frame sense codons, we conclude that our results and the analogous results by do not represent evidence in favor of the Hypothesis.
We retrieved the annotated transcripts of 2,023 bacterial genomes available from NCBI in August 2012 (http://ftp.ncbi.nlm.nih.gov/genomes/). We excluded from the analyses organisms that reassign stop codons, and for the remaining 1,976 organisms we removed incomplete or ambiguous transcripts.
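The exact criteria for an "incomplete or ambiguous" transcript are not spelled out here. The sketch below shows one possible filter under our own assumptions: a transcript is kept only if its length is a multiple of three, it contains only A, C, G and T, and it ends with a standard stop codon. The function name and the criteria are illustrative, not a description of the scripts actually used.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_clean_transcript(seq: str) -> bool:
    """Return True if seq looks like a complete, unambiguous coding sequence.

    Assumed criteria (not taken from the text):
      * length is a multiple of three and covers at least two codons,
      * only the four unambiguous nucleotides occur,
      * the final codon is a standard stop codon.
    """
    seq = seq.upper()
    return (
        len(seq) >= 6
        and len(seq) % 3 == 0
        and set(seq) <= set("ACGT")
        and seq[-3:] in STOP_CODONS
    )


def clean_transcripts(transcripts):
    """Keep only the transcripts that pass is_clean_transcript."""
    return [s.upper() for s in transcripts if is_clean_transcript(s)]
```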
Codon usage analysis
Previous analyses [1, 3] tried to verify the Ambush Hypothesis by correlating the usage of each codon with the number of ways in which the codon can form an OSC – from 0 to 6 ways (Figure 1). Both analyses used the CUTG (Codon Usage Tabulated from GenBank) database to determine the usage of each codon in different organisms and did a linear regression to determine the correlation coefficient between codon usage and the number of ways in which it can form an OSC.
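One way to count the "ways" in which a codon can form an OSC is to ask, for each of the three stop codons, whether the codon can supply the first one or two nucleotides of the stop (with the following codon supplying the rest) or the last one or two nucleotides (with the preceding codon supplying the rest). The sketch below implements this counting scheme. It is our assumption about how the score is defined; it does yield values from 0 to 6, but it has not been checked against the scores in Figure 1.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def osc_potential(codon: str) -> int:
    """Count the ways a codon can contribute to an off-frame stop codon.

    For every stop codon and every overlap length k (1 or 2), one way is
    counted if the codon's last k nucleotides match the stop's first k,
    and one way if the codon's first k nucleotides match the stop's last k.
    """
    codon = codon.upper()
    ways = 0
    for stop in STOP_CODONS:
        for k in (1, 2):
            if codon[-k:] == stop[:k]:   # codon starts the off-frame stop
                ways += 1
            if codon[:k] == stop[-k:]:   # codon finishes the off-frame stop
                ways += 1
    return ways


# Scores for all 64 codons, e.g. to group codons by potential.
ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
POTENTIAL = {codon: osc_potential(codon) for codon in ALL_CODONS}
```

Under this scheme a GC-only codon such as CCC scores 0, whereas AAT scores 6, which illustrates why high-potential codons tend to be AT-rich.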
We replicated these analyses [1, 3] using 1,976 fully sequenced bacterial genomes. Instead of using the CUTG database, we calculated the codon usage for each organism by directly analyzing the transcripts in the GenBank annotation. GC content for each organism was also calculated from the transcripts. We grouped the codons by the number of ways in which each can form an OSC. Note that we did not include stop codons themselves in this analysis, as none occur in-frame in the coding region. Subsequently, using the average codon usage of each genome, we calculated the correlation (Spearman’s rank coefficient) between codon bias and potential to form OSCs.
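The quantities used in this step can be sketched as follows: in-frame codon usage and GC content computed directly from the transcripts, and a Spearman rank coefficient computed as the Pearson correlation of average ranks. This is a minimal standard-library illustration; the function names are ours, and the original scripts may have used a statistics package instead.

```python
import math
from collections import Counter

STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def codon_usage(transcripts, exclude=STOP_CODONS):
    """Relative in-frame frequency of each codon, ignoring excluded codons."""
    counts = Counter()
    for seq in transcripts:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon not in exclude:
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def gc_content(transcripts):
    """Fraction of G and C nucleotides across all transcripts."""
    gc = sum(seq.count("G") + seq.count("C") for seq in transcripts)
    return gc / sum(len(seq) for seq in transcripts)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

For one genome, `x` would be the OSC potential of each sense codon and `y` its usage; because the potential takes only a few distinct values, ties are common and the average-rank treatment matters.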
Excesses of OSCs in fully sequenced bacterial genomes
We counted the number of OSCs in the +1 and +2 frames for each organism directly from available transcripts. These corresponded to our observed values, the true number of OSCs in each organism. To produce expected values of OSCs, we used the same two Markov models described in  to create artificial sequences that preserved the fundamental structure of the genome. These models have been used before to successfully identify microbial genes [14, 15].
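Counting off-frame codons amounts to reading each transcript in steps of three starting from the second and third nucleotides. The following sketch illustrates this; the function name and default arguments are hypothetical rather than taken from the original scripts.

```python
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def count_off_frame(seq: str, targets=STOP_CODONS) -> int:
    """Count occurrences of the target codons in the +1 and +2 frames.

    Frame +1 starts at the second nucleotide of the transcript and frame
    +2 at the third; incomplete trailing triplets are ignored.
    """
    seq = seq.upper()
    total = 0
    for frame in (1, 2):
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in targets:
                total += 1
    return total
```

Passing a single codon as `targets`, for example `{"TGG"}`, gives the off-frame count of one sense codon, so the same routine can serve for the comparison against sense codons.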
For each organism, we used the complete transcriptome to train our Markov model. The terminating stop codon was removed before training so that stop codons would not be inserted into our artificial genome in frame, as this does not occur biologically and would bias the calculation of the expected frequency of OSCs. The transition matrix generated this way contained the probability of each nucleotide following a given dinucleotide or pentanucleotide, which were used to parameterize the second and fifth order models.
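The training step can be sketched as a table of conditional probabilities indexed by codon position (phase) and by the preceding two or five nucleotides. The code below is a minimal illustration under our own assumptions about the data layout (a dictionary keyed by phase and context); it is not the GenRGenS implementation nor the original scripts.

```python
from collections import Counter, defaultdict

STOP_CODONS = {"TAA", "TAG", "TGA"}


def train_periodic_markov(transcripts, order):
    """Estimate a 3-periodic Markov model of the given order.

    Returns a dict mapping (phase, context) to a dict of next-nucleotide
    probabilities, where phase is the codon position (0, 1 or 2) of the
    nucleotide being predicted and context is the preceding `order`
    nucleotides. The terminating stop codon is removed before counting so
    that in-frame stops are never learned.
    """
    counts = defaultdict(Counter)
    for seq in transcripts:
        seq = seq.upper()
        if seq[-3:] in STOP_CODONS:
            seq = seq[:-3]
        for i in range(order, len(seq)):
            counts[(i % 3, seq[i - order:i])][seq[i]] += 1

    model = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        model[key] = {nt: n / total for nt, n in next_counts.items()}
    return model
```

Calling the function with `order=2` or `order=5` gives the parameters of the second-order and fifth-order models, respectively.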
To generate the artificial sequences, we initiated each artificial transcript using the first two or five nucleotides of the natural transcript and then used the probabilities in the transition matrix to add nucleotides one by one, ending when the artificial transcript was the same length as the natural transcript. We then counted the number of OSCs in the entire artificial genome. We repeated this 200 times for each organism to calculate an average expected value. Increasing the number of repetitions did not affect the average significantly. These expected counts were compared to the actual values.
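The generation procedure can be sketched as below. The model is assumed to be a dictionary mapping (codon position, preceding nucleotides) to next-nucleotide probabilities; that layout, the function names, and the uniform fallback for contexts never seen in training are all our assumptions, introduced only for illustration.

```python
import random


def generate_transcript(model, start, length, rng):
    """Extend `start` to `length` nucleotides using a 3-periodic model.

    `start` holds the first nucleotides of the natural transcript and its
    length defines the model order (assumed to be at least 1).
    """
    order = len(start)
    out = list(start)
    while len(out) < length:
        context = "".join(out[-order:])
        probs = model.get((len(out) % 3, context))
        if probs is None:
            # Assumption: fall back to uniform choice for unseen contexts.
            probs = {nt: 0.25 for nt in "ACGT"}
        nucleotides = list(probs)
        weights = [probs[nt] for nt in nucleotides]
        out.append(rng.choices(nucleotides, weights=weights)[0])
    return "".join(out)


def expected_count(model, starts_and_lengths, count_fn, repetitions=200, seed=0):
    """Average of count_fn summed over an artificial genome, across repetitions.

    `starts_and_lengths` lists (start, length) for every natural transcript;
    `count_fn` maps one sequence to a count, for example an OSC counter.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(repetitions):
        total = sum(
            count_fn(generate_transcript(model, start, length, rng))
            for start, length in starts_and_lengths
        )
        totals.append(total)
    return sum(totals) / len(totals)
```

Fixing the random seed, as done here, makes the expected value reproducible between runs; whether the original scripts did so is not stated.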
We repeated this procedure using one codon at a time to calculate the degree to which each codon was overrepresented off-frame in the transcripts of the genomes analyzed. For each organism, after calculating both the expected and the observed off-frame frequencies for the three stop codons together and for all 64 codons individually, we quantified the over- or under-representation of a codon (or of the stop codons) using the natural log of the ratio of the number of observed off-frame codons to the average number of predicted off-frame codons. If this log ratio was negative, then there were fewer off-frame codons than expected. If it was positive, then there were more off-frame codons than expected. A log ratio was used because it is symmetric around zero: raw ratios confine deficits to the interval between 0 and 1 but leave excesses unbounded, which biases averages of ratios towards higher values.
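The log ratio and the reason for preferring it over a raw ratio can be illustrated with a few lines of code. The function name is ours; the observed count is assumed to be greater than zero.

```python
import math
from statistics import mean


def log_ratio(observed, simulated_counts):
    """Natural log of observed count over the mean simulated count.

    Positive values indicate an excess of the codon off-frame, negative
    values a deficit. Assumes observed > 0 and a positive mean.
    """
    return math.log(observed / mean(simulated_counts))


# A two-fold excess and a two-fold deficit cancel on the log scale...
doubled = log_ratio(200, [100])
halved = log_ratio(50, [100])
mean_of_logs = mean([doubled, halved])

# ...but not on the raw ratio scale, where the average is pulled upward.
mean_of_ratios = mean([200 / 100, 50 / 100])
```

On the raw scale the two organisms in this toy example would average to a ratio of 1.25, suggesting an overall excess that is not there.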
There were minor differences between our basic analysis and that of : Before training their Markov model, Tse et al. removed any transcript less than 100 amino acids long. We found this cutoff arbitrary and included all transcripts. They also limited their analysis to one species per genus to prevent oversampling . Given the large number of organisms available to us, we did not trim our dataset, reasoning that the dangers of oversampling were less important and less likely to occur than the inadvertent introduction of bias. All analyses were performed using Python scripts. Source code is available upon request.
This work was partially funded by an HHMI grant to Pomona College. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Seligmann H, Pollock DD: The ambush hypothesis: hidden stop codons prevent off-frame gene reading. DNA Cell Biol. 2004, 23: 701-705. 10.1089/dna.2004.23.701.
- Wong TY, Fernandes S, Sankhon N, Leong PP, Kuo J, Liu JK: Role of Premature Stop Codons in Bacterial Evolution. J Bacteriol. 2008, 190: 6718-6725. 10.1128/JB.00682-08.
- Singh TR, Pardasani KR: Ambush hypothesis revisited: evidence for phylogenetic trends. Comput Biol Chem. 2009, 33: 239-244. 10.1016/j.compbiolchem.2009.04.002.
- Tse H, Cai JJ, Tsoi HW, Lam EP, Yuen KY: Natural selection retains overrepresented out-of-frame stop codons against frameshift peptides in prokaryotes. BMC Genomics. 2010, 11: 491. 10.1186/1471-2164-11-491.
- Seligmann H: The ambush hypothesis at the whole-organism level: Off frame, ‘hidden’ stops in vertebrate mitochondrial genes increase developmental stability. Comput Biol Chem. 2010, 34: 80-85. 10.1016/j.compbiolchem.2010.03.001.
- Seligmann H: Cost minimization of ribosomal frameshifts. J Theor Biol. 2007, 249 (1): 162-167. 10.1016/j.jtbi.2007.07.007.
- Itzkovitz S, Alon U: The genetic code is nearly optimal for allowing additional information within protein-coding sequences. Genome Res. 2007, 17 (4): 405-412. 10.1101/gr.5987307.
- Salim HWM, Cavalcanti ARO: Factors Influencing Codon Usage Bias in Genomes. J Braz Chem Soc. 2008, 19: 257-262.
- Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from the international DNA sequence databases, status for the year 2000. Nucleic Acids Res. 2000, 28: 292. 10.1093/nar/28.1.292.
- Schaaper RM, Dunn RL: Spontaneous mutation in the Escherichia coli lacI gene. Genetics. 1991, 129: 317-326.
- Sargentini NJ, Smith KC: DNA sequence analysis of gamma-radiation (anoxic)-induced and spontaneous lacId mutations in Escherichia coli K-12. Mutat Res. 1994, 309: 147-163. 10.1016/0027-5107(94)90088-4.
- Bernardi G: The neoselectionist theory of genome evolution. PNAS. 2007, 104 (20): 8385-8390. 10.1073/pnas.0701652104.
- Ponty Y, Termier M, Denise A: GenRGenS: software for generating random genomic sequences and structures. Bioinformatics. 2006, 22: 1534-1535. 10.1093/bioinformatics/btl113.
- Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998, 26 (2): 544-548. 10.1093/nar/26.2.544.
- Borodovsky M, McIninch JD, Koonin EV, Rudd KE, Medigue C, Danchin A: Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Res. 1995, 23 (17): 3554-3562. 10.1093/nar/23.17.3554.