Preferred and avoided codon pairs in three domains of life
© Tats et al. 2008
Received: 13 June 2008
Accepted: 08 October 2008
Published: 08 October 2008
Alternative synonymous codons are not used with equal frequencies. In addition, the contexts of codons - neighboring nucleotides and neighboring codons - can have certain patterns. The codon context can influence both translational accuracy and elongation rates. However, it is not known how strong or conserved the codon context preferences in different organisms are. We analyzed 138 organisms (bacteria, archaea and eukaryotes) to find conserved patterns of codon pairs.
After removing the effects of single codon usage and dipeptide biases we discovered a set of neighboring codons for which avoidances or preferences were conserved in all three domains of life. Such biased codon pairs could be divided into subtypes on the basis of the nucleotide patterns that influence the bias. The most frequently avoided type of codon pair was nnUAnn. We discovered that 95.7% of avoided nnUAnn type patterns contain out-frame UAA or UAG triplets on the sense and/or antisense strand. On average, nnUAnn codon pairs are more frequently avoided in ORFeomes than in genomes. Thus we assume that translational selection plays a major role in the avoidance of these codon pairs. Among the preferred codon pairs, nnGCnn was the major type.
Translational selection shapes codon pair usage in protein coding sequences by rules that are common to all three domains of life. The most frequently avoided codon pairs contain the patterns nnUAnn, nnGGnn, nnGnnC, nnCGCn, GUCCnn, CUCCnn, nnCnnA or UUCGnn. The most frequently preferred codon pairs contain the patterns nnGCnn, nnCAnn or nnUnCn.
The frequencies of synonymous codons in protein coding sequences are biased and different organisms tend to use different sets of synonymous codons. In addition, other codons are juxtaposed non-randomly with each codon. These preferences are typically referred to as codon context biases. It is suggested that codon context biases are associated with translational efficiency, since codon context influences translational elongation rates. Moreover, experimental results support the observation that codon context is more strongly related to translational efficiency than single codon usage. A second important parameter that is influenced by codon context is translational accuracy. Codon context can influence both mis-sense and nonsense suppression [2–7]. In addition, codons in combination with surrounding nucleotides can form mononucleotide repeats, which may cause transcriptional [8, 9] or translational slippage. Frameshift errors on 'hungry' codons in specific nucleotide contexts also increase under starvation conditions. Several programmed frameshifting sites have been described in the coding regions of mRNAs from different organisms (e.g. [12, 13]). Such sites are used for regulating gene expression through recoding. Nevertheless, frameshifting errors are rare events in most sequences, occurring with a frequency less than once every 10,000 codons. This means that sequences that are prone to frameshifting are successfully avoided in coding sequences. For example, it has been shown that certain heptanucleotides that are prone to frameshifts are under-represented in the coding sequences of Saccharomyces cerevisiae and Escherichia coli.
There have been many studies analyzing codon pair biases in a limited number of species [16–21]. The main selective effects on codon context are found in the nucleotides following the codon in the 3' direction [16, 18, 19, 22]. It has been found that the specific preferred or avoided nucleotide patterns differ among species [16, 22].
The only large-scale comparative analysis to date suggested that the codon context in eukaryotes is biased because target sequences for DNA methylation and trinucleotide repeats are present at high frequencies, while in bacteria and archaea the codon context is influenced mainly by the translational machinery.
Since the structure and function of the ribosomal decoding centre are highly conserved in evolution, we could expect that avoidance of and preference for certain sequence contexts would also be conserved in the protein coding sequences of different organisms. Previous studies suggest that the effects of codon context are influenced by the physical interactions between tRNA isoacceptors in the ribosomal P- and A-sites [1, 16, 23]. It has been shown that in five gamma proteobacteria, Bacillus subtilis and two yeasts, the A-site codons decoded by the same tRNA have similar patterns of P-site codon pairing preference. In addition, it has been confirmed that E-site occupation is essential for preventing frameshifts [24–26]. It was shown recently that species-specific combinations of three consecutive codons are highly biased among fungal species, to the extent that certain combinations are completely absent. This study is a comparative analysis of the usage of neighboring and more distant codon pairs in 138 randomly selected organisms belonging to different domains of life. We show that certain codon context biases are conserved in the protein coding sequences of different species. Most of them are probably influenced by translational rather than DNA-related mechanisms.
To study the common rules of codon context bias we looked for codon pairs that are significantly preferred or avoided in the three domains of life. These conserved cases of biased codon pairs are most probably caused by conserved molecular mechanisms and may shed light on the mechanisms shaping genes and genomes. In this study a preferred or avoided codon pair was designated a "conserved biased codon pair" if it was statistically significantly avoided or preferred in more than 50% of the organisms studied. This criterion is probably why we found many more conserved codon pairs than a previous study, in which universal rules were sought only among the ten most conserved codon pairs in each separate domain.
The significance of the bias of each codon pair in each genome was calculated by comparing the observed and expected occurrences of that pair in the open reading frames (ORFs) of a given genome (see Methods). It is important to emphasize that the expected frequency of a codon pair represents the random co-occurrences of two codons, not the expected frequency of the corresponding hexanucleotide. This means that the significantly over-represented co-occurrence of two codons does not necessarily imply that the corresponding hexanucleotide sequences occur with high frequency.
It is known that proteins contain certain dipeptides at increased and reduced frequencies. To ensure that the effects observed at the codon pair level were not caused by avoidances or preferences of dipeptides, the expected codon pair values were normalized to the dipeptide frequencies (see Methods). This aspect was not considered in previous studies [21, 22].
On the basis of that criterion, we found 288 neighboring codon pairs (1–2 codon pairs) that were preferred or avoided in most of the organisms studied [Additional files 1, 2]. We also tested the conservation of more distant (1–3, 1–4, 1–5) codon pairs. However, for codons 1–3 we found only one codon pair with significant bias - GGUnnnGGU - which was over-represented in 61% of the organisms studied. No conserved biases were found for more distant codon pairs. Thus, all the following analyses are based on neighboring (1–2) codon pairs.
The most straightforward method for testing the translational effects of under- and over-representation of a codon pair would be to compare its avoidances and preferences in the correct reading frame and in two other reading frames. In such a comparison, however, one cannot effectively remove the influence of single codon preferences or amino acid preferences on the avoidance/preference of codon pairs in other frames. Thus, we consider the comparison of effects in +1 and +2 reading frames biased and incorrect, so we compare the effects in the ORFeome and genome instead.
Therefore, to test whether codon pairs are biased because of translational effects or because of mechanisms operating at the DNA level, we calculated the ratio of observed and expected co-occurrences of two trinucleotides at the genome level. If the main selective force for codon pair usage were related to translational effects at the ribosome and not to biases in mechanisms operating at DNA level, the codon pair bias would be stronger in ORFeomes than in genomes. Some conserved under-represented codon pairs contained the stop-codon in +1 or +2 frame. Corresponding trimers cannot occur in one of the frames in genomic regions where they overlap ORFs. This reduced frequency of occurrence was taken into account when the ratio of observed to expected co-occurrences of trimers in genomes was determined.
Table: The top 10 most conserved avoided codon pairs in the organisms studied.
Table: Types of patterns among conserved avoided codon pairs. Columns: % among avoided pairs; the effect is stronger in ORFeome (%); the effect is stronger in whole genome (%).
Many of the nnUAnn type patterns contained codon pairs with stop codons in -1 frame on the sense strand (67%, 47/70). For such pairs, a -1 frameshift event would create premature translational termination. Thus, we assume that avoidance of nnUAnn codon pairs is partly related to out-frame stop codons. On the other hand, avoidance of the UA dinucleotide between two codons also has a role here, because out-frame UGA stop-codons were not observed in any of the conserved biased dicodon pairs.
Interestingly, 83% (58/70) of the nnUAnn type patterns contained out-frame UAA and UAG triplets on the antisense strand. However, there are no known mechanisms that could explain the avoidance of UAA and UAG triplets in the middle of nnUAnn type hexamers on the antisense strand.
Including the antisense strand, almost all (67/70, 95.7%) of the nnUAnn type patterns contained UAA or UAG in -1 frame, although nnUAnn could code for other hexamers containing UAU and UAC in 25% of cases. Only three of the type 1A codon pairs did not contain out-frame UAA or UAG. All three began with GGUA and did not show strong avoidance on the ORFeome level. Those three pairs may not be related to the same kind of avoidances as all other type 1A codon pairs [Additional file 4].
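The check for out-frame UAA and UAG triplets on either strand can be illustrated with a short sketch. This is not the authors' Perl code; the function names are hypothetical, the RNA alphabet is assumed, and the positions are reported simply as offsets within the hexamer rather than as named reading frames.

```python
# Hypothetical helper names; sequences are assumed to use the RNA alphabet A, C, G, U.
COMPLEMENT = str.maketrans("ACGU", "UGCA")
STOPS = ("UAA", "UAG")  # the two stop codons discussed in the text


def reverse_complement(rna):
    """Return the antisense strand, read 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def out_of_frame_stops(hexamer):
    """Report offsets (1 or 2) at which a UAA/UAG triplet starts inside a codon pair.

    Offsets 0 and 3 are the in-frame codons themselves and are not examined.
    Antisense offsets refer to positions in the reverse-complemented hexamer.
    """
    def offsets(seq):
        return [k for k in (1, 2) if seq[k:k + 3] in STOPS]

    return {
        "sense": offsets(hexamer),
        "antisense": offsets(reverse_complement(hexamer)),
    }


print(out_of_frame_stops("GGUAAC"))  # UAA on the sense strand
print(out_of_frame_stops("GCUAUC"))  # UAG on the antisense strand only
print(out_of_frame_stops("GGUACC"))  # begins with GGUA, no out-frame UAA/UAG
```

For an nnUAnn hexamer, a sense-strand hit requires A or G in the fifth position and an antisense hit requires U or C in the second position, which is why a pair such as GGUACC contains neither.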
The second most abundant type of conserved avoided codon pair was nnGnnC (type 2A), which was more strongly avoided in ORFeomes than in genomes. The third type, type 3A contained the pattern nnCGCn. Avoidance of mononucleotide repeats such as GGGGGn, GGGGnn, nCCCCn and UUUUUU was also conserved in most organisms (Type 8A). UUUUUU was clearly avoided in ORFeomes but was significantly preferred at the genome level (Table 2).
The most conserved avoided codon pair was UUCGCA (type 6A, UUCGnn), which was under-represented in 86% of the organisms studied. It is interesting to note that this codon pair contains several clearly avoided sub-patterns (UUCGnn, nnCGCn, nnCnnA). The observed/expected ratios of this pair in the ORFeome and genome indicated that UUCGCA was similarly under-represented on both the ORFeome and genome levels (Table 1).
Interestingly, among the different avoided sub-patterns causing the conserved avoidance of codon pairs, the last nucleotide of the P-site codon in a pair was always fixed (the only exception was the pattern UnnnnU for codon pair UUUUUU).
Table: The top 10 most conserved preferred codon pairs in the organisms studied.
Table: Types of patterns among conserved preferred codon pairs. Columns: % among preferred pairs; the effect is stronger in ORFeome (%); the effect is stronger in whole genome (%).
As in the conserved avoided codon pairs, all the different preferred sub-patterns that caused codon pair preferences contained a fixed last nucleotide of the P-site codon in the pair.
Interestingly, five organisms showed significant bias in fewer than five of the top 20 codon pairs (Figures 2 and 3). This raised the question: do those five genomes have different sets of most-biased codon pairs, do they lack strong codon pair biases as such, or do we just lack the statistical power to detect biased codon pairs in them?
To answer this question, we created for each of the genomes studied a dataset containing 150,000 randomly sampled codon pairs. This corresponds to a genome size of about 0.45 Mb (150,000 codons of three nucleotides each), which is close to the smallest genome in our set. Using this standardized genome dataset we calculated how many codon pairs would still remain significantly biased (Figures 2 and 3). The results show that genome size indeed has a statistical effect on the number of biased codon pairs detected. The fraction of biased codon pairs leveled off after genome reduction (Figure 4B). However, the same figure also demonstrates that genome size was not the reason why B. aphidicola and C. blochmannia have low numbers of biased codon pairs. Even in the standardized sample, most other genomes showed bias in 5–15% of codon pairs, whereas B. aphidicola and C. blochmannia had only 1.5% and 1.7% biased codon pairs, respectively. Therefore, these two genomes seem effectively to lack biased codon pairs.
We conclude that codon pair usage bias can be distributed in many different ways. Although most organisms have a similar set of universally conserved biased codon pairs, some organisms use slightly different sets (e.g. N. equitans) and some have a very small number of biased codon pairs (e.g. B. aphidicola, C. blochmannia).
Our findings suggest that certain codon contexts are strongly conserved over all domains of life. It has been proposed that codon context is even more important than single codon usage for translational efficiency.
Codon pair usage analysis showed that the most strongly biased conserved codon pairs are biased irrespective of phylogenetic class (Figures 1, 2, 3). To determine whether this is also true when the usage of all possible codon pairs is considered and compared with RSCU in different phylogenetic classes, we used a tree-based method. The correlation of RSCU or RDCU values between two organisms can be used as a measure of the distance between them: the higher the correlation, the shorter the distance. For example, the distance between two organisms with identical codon usage would be 0 (1 − ρ² with ρ² = 1). These distances can be represented as trees and compared to ribosomal RNA sequence-based phylogenetic trees.
The current study is an extensive investigation of sequence context patterns that are independent of single codon usage and dipeptide usage. We found that some combinations of neighboring codons are similarly avoided or preferred in many different organisms - bacteria, archaea and eukaryotes. The conserved avoidances and preferences of codon pairs observed are not the result of dipeptide biases since the effect of dipeptides was removed. Much of the dataset could be divided into subtypes on the basis of nucleotide patterns influencing the bias of codon pairs. Conserved patterns result mainly from translational effects not from DNA-related mechanisms since the biases are stronger in ORFeomes than in genomes.
It was claimed previously that codon pair preference is primarily determined by a tetranucleotide combination including the last nucleotide of the P-site codon and all three nucleotides of the A-site codon. However, our results showed that different patterns ranging from dinucleotides to hexanucleotides could explain conserved biased codon pair usage. Still, with one exception (codon pair UUUUUU and pattern UnnnnU), all sub-patterns contained a fixed nucleotide in the last position of the P-site codon in the codon pair. As the ribosome does not contact the bases of the codon in the P-site, the reason for this potential P-site effect is not clear.
Previously, the only universal context selection rule found to cover all three domains of life was the avoidance of most codon pairs of the nnUAnn type, which was suggested to result from rejection of TA dinucleotides in DNA sequences. Among the nine groups of under-represented codon pairs found in our study, the largest group was also influenced by the avoidance of UA dinucleotides. However, although TA dinucleotides could be avoided at the genome level, this would not exclude the possibility that avoidance of UA dinucleotides is also important for ORFs and effective translation. Our methods allowed us to compare the observed/expected ratios of codon pairs more specifically between ORFeomes and genomes. The results showed that in 75.7% of cases the avoidance effect for nnUAnn codon pairs was stronger in ORFeomes than in genomes, suggesting the influence of translational mechanisms (Table 2).
Many of the avoided nnUAnn patterns contained out-frame stop codons, UAA or UAG, on the sense strand. This indicates that out-frame stop codons influence the avoidance of nnUAnn codon pairs. The reason for avoiding the codon pairs containing out-frame stops could be to minimize premature translational termination through recognition of those stops by a translation termination factor. The observation that only the stop codons UAA and UAG were avoided suggests that this kind of misreading might be caused by termination factor 1, the protein responsible for decoding them. Although the frequencies of erroneous termination events have been studied, we have no information concerning the possible recognition of termination codons through a frameshifting event. Interestingly, most nnUAnn codon pairs also contained out-frame UAA and UAG triplets on the antisense strand. There are no known mechanisms that could explain such avoidances. However, our results suggest that nnUAnn type avoidances are related to translational mechanisms because they are stronger in ORFeomes than in genomes.
Mononucleotide repeats, especially poly(A) and poly(U) tracts, are also known to cause transcriptional and translational frameshifting [8–10]. Therefore, such contexts should be selected against in protein coding sequences. There were eight mononucleotide repeats among the avoided codon pairs. The repeated nucleotide was G (GGGGGG, GGGGGU, GGGGGC and GGGGCC), C (ACCCCG, ACCCCA and GCCCCG) or U (UUUUUU) [Additional file 4]. All those codon pairs were more strongly avoided in ORFeomes than in genomes. This suggests that they were avoided to reduce the frequency of frameshifting events in polynucleotide sequences.
We also found several conserved preferred codon pairs. The number of conserved preferred codon pairs was smaller than the number of avoided codon pairs [Additional files 1, 2]. This suggests that the selection for more effective and more accurate translation acts primarily through avoidance of the most disadvantageous codon pairs and not through over-representation of the most suitable contexts. The most prevalent type of conserved preferred codon pair was nnGCnn [Additional file 5].
The top 10 avoided and preferred codon pairs were not specific to any larger phylogenetic group, suggesting that usage of those codon pairs is universally conserved (Figures 2 and 3). However, in some organisms with small genomes, only a few of those 20 codon pairs were significantly biased. This is not caused by statistical limitations on finding biased codon pairs in smaller genomes, but rather by the absence of codon-pair bias in those organisms. We also observed that some genomes use sets of most avoided and most preferred codon pairs different from the conserved sets identified in this study.
It has been proposed that codon context is even more important than codon usage for translational efficiency. Our findings suggest that certain codon contexts are markedly conserved over all domains of life. However, the comparison of RSCU and RDCU correlations showed that overall codon pair usage is less conserved than single codon usage (Figure 5). This was also confirmed by the shorter branch lengths in the codon pair usage tree than in the single codon usage tree (Figure 6). Tree analysis showed that in the RSCU tree certain phylogenetic classes, for example bacilli, alpha- and gamma-proteobacteria, have extremely similar codon usage preferences within the classes (Figure 6A). However, on the codon pair tree (RDCU tree), species are more distant from each other (codon pair usage is less similar). In contrast, larger phylogenetic groups are positioned together on the RDCU tree. For example, the similarity of codon pair usage in eukaryotes is higher than the similarity of single codon usage (eukaryotes are placed together on the RDCU tree but not on the RSCU tree). The large differences between the RSCU and RDCU tree topologies and branch lengths imply that codon preference and codon pair preference are shaped by different molecular mechanisms.
Codon frequencies correlate with tRNA concentrations, suggesting that this is a major selective force on codon usage patterns [31–33]. The codon pair preferences can be shaped by several different molecular mechanisms. One is the possible decrease of frameshifting errors through avoidance of mononucleotide repeats [8–10]. In addition, it has been suggested that codon context might be influenced by certain structural constraints imposed by two tRNAs occupying the ribosomal P- and A-sites [1, 16, 23]. Unfortunately, we currently have very limited information about the details of interaction between different tRNAs with the ribosome [29, 34, 35], which precludes further extension of this hypothesis.
The ribosome contains three sites for tRNA binding: the A-, P- and E-sites. It has been shown that the E-site tRNA can influence decoding in the A-site [24–26]. In addition, it was shown in fungi that the combinations of three consecutive codons are biased and that some combinations are entirely absent from the ORFeomes. Therefore, bias might also be observed in codon pair usage where codons are separated by three nucleotides (the 1–3 pair). We observed only one conserved 1–3 interaction, over-representation of the pattern GGUnnnGGU, indicating that interactions between the ribosomal E- and A-sites do not influence the codon context as much as interactions between the P- and A-sites. It was shown that the usage of three neighbouring codons is species specific among fungi. Our results are consistent with that finding and suggest that this bias could also be species specific among bacteria and eukaryotes.
A conserved biased set of codon pairs was found in a dataset covering a large number of organisms from the three domains of life. Most of the pairs had stronger bias on the ORFeome level than on the whole genome level, suggesting that translation has a greater influence on codon pair biases than molecular mechanisms that shape the genomic DNA in general.
We selected 100 bacterial genomes randomly and complemented the random dataset so that all major phylogenetic classes were covered by at least one organism (resulting in 103 bacteria). For archaea we downloaded protein coding sequences for all sequenced genomes (28 genomes, year 2006). In addition, seven eukaryotic genomes were selected: six fungi and human. The protein coding sequences of all bacteria, archaea and fungi were retrieved from ftp://ftp.ncbi.nih.gov/genomes/. Human protein coding sequences were retrieved from Ensembl. The list of all genomes analyzed and the corresponding accession numbers are provided in [Additional file 6].
To compile standardized genomes we randomly selected sequences from the set of all protein coding sequences of the corresponding organism until 150,000 ± 1000 codon pairs were obtained.
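The sampling step can be sketched as follows. This is a minimal illustration rather than the authors' Perl implementation: the function name, the seed parameter and the rule of skipping any sequence that would overshoot the upper bound are all assumptions.

```python
import random


def standardized_sample(cds_list, target=150_000, tolerance=1_000, seed=0):
    """Pick coding sequences at random until the codon-pair total is in range.

    A sequence of k codons contributes k - 1 neighboring codon pairs.
    Returns (chosen_sequences, total_pairs). The total can fall short of the
    range if the organism has too few coding sequences.
    """
    rng = random.Random(seed)
    order = list(cds_list)
    rng.shuffle(order)
    chosen, total = [], 0
    for seq in order:
        n_pairs = max(len(seq) // 3 - 1, 0)
        if total + n_pairs > target + tolerance:
            continue  # would overshoot the upper bound; try another sequence
        chosen.append(seq)
        total += n_pairs
        if total >= target - tolerance:
            break  # inside the target window
    return chosen, total
```

Calling `standardized_sample(all_cds)` on the full set of coding sequences of one organism would return a subset holding between 149,000 and 151,000 codon pairs whenever enough sequence is available.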
For the observed values, we counted the number of all possible sense:sense and sense:stop codon pairs (61 × 64 = 3904 pairs) by computer. The initial expected value of a codon pair was calculated using the frequencies of single codons in protein coding sequences. The expected value for a codon pair in the ORFeome was normalized as previously described [16, 17]: the dipeptide bias was removed by multiplying the initial expected value of a codon pair by the normalization coefficient. The normalization coefficient was the ratio of the observed to expected frequencies of the corresponding dipeptide encoded by the codon pair.
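The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' program: sequences are assumed to be in-frame upper-case DNA, the standard genetic code is used for every organism, and the expected dipeptide frequency is assumed to be the product of the two amino acid frequencies. All names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_pair_ratios(cds_list):
    """Return observed/expected ratios for every neighboring codon pair seen."""
    codons, pairs = Counter(), Counter()
    for seq in cds_list:
        cod = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
        codons.update(cod)
        pairs.update(zip(cod, cod[1:]))
    n_codons, n_pairs = sum(codons.values()), sum(pairs.values())

    amino, dipeptides = Counter(), Counter()
    for codon, n in codons.items():
        amino[CODON_TABLE[codon]] += n
    for (c1, c2), n in pairs.items():
        dipeptides[CODON_TABLE[c1], CODON_TABLE[c2]] += n

    ratios = {}
    for (c1, c2), observed in pairs.items():
        a1, a2 = CODON_TABLE[c1], CODON_TABLE[c2]
        # Initial expectation from single codon frequencies.
        initial = codons[c1] / n_codons * codons[c2] / n_codons * n_pairs
        # Normalization coefficient: observed / expected dipeptide count.
        dipep_expected = amino[a1] / n_codons * amino[a2] / n_codons * n_pairs
        coefficient = dipeptides[a1, a2] / dipep_expected
        ratios[c1 + c2] = observed / (initial * coefficient)
    return ratios
```

With this normalization, the expected counts of all synonymous codon pairs that encode one dipeptide sum to the observed count of that dipeptide, so a ratio different from 1 reflects codon pairing rather than dipeptide preference.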
To separate translational effects from DNA-related effects influencing codon pair biases we compared the observed/expected ratios of a codon pair in ORFeomes and the corresponding hexanucleotide in genomes. We averaged the observed/expected values of each codon pair over all studied organisms for the comparison of ORFeomes and genomes.
The expected value of a hexanucleotide in a genome was calculated using the frequencies of trinucleotides in genomic sequences of that organism. The trinucleotide frequencies were counted by moving the window one nucleotide at a time. In genomes, the expected values of trinucleotide pairs containing out-frame UAA and UAG triplets on the sense and/or antisense strand were corrected by excluding the frames of coding regions where the given codon pair could not exist. Without this correction the expected values of pairs containing out-frame stops would be exaggerated, so the observed/expected ratio for the given codon pair would be underestimated.
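The genome-level ratio can be sketched as below. The sketch covers only the basic sliding-window calculation on a single strand; the correction for out-frame stop codons inside coding regions is deliberately left out, and the function name is hypothetical.

```python
from collections import Counter


def genome_hexamer_ratio(genome, hexamer):
    """Observed/expected ratio of a hexanucleotide in one genomic strand.

    Trinucleotides and hexanucleotides are counted with a window that moves
    one nucleotide at a time. Returns None when the expectation is zero.
    """
    trimers = Counter(genome[i:i + 3] for i in range(len(genome) - 2))
    n_trimers = sum(trimers.values())
    n_windows = len(genome) - 5
    observed = sum(genome[i:i + 6] == hexamer for i in range(n_windows))
    expected = (
        (trimers[hexamer[:3]] / n_trimers)
        * (trimers[hexamer[3:]] / n_trimers)
        * n_windows
    )
    return observed / expected if expected else None
```

Comparing this value with the ORFeome-level ratio of the same codon pair indicates whether the bias is stronger in coding sequences than in the genome as a whole.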
In each separate ORFeome, only under-represented codon pairs with observed/expected ratios ≤ 0.9 and over-represented codon pairs with observed/expected ratios ≥ 1.1 were subjected to the two-tailed Fisher's exact test. In all analyses, p-values of 0.01 or less were considered statistically significant. We applied no multiple-testing correction at this point. Codon pairs that were significantly biased in at least 51% of the organisms studied were marked as conserved.
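The conservation criterion can be sketched as a simple tally. The sketch assumes that the observed/expected ratio and the Fisher's exact test p-value of each codon pair have already been computed per organism, and that avoidance and preference are counted separately; the function name and the input layout are hypothetical, and the numbers in the usage example are invented for illustration.

```python
from collections import Counter


def conserved_pairs(results, min_fraction=0.51, alpha=0.01):
    """Return (avoided, preferred) sets of conserved biased codon pairs.

    results maps organism -> {codon_pair: (obs_exp_ratio, p_value)}.
    A pair counts for an organism only if it passes both the ratio
    threshold (<= 0.9 or >= 1.1) and the significance threshold.
    """
    avoided, preferred = Counter(), Counter()
    for per_pair in results.values():
        for pair, (ratio, p_value) in per_pair.items():
            if p_value > alpha:
                continue
            if ratio <= 0.9:
                avoided[pair] += 1
            elif ratio >= 1.1:
                preferred[pair] += 1
    cutoff = min_fraction * len(results)
    return (
        {p for p, n in avoided.items() if n >= cutoff},
        {p for p, n in preferred.items() if n >= cutoff},
    )


# Invented values for three hypothetical organisms.
example = {
    "org1": {"UUCGCA": (0.5, 0.001), "GGCGCC": (1.5, 0.001)},
    "org2": {"UUCGCA": (0.6, 0.005), "GGCGCC": (1.05, 0.001)},
    "org3": {"UUCGCA": (0.95, 0.2), "GGCGCC": (1.3, 0.5)},
}
print(conserved_pairs(example))
```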
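The z-score of each sub-pattern was calculated as

$$z_{i,n} = \frac{\log_2\left[(\mathrm{observed}/\mathrm{expected})_{i,n}\right]}{\sigma\left[\log_2(\mathrm{observed}/\mathrm{expected})_{n}\right]}$$

where $(\mathrm{observed}/\mathrm{expected})_{i,n}$ is the observed/expected ratio of a codon pair or sub-pattern $i$ of length $n$ and $\sigma[\log_2(\mathrm{observed}/\mathrm{expected})_{n}]$ is the standard deviation of the log2-transformed observed/expected ratios of all sub-patterns of length $n$.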
Comparison of the z-scores allowed the most biased nucleotide sub-pattern responsible for the bias of the codon pair to be identified.
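A minimal sketch of such a z-score comparison is given below. It assumes that the z-score of a sub-pattern is its log2 observed/expected ratio divided by the standard deviation of the log2 ratios of all sub-patterns of the same length, and it uses the population standard deviation; both choices are assumptions, as are the function name and the example ratios.

```python
import math
import statistics


def subpattern_z_scores(ratios):
    """Z-scores for sub-patterns of one length.

    ratios maps sub-pattern -> observed/expected ratio. At least two
    distinct ratios are needed, otherwise the standard deviation is zero.
    """
    logs = {p: math.log2(r) for p, r in ratios.items()}
    sigma = statistics.pstdev(list(logs.values()))
    return {p: value / sigma for p, value in logs.items()}


# Invented ratios for two dinucleotide-type sub-patterns.
print(subpattern_z_scores({"nnUAnn": 0.5, "nnGCnn": 2.0}))
```

Dividing by the standard deviation for each length makes sub-patterns with different numbers of fixed nucleotides comparable, since shorter patterns tend to have ratios closer to 1.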
The programs for all those calculations were written in Perl.
The bias of single codons was described by relative synonymous codon usage (RSCU). RSCU values are the number of times a particular codon is observed, relative to the number of times that the codon would be observed in the absence of any codon usage bias. To represent the bias of codon pairs, we calculated the relative dicodon usage (RDCU), which was based on the observed/expected ratios of four different sets of codon pairs: 1–2 (neighboring codons), 1–3 (codons separated by one intervening codon), 1–4 (codons separated by two intervening codons) and 1–10 (codons separated by eight intervening codons) as a control. Next, we measured the correlation of RSCU values between each pair of bacteria (Spearman's ρ). Similarly, the correlation between RDCU values in pairs of bacteria was calculated. All pairs of bacteria analyzed were divided into nine groups on the basis of the evolutionary distances between them. Pairwise evolutionary distances were retrieved as a 16S rRNA distance matrix from the Ribosomal Database Project. Finally, we calculated the average correlation coefficients for each of those groups.
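The RSCU definition can be illustrated with a short sketch. It assumes the standard genetic code and takes the expected count of a codon to be the mean count over its synonymous family; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in the conventional TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(codon_counts):
    """Relative synonymous codon usage for every sense codon of a used amino acid.

    codon_counts maps codon -> observed count. A value of 1.0 means the codon
    is used exactly as often as expected without any codon usage bias.
    """
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(codon_counts.get(c, 0) for c in family)
        if total == 0:
            continue  # amino acid absent from the data
        mean = total / len(family)
        for c in family:
            values[c] = codon_counts.get(c, 0) / mean
    return values


print(rscu({"GCT": 3, "GCC": 1}))
```

In this example alanine is encoded four times by four synonymous codons, so the expected count per codon is 1 and GCT receives an RSCU of 3.0.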
RSCU and RDCU trees were drawn using the corresponding correlation coefficients calculated previously: the higher the correlation, the shorter the distance between two organisms. For example, the distance between two organisms with identical codon usage would be 0 (1 − ρ² with ρ² = 1). Trees were calculated using the Fitch-Margoliash algorithm from PHYLIP software and were edited using TreeDyn software.
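The distance measure can be sketched without external libraries. The sketch computes Spearman's ρ as the Pearson correlation of average ranks and converts it to 1 − ρ²; the function names are hypothetical, and tree building itself (Fitch-Margoliash in PHYLIP) is not reproduced.

```python
import math


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation of two equally long value lists."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def usage_distance(x, y):
    """Distance between two organisms from their RSCU or RDCU vectors."""
    return 1 - spearman_rho(x, y) ** 2
```

Because ρ is squared, a perfectly inverse ranking would also give a distance of 0; this is a property of the 1 − ρ² formula as stated, not of the sketch.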
ORF: open reading frame
RSCU: relative synonymous codon usage
RDCU: relative dicodon usage
This work was funded by a workgroup grant 0182649s04 from the Estonian Ministry of Education and Science (MR), by The Wellcome Trust International Senior Fellowship (070210/Z/03/Z) (TT) and by the EU through the European Regional Development Fund through the Centre of Excellence in Genomics. The authors acknowledge Ülo Maiväli and Märt Möls for critical reading of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.