Increased incidence of rare codon clusters at 5' and 3' gene termini: implications for function
© Clarke and Clark. 2010
Received: 2 October 2009
Accepted: 18 February 2010
Published: 18 February 2010
The process of translation can be affected by the use of rare versus common codons within the mRNA transcript.
Here, we show that rare codons are enriched at the 5' and 3' termini of genes from E. coli and other prokaryotes. Genes predicted to be secreted show significant enrichment in 5' rare codon clusters, but not 3' rare codon clusters. Surprisingly, no correlation between 5' mRNA structure and rare codon usage was observed.
Potential functional roles for the enrichment of rare codons at terminal positions are explored.
The amino acid sequence of a protein is determined by the sequence of trinucleotide codons in its mRNA. The 20 most common amino acids are encoded by 61 different codons. With the exception of methionine and tryptophan, all of these amino acids are encoded by multiple codons, meaning that many different nucleotide sequences can encode an identical protein sequence. However, the selection of a particular coding sequence is not random. Instead, as a result of numerous forces, including GC bias, some codons are used more frequently than others. The higher demand for these common codons correlates with an increased production of their cognate tRNAs, leading to faster [1–3] and more accurate [4, 5] translation of common codons relative to their rare counterparts.
Yet if rare codons persisted only because of incomplete selection against the associated lower translational fidelity and protein yield, they would be expected to be randomly distributed throughout the open reading frames (ORFs) of the genome. However, this is not the case. Instead, rare codons often appear in large clusters. These clusters can cause translational pausing, which reduces the local protein translation rate. Rare codon clusters have been identified in genes of all functional classes in a wide variety of organisms.
The clustering of rare codons indicates that there are forces that influence the selection of rare codons within mRNA sequences. It has been suggested that rare codons could influence co-translational protein folding. For example, pausing synthesis of the nascent polypeptide chain could allow folding events to occur at protein domain boundaries, or accommodate slower-folding secondary structures [7–10]. However, other factors could also contribute to positive selection for rare codons within an mRNA sequence. For example, stable mRNA secondary structure, especially within the first 40 nucleotides at the 5' end of an open reading frame, could negatively affect protein expression by limiting access to the ribosome binding site or initiator methionine codon. For some sequences, strategic placement of one or more rare codons could disrupt 5' mRNA secondary structure. In this case, selective pressures against rare codons would be balanced by the selective pressure against mRNA structure, causing an enrichment of rare codons beyond what would be expected by random chance. It has also been suggested that, for genes encoding proteins bearing N-terminal signal sequences, 5' rare codons could have a functional role related to secretion, perhaps by transiently reducing translation rate prior to membrane localization of the nascent chain. Though there have been fewer discussions of possible beneficial roles for 3' rare codon clusters, these clusters could cause nascent polypeptide chains to dwell at the ribosome surface near the end of translation, which could allow for the association of molecular chaperones, other subunits of a multimeric protein, partner proteins, or factors involved in targeting or degradation.
Here, we examine the abundance of rare codon clusters at gene termini and other locations, revealing an enrichment of rare codons at both the 5' and 3' end of ORFs from E. coli and other prokaryotes. We examine possible roles for these rare codon clusters in protein biogenesis.
To quantify the relative rareness of codons used across an entire ORFeome, we used the previously developed %MinMax algorithm [6, 15]. %MinMax determines the relative commonness or rareness of an mRNA sequence, given the constraints of the underlying protein sequence and the relative abundances of the codons in a particular organism. In contrast to %MinMax, other methods to quantify codon usage have focused on the relative commonness of codons, which is useful for estimating expression levels but is not designed for investigating the presence of rare codons or translation rate. Similarly, methods that use intracellular tRNA concentrations to estimate translation speed [17, 18] are limited to the small number of organisms with measured tRNA concentrations and must take additional measures to account for the differences in translation speed for tRNAs that bind to multiple codons, such as the 3.4-fold difference in translation rates by the same tRNA for the glutamic acid codons GAA versus GAG.
The %MinMax algorithm identified significant enrichment of %Min windows relative to a random distribution of codons, indicating significant clustering of rare codons throughout the ORFeomes of several organisms, including E. coli, H. sapiens, A. thaliana and S. cerevisiae. %MinMax can be performed on any sequence from any organism with enough sequence data to accurately determine codon usage frequencies. The rare codon clusters identified with %MinMax correlate with experimentally determined translation pause sites.
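The window calculation behind %MinMax can be illustrated with a minimal sketch. It assumes the usual formulation, in which the mean usage frequency of the codons actually present in a window is compared with the means obtained from the most common, least common, and average synonymous codons for the same amino acids. The codon frequencies below are hypothetical placeholders, and the 18-codon default window size is an assumption; neither is taken from the text above.

```python
from statistics import mean

# Hypothetical usage frequencies (per thousand codons), grouped by amino acid.
# Placeholder numbers for illustration only, not a real organism's table.
USAGE = {
    "L": {"CTG": 50.0, "TTA": 14.0, "CTA": 4.0},
    "K": {"AAA": 34.0, "AAG": 11.0},
    "E": {"GAA": 39.0, "GAG": 18.0},
}
CODON_TO_AA = {codon: aa for aa, codons in USAGE.items() for codon in codons}


def percent_min_max(window):
    """Score one window of codons: +100 is all-common, -100 is all-rare."""
    families = [USAGE[CODON_TO_AA[codon]] for codon in window]
    actual = mean(fam[codon] for fam, codon in zip(families, window))
    highest = mean(max(fam.values()) for fam in families)
    lowest = mean(min(fam.values()) for fam in families)
    average = mean(mean(fam.values()) for fam in families)
    if highest == lowest:  # no synonymous choice anywhere in the window
        return 0.0
    if actual >= average:
        return 100.0 * (actual - average) / (highest - average)  # %Max
    return -100.0 * (average - actual) / (average - lowest)      # %Min


def min_max_profile(orf, window_size=18):
    """Slide a window one codon at a time along an in-frame ORF sequence."""
    codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    return [
        percent_min_max(codons[i:i + window_size])
        for i in range(len(codons) - window_size + 1)
    ]


# Toy ORF: three common codons followed by three rare ones (window of 3).
profile = min_max_profile("CTGAAAGAACTAAAGGAG", window_size=3)
```

In this toy profile the first window contains only the most common codons and the last only the rarest, so the scores run from the %Max extreme to the %Min extreme. A real analysis would substitute the organism's full codon usage table.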
We also examined the prevalence of rare codon clusters at 3' gene termini. At the 3' end, 490 genes (21.6% of total) have a rare codon cluster within the last 50 windows and 350 genes (15.5% of total) have a rare codon cluster within the last 25 windows (Figure 3). By comparison, there are 355 (15.7%) and 242 (10.7%) genes with rare codon clusters between windows 101-150 and 151-200 from the 3' terminus, respectively (p < 0.0001 in both cases, Fisher's exact, two-tailed test).
We also investigated whether the rare codon clustering observed at either terminus was due to any particular rare codon or subset of rare codons (Additional File 1, Figure S1). Briefly, for 5' rare codon clusters, genes were separated into two groups: those with 5' rare codon clusters and those without 5' rare codon clusters. The codon usage for the first 50 codons, the final 50 codons, and the interior codons was tallied. Two separate 2 × 2 contingency tables were then constructed, using the terminal codon usage (5' in one table, 3' in the other) and interior codon usage as columns and the genes with 5' rare codon clusters and without 5' rare codon clusters as rows. A chi-square test with Yates' correction was used to calculate the p-value for the distribution. This process was repeated for genes with 3' rare codon clusters, predicted signal sequences, and predicted secreted genes. The 5' codon usage for genes predicted to be secreted or predicted to have a signal sequence shows statistically significant enrichment for certain codons; however, this primarily reflects amino acid usage, not codon selection. Tryptophan, aspartic acid, asparagine, glutamine and tyrosine are all under-represented in signal sequences, while cysteine and alanine are over-represented. There is under-enrichment of the most common glycine, arginine and threonine codons, though no rare codons are specifically over-enriched.
There were also some specific and statistically significant correlations that came from the analysis of 5' rare codon clusters: the two common glycine codons, the second most common arginine codon, the most common glutamic acid codons, the most common threonine codon, the most common valine codon and the most common leucine codon were all under-enriched at the 5' terminus in genes with 5' rare codon clusters. This should not be surprising, as selecting for a subset of genes with rare codons would tend to decrease the number of very common codons. No specific subset of rare codons was enriched, however, indicating that the effect is not for a specific subset of codons but rather for the quality of rareness independent of any particular amino acid or codon. At the 3' terminus, there was enrichment for the rare codons CAA and GCT, though the slight enrichment of only two codons is unlikely to create the broad effect observed here.
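The per-codon contingency test described above can be sketched as follows. This is a minimal illustration of a 2 × 2 chi-square test with Yates' continuity correction, not the authors' code; the function name and the example counts are hypothetical.

```python
import math


def yates_chi_square(a, b, c, d):
    """Chi-square statistic and p-value (1 d.f.) for the table [[a, b], [c, d]].

    Rows could be genes with / without a 5' rare codon cluster; columns could be
    terminal / interior occurrences of one codon (hypothetical layout).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    # Yates' correction subtracts n/2 from |ad - bc|, floored at zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p_value = yates_chi_square(20, 10, 10, 20)
```

Because one test is run per codon, a real analysis would also need to consider a correction for multiple comparisons; the text does not say whether one was applied.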
While significant enrichment of rare codons extends 22 windows (40 codons) from the 5' end of genes in E. coli, the ribosome exit tunnel restricts the conformations and interactions of the 20-40 amino acid residues most recently synthesized by the ribosome [21, 22]. Hence, a rare codon cluster near the 5' end of an mRNA sequence would induce translational pausing at positions where little if any of the nascent chain has exited the ribosomal exit tunnel, meaning that a pause at this point would have very little effect on the co-translational folding of the nascent chain. The clustering of rare codons at the 5' end of coding mRNA sequences could instead reflect a selection against mRNA secondary structure, rather than selection related to translation rate and the appearance of the nascent chain. While the ribosome has an intrinsic helicase activity that can unwind mRNA secondary structure in the coding region, 5' secondary structure could obscure the ribosome binding site or interfere with translation initiation.
Approximately 18% of E. coli ORFs encode a protein with a predicted N-terminal signal sequence, which is used to transport the encoded protein out of the cell cytoplasm. To examine the influence of signal sequences on the enrichment of rare codon clusters at 5' gene termini, we examined the 454 ORFs longer than 250 windows that are predicted to contain an N-terminal signal sequence (as determined by SignalP 3.0). Of these 454 genes, 172 (37.8%) contain a rare codon cluster within the first 50 codon windows, compared to 574 (31.7%) of the 1808 genes without a signal sequence. This small but significant enrichment of rare codon clusters (p = 0.0140, Fisher's exact, two-tailed test) is not seen at 3' termini (113 of 478 genes with a signal sequence have a rare codon cluster in the last 50 codon windows, compared to 424 of 1939 without a signal sequence; p = 0.4246). Therefore, the modest enrichment of rare codons is specific for the 5' end of genes with signal sequences, rather than a general position-independent enrichment of rare codon clusters in all genes with N-terminal signal sequences. Genes predicted to be secreted by SecretomeP, an algorithm that searches for motifs originally identified in secreted genes lacking a signal sequence, show a similar enrichment. For the 526 genes classified as secreted with SecretomeP, there is a significant enrichment of rare codon clusters at 5' gene termini (p < 0.0001), with minimal enrichment at 3' gene termini (p = 0.0373) (Figure 3). The SecretomeP dataset does, however, overlap with the SignalP dataset, with 378 genes appearing in both.
Genes were also examined using their assigned functional categories from the JCVI CMR. Most gene class assignments showed no association with rare codon clusters, either enrichment or under-representation. Hypothetical genes showed a non-specific enrichment of rare codon clusters, with enrichment at all positions (p < 1 × 10⁻⁵) being reflected in the 5' (p < 1 × 10⁻⁷) and 3' (p < 1 × 10⁻⁶) enrichment. Genes assigned to nucleotide (p = 0.0093) and amino acid biosynthesis (p = 0.00025) categories showed a general under-representation of rare codon clusters, as might be expected for categories containing primarily highly expressed genes. The only category that showed a significant orientation-specific effect was energy metabolism, which contained significantly fewer than expected rare codon clusters at the 5' end (p < 1 × 10⁻⁵), but not in general or at the 3' terminus. The relationship between rare codon clusters and the gene expression level was also examined, using expression levels reported in the NCBI GEO database. Neither the 530 most highly expressed nor the 527 least expressed genes showed any statistically significant correlation with the presence or absence of rare codon clusters in general or at either terminus.
The population of rare codon clusters is markedly higher at the termini relative to the non-terminal positions
Percent of genes with rare codon clusters a
2.35 ± 0.32
0.96 ± 0.24
0.47 ± 0.15
0.60 ± 0.16
3.52 ± 0.37
4.04 ± 0.37
0.14 ± 0.06
B. melitensis 16 M
0.86 ± 0.21
Burkholderia sp. 383
0.36 ± 0.09
0.26 ± 0.17
11.06 ± 1.10
0.92 ± 0.20
3.59 ± 0.42
0.73 ± 0.31
2.45 ± 0.63
Nostoc sp. PCC 7120
2.39 ± 0.30
0.85 ± 0.16
R. metallidurans CH34
0.57 ± 0.14
2.47 ± 0.29
2.55 ± 0.32
2.63 ± 0.65
0.42 ± 0.11
0.16 ± 0.11
0.25 ± 0.14
3.21 ± 0.63
3.18 ± 0.83
The mechanism of translation is very different in eukaryotes versus prokaryotes. Therefore, perhaps not surprisingly, the rare codon enrichment reported above for many prokaryotic organisms is not observed in eukaryotes. The human ORFeome, for instance, shows a decrease in rare codon clusters at 5' gene termini, with the percentage of windows with rare codon clusters dropping from 11.54% at non-terminal positions to 8.14% at the extreme 5' terminus. Trypanosoma brucei shows a decrease in rare codon clusters at 3' gene termini (8.6% relative to 9.9% at non-terminal positions). Some genomes, such as A. thaliana, show no significant changes at either terminus. Cryptococcus neoformans shows a significant 3' increase (21.24% relative to 11.06% at non-terminal positions), though no significant difference is observed at the 5' end.
Determining the role(s) of rare codons in protein biogenesis is complicated by literature reports that describe the negative effects of rare codon clusters, particularly at 5' termini, while also reporting examples of rare codons improving protein expression, increasing or altering protein activity and being conserved through evolution. Here, we have examined the distribution of rare codons along gene sequences, for different protein classes, in order to identify general forces that could shape rare codon usage.
In the absence of any selection, rare codons and codon clusters would appear randomly throughout the ORFeome. By contrast, our results show that rare codon clusters are more likely to appear at the 5' and 3' ends of E. coli genes than at non-terminal positions. In particular, genes containing signal sequences showed enrichment at the 5' end of genes, but not the 3' end. This orientation-specific effect suggests a functional usage of rare codons. For example, in eukaryotes, the signal recognition particle (SRP) can pause translation of secreted proteins to facilitate their translocation into the endoplasmic reticulum. If SRP is absent, nascent chains are unable to properly engage the translocon, and unprocessed polypeptides accumulate in the cytoplasm. Slowing the rate of translation with antibiotics can counteract the effects of deleting SRP. In a similar manner, it is possible that the prokaryotic 5' rare codon clusters could serve as an alternative route to the same goal: rare codons might represent an SRP-independent mechanism to reduce local translation rates, allowing the ribosome:nascent chain complex to localize to the membrane and facilitating the recognition of exposed signal sequences by the secretion machinery. This process, working in concert with SRP, could increase the efficiency of co-translational or immediate post-translational secretion of nascent polypeptides, preventing the accumulation of transmembrane or secreted polypeptides in the cytoplasm, where their folding and/or secretion may not be as efficient.
Rare codon clusters that occur before any nascent chain sequence has emerged from the ribosomal tunnel could aid secretion via a mechanism independent of signal sequence recognition. For example, if ribosomes bind to an mRNA in rapid succession and are able to translate the sequence without pausing, the resulting polysome will contain multiple ribosomes closely spaced together in both sequence and physical distances. This would increase the local competition for secretory complexes, which could lead to a decrease in secretion efficiency as the accumulating polypeptides are degraded or aggregate during the extended wait for secretion initiation. However, the introduction of a rare codon cluster could space these ribosomes further apart from each other along the mRNA sequence. The ribosomes would be forced to stack up as the first of the group reached the slowly translated section, but after the first ribosome passed through the pause, the second one would enter, slowly translating as the first ribosome more rapidly translated the more common downstream sequence, and this process would repeat for all subsequent ribosomes. This staggering of ribosomes with rare codon clusters could potentially alleviate local competition for translocation complexes and increase the efficiency of secretion.
We also examined an alternative explanation for positive selection of 5' rare codon clusters. Stable mRNA secondary structure can inhibit protein expression by interfering with the initiation of translation, and it has been suggested that rare codons might be employed at the 5' terminus to destabilize these structures. Yet a comparison of ORFs containing rare codon clusters at the 5' end versus those without clusters revealed that potential mRNA secondary structure is independent of rare codon clusters; the distribution of thermodynamic stabilities is similar in both sets of genes. Indeed, while it is possible to increase expression by altering codon usage to prevent secondary structure at the 5' end of genes, several alternate methods exist that can accomplish this same goal in vivo, such as optimizing the ribosome binding site to increase translation initiation and maximize ribosome coverage of the mRNA (and, by extension, reducing mRNA secondary structure), or strengthening the promoter to increase mRNA levels to offset diminished protein production per mRNA. Furthermore, destabilizing 5' mRNA structure need not require significant synonymous substitutions to rare codons. The 5' synonymous nucleotide sequences generated by random reverse translations formed on average more stable secondary structures than the wild type sequences, suggesting that selective pressure against mRNA secondary structures might exist. Yet the wild type mRNA sequences achieve this lower stability without a significant change in the distribution between %Min and %Max. Hence it appears that the selective pressure against 5' secondary structure can be resolved without invoking a significant increase in rare codons: the introduction of a few synonymous codons, rare or not, is sufficient to destabilize mRNA 5' secondary structure.
As a result, we conclude that the presence of significant rare codon clusters at 5' gene termini is not linked to the elimination of secondary structure, but instead to other possible functional effects.
In contrast to the 5' end, few published hypotheses exist to explain the increase in rare codon clusters observed at the 3' terminus. Some proteins, such as tailspike from S. typhimurium phage P22, have been shown to dwell on the ribosome post-translationally. If dwelling on the ribosome aids in tailspike folding, it is possible that a 3' cluster of rare codons, which would give proteins that fold slowly additional time to fold before release from the ribosome, could replicate this mechanism. Codon usage can be altered without any constraints on the underlying amino acid sequence, which would allow any potential sequence to prolong its association with the ribosome without relying on a potentially sequence-dependent interaction with the ribosomal surface. Pausing at the C-terminus of the nascent polypeptide could also allow co-factors or chaperones to bind to the nearly complete sequence of the nascent polypeptide. It has also been suggested that 3' rare codons could serve as a signal for tagging by SsrA, the absence of which has been shown to negatively impact the expression of certain genes.
In conclusion, rare codon clusters are non-randomly localized and enriched at E. coli gene termini. Moreover, similar terminal enrichment was detected for numerous other prokaryotic organisms, and across diverse protein types, indicating potential functional roles for rare codons in protein biogenesis, folding, secretion and interactions with partner proteins.
The E. coli K12-MG1655 ORFeome, containing 4288 ORFs, was obtained from the JCVI CMR database. The remaining prokaryotic ORFeomes (Agrobacterium tumefaciens, Bacillus anthracis, Bacillus cereus, Bacillus subtilis ‡, Bacteroides fragilis, Bordetella pertussis, Brucella melitensis 16 M, Burkholderia sp. 383, Coxiella burnetii, Deinococcus radiodurans, Erwinia carotovora, Helicobacter pylori ‡, Neisseria meningitidis, Nostoc sp. PCC 7120, Pseudomonas fluorescens, Ralstonia metallidurans CH34, Salmonella enterica, Salmonella typhimurium, Shigella flexneri, Sinorhizobium meliloti, Staphylococcus aureus ‡, Thermus thermophilus, Xylella fastidiosa and Yersinia pestis) were also obtained from the JCVI CMR. Eukaryotic ORFeomes for T. brucei, A. thaliana and C. neoformans were obtained from the annotated databases at JCVI. The human ORFeome was taken from DFCI-CCSB at Harvard. All windows that contained a non-ATGC base were eliminated. ORFs longer than 250 windows were extracted and analyzed for the presence or absence of rare codon clusters. For each position x, the xth window of ORFs was considered to be a rare codon cluster if the %Min value was at least -10%Min, the point where enrichment of rare codons becomes statistically significant in E. coli. The threshold was increased for organisms where -10%Min was not statistically significant to a %Min value that was statistically significant. The three ORFeomes that did not have any statistically significant %Min values (‡) were evaluated using the -10%Min threshold. Codon-biased random reverse translations were created by generating synonymous gene sequences composed of synonymous codons randomly selected using a table weighted for codon usage frequency.
The minimum folding energies (ΔGfolding) for the first 40 nucleotides from each of the 4288 E. coli ORFs were calculated using UNAFold with the default setting of 37°C.
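The codon-biased random reverse translation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the usage weights are hypothetical placeholders covering only four amino acids, and the function name is invented for the example.

```python
import random

# Hypothetical codon usage weights per amino acid (illustrative values only).
USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.76, "AAG": 0.24},
    "L": {"CTG": 0.50, "TTA": 0.13, "TTG": 0.13,
          "CTT": 0.10, "CTC": 0.10, "CTA": 0.04},
    "E": {"GAA": 0.69, "GAG": 0.31},
}


def random_reverse_translate(protein, usage, rng):
    """Pick one synonymous codon per residue, weighted by usage frequency."""
    codons = []
    for residue in protein:
        family = usage[residue]
        codon = rng.choices(list(family), weights=list(family.values()), k=1)[0]
        codons.append(codon)
    return "".join(codons)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
synonymous = [random_reverse_translate("MKLEL", USAGE, rng) for _ in range(5)]
```

Every generated sequence encodes the same protein, so differences between the random set and the wild type sequence reflect codon choice alone.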
For each mRNA ΔGfolding value, the population count of ORFs with rare codon clusters was paired with the population count of ORFs without rare codon clusters. These paired values were graphed and a linear regression was performed, leading to a correlation coefficient of 0.73. To account for the small difference in the median stability of sequences with versus without rare codon clusters, the regression was repeated with offsets between the data sets ranging from +2 kcal/mol to -2 kcal/mol in increments of 0.1 kcal/mol. An offset of -0.4 kcal/mol, consistent with the differences in the medians, produced the maximum correlation coefficient (0.7945) and is the value reported in the text.
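One way to read the offset scan described above is sketched below. It is an interpretation, not the authors' procedure: it assumes that ΔGfolding values are binned to 0.1 kcal/mol, that the offset is applied to the set with rare codon clusters, and that bins missing from one set count as zero. All function names are hypothetical.

```python
from collections import Counter
import math


def histogram(dg_values):
    """Count ORFs per ΔG bin; bins are integer tenths of a kcal/mol."""
    return Counter(round(value * 10) for value in dg_values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length count lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_offset(dg_with, dg_without, max_shift=20):
    """Scan offsets from -2.0 to +2.0 kcal/mol (0.1 steps) applied to dg_with."""
    with_counts, without_counts = histogram(dg_with), histogram(dg_without)
    best = None
    for shift in range(-max_shift, max_shift + 1):
        bins = sorted(set(without_counts) | {b + shift for b in with_counts})
        xs = [with_counts.get(b - shift, 0) for b in bins]
        ys = [without_counts.get(b, 0) for b in bins]
        r = pearson(xs, ys)
        if best is None or r > best[1]:
            best = (shift / 10, r)
    return best  # (offset in kcal/mol, correlation coefficient)


# Hypothetical ΔG values: the second set is the first shifted by -0.4 kcal/mol.
dg_with = [-1.0, -1.0, -2.0, -2.0, -2.0, -3.5, -0.5]
dg_without = [value - 0.4 for value in dg_with]
offset, r = best_offset(dg_with, dg_without)
```

With real data the two histograms would not align perfectly at any offset; the scan simply reports the shift that maximizes their correlation.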
To determine whether rare codons were significantly enriched at gene termini for certain types of proteins, populations with or without a certain characteristic, e.g., a predicted signal sequence, were evaluated for the presence or absence of either 5' or 3' rare codon clusters. A p-value was calculated from a 2 × 2 contingency table using a Fisher's exact, two-tailed test.
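A two-tailed Fisher's exact test on a 2 × 2 table can be written with the standard library alone. The sketch below is illustrative rather than the authors' implementation; it uses the common convention of summing the probabilities of all tables no more likely than the observed one, which is an assumption about how the two-tailed p-value was defined. The example counts are the signal sequence figures quoted in the results, for which the text reports p = 0.0140.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Unnormalised hypergeometric weights, kept as exact integers.
    weights = {k: comb(row1, k) * comb(row2, col1 - k) for k in range(lo, hi + 1)}
    observed = weights[a]
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / sum(weights.values())


# 172 of 454 ORFs with a predicted signal sequence and 574 of 1808 ORFs
# without one carry a 5' rare codon cluster (counts taken from the results).
p_signal = fisher_exact_two_tailed(172, 454 - 172, 574, 1808 - 574)
```

Keeping the weights as integers avoids floating-point ties when deciding which tables are at least as extreme as the observed one.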
The expression level data for wild-type K12 E. coli were obtained from the NCBI Gene Expression Omnibus, accession numbers GSE1730 and GSE1735. The Cy5/Cy3 ratio, representing the mRNA abundance divided by the genome DNA reference, was used to rank expression levels. The eight separate datasets using wild-type cells grown in LB media were averaged together. The 530 genes with an average Cy5/Cy3 ratio greater than 2.0 were used for the highly expressed dataset while the 527 genes with an average Cy5/Cy3 ratio less than 1.15 were used for the least expressed dataset.
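The selection of the two expression sets can be sketched as follows, assuming each gene has one Cy5/Cy3 ratio per dataset. The gene names and ratios are hypothetical; only the cutoffs of 2.0 and 1.15 come from the text.

```python
from statistics import mean


def split_by_expression(ratios_by_gene, high_cutoff=2.0, low_cutoff=1.15):
    """Average the per-dataset ratios, then select the high and low sets."""
    averages = {gene: mean(ratios) for gene, ratios in ratios_by_gene.items()}
    high = sorted(g for g, avg in averages.items() if avg > high_cutoff)
    low = sorted(g for g, avg in averages.items() if avg < low_cutoff)
    return high, low


# Hypothetical Cy5/Cy3 ratios from eight datasets per gene.
example = {
    "geneA": [2.4, 2.1, 2.6, 2.2, 2.5, 2.3, 2.0, 2.7],
    "geneB": [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 1.0, 0.9],
    "geneC": [1.5, 1.6, 1.4, 1.5, 1.7, 1.3, 1.5, 1.6],
}
high, low = split_by_expression(example)
```

Genes whose average falls between the two cutoffs, such as geneC here, belong to neither set.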
ΔGfolding: the measured ΔG of folding for the RNA secondary structure as determined by UNAFold
JCVI CMR: J. Craig Venter Institute Comprehensive Microbial Resource, formerly TIGR (The Institute for Genomic Research) CMR
ORF: open reading frame
ORFeome: a collection of all the ORFs for a particular organism
This project was supported by an award from the NIH (GM74807).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.