Analysis of the distribution of functionally relevant rare codons
© Widmann et al; licensee BioMed Central Ltd. 2008
Received: 15 February 2008
Accepted: 05 May 2008
Published: 05 May 2008
The substitution of rare codons with more frequent codons is a commonly applied method in heterologous gene expression to increase protein yields. However, in some cases these substitutions lead to a decrease of protein solubility or activity. To predict these functionally relevant rare codons, a method was developed which is based on an analysis of multisequence alignments of homologous protein families.
The method successfully predicts the experimentally determined, functionally relevant rare codons in fatty acid binding protein and chloramphenicol acetyltransferase. However, the analysis of 16 homologous protein families belonging to the α/β hydrolase fold showed that functionally relevant rare codons share no common location with respect to tertiary and secondary structure.
A systematic analysis of multisequence alignments of homologous protein families can be used to predict rare codons with a potential impact on protein expression. Our analysis showed that most genes contain at least one putative rare codon rich region. Rare codons located near those regions should be excluded when protein expression is to be improved by exchanging rare codons for more frequent codons.
The usage of codons is not random and differs between organisms and genes. Depending on the strength of an organism's translational selection, there is a bias in highly expressed genes to avoid rare codons because of the low concentration of the respective tRNA in the cell, which results in a decrease of translation rates. As a consequence, genes with a high percentage of rare codons are generally translated at a lower rate than genes with a low percentage of rare codons. Therefore, in an effort to increase the yield of recombinant proteins, rare codons have been replaced by more frequently used codons, which led to increased yields of active protein [4, 5].
However, gene redesign can also lead to abnormal protein folding and thus to a decrease in protein solubility as well as a decrease in protein activity [7, 8]. It has been suggested that differences in translational speed and the occurrence of pauses in translation are tightly linked to the folding mechanism of the respective protein [9, 10], with clustered rare codons having a greater effect on translational speed than separated rare codons. Thus, optimal expression seems to be the consequence of a delicate balance between the occurrence and position of frequent and rare codons, and the effect on the expression level of replacing rare codons by frequent codons is not obvious. The goal of this work was to classify rare codons as critical or non-critical for expression of a given gene product. Non-critical rare codons could then be safely replaced by more frequent codons, while critical rare codons should not be replaced.
We hypothesize that critical rare codons can be predicted by comparing the codon usage of homologous proteins in a multisequence alignment. Therefore, we developed a new, cutoff-independent approach to assign critical rare codons, which compares the observed codon composition of one column in a multisequence alignment to all possible alternative combinations of synonymous codons. Because the folding pathway of homologous proteins is assumed to be similar, rare codon rich regions (RCRRs) which play a critical role in protein folding should be conserved in all members of a protein family. Since there is an increased probability of finding rare codons in loop and linker regions, the location of RCRRs with respect to secondary structure elements was analyzed.
This analysis was applied to two proteins for which it was experimentally shown that an exchange of rare codons with more frequent, synonymous codons reduces activity [6, 8]. The analysis of RCRRs was then extended to systematically analyse a complete fold family. Sixteen protein families with a common α/β hydrolase fold were investigated to predict RCRRs, to localize them with respect to secondary and tertiary structure, and to identify possible RCRRs that are conserved in all members of the fold family.
Fatty acid binding protein family
Chloramphenicol acetyltransferase protein family
α/β hydrolase families
Table 1: Homologous protein families from the Lipase Engineering Database (columns: homologous family name, No. of RCRRs, No. of sequences). Family names: Ccg1/TafII250-interacting factor B like; BioH protein like; Burkholderia cepacia lipase like; Chloroflexus aurantiacus lipase like; Palmitoyl-protein thioesterase 1 like; Rhizomucor miehei lipase like; Serine carboxypeptidase II like.
Number of predicted RCRRs in four groups of secondary structure elements.
Cutoff-independent and unbiased prediction of rare codon rich regions
In most genes an exchange of rare codons with synonymous, more frequent codons is neutral or even increases the yield of soluble protein [4, 13, 14]. For some genes, however, it has been observed that such an exchange surprisingly leads to an increase of incorrectly folded protein [6, 8, 15]. Therefore, we based our investigation on the hypothesis that there might exist rare codons which have a regulatory function in translation and contribute to the correct folding pathway of a protein. Because the members of a homologous family, and probably also of a fold family, are expected to have a similar folding pathway, there should be an evolutionary bias towards the conservation of these critical rare codons. Because we only analyse synonymous codons, we restrict our analysis to the observed amino acid sequence. Thus, a possible effect on the expression level upon exchange of an amino acid is not considered by our analysis.
A rare codon is usually defined by a low usage frequency. Two types of rare codons have to be distinguished: (1) rare codons that code for an amino acid that is also encoded by more frequent codons (e.g. the arginine codon AGG) and (2) rare codons of amino acids (e.g. W, Y, H) that are encoded by only one or two rare codons. Our rare codon analysis identifies the first type of rare codons. While these rare codons are supposed to be the result of a significant evolutionary pressure towards using a rare codon instead of a frequent codon at the respective position, the second type of rare codons is strongly biased toward positions with highly conserved amino acids that are encoded exclusively by rare codons. For many organisms, codon usage tables are available. However, a generally applicable distinction between rare and frequent codons is not available, and the result of an analysis based on such a distinction would depend on the choice of an arbitrary cutoff value. Therefore, we have developed a cutoff-independent approach to assign rare codons by comparing the observed codon composition of one column to all possible alternative combinations of synonymous codons. For each column a quantitative rare codon score is derived. Instead of single columns, a sliding window of 9 columns is evaluated, because up to 27 nucleotides are involved in binding to the ribosome during translation and a cumulative effect of neighbouring rare codons is expected.
Location of rare codon rich regions
It has been suggested that there is an increased tendency for rare codons to occur in loop and linker regions [8, 9, 18]. For the two proteins examined here for RCRRs, functionally relevant rare codons have been identified experimentally: their exchange for more frequent codons led to a decrease of expressed active protein. Interestingly, in the gene coding for a fatty acid binding protein, the functionally relevant rare codons are located in a loop region, while in the second gene, the chloramphenicol acetyltransferase, they are located in a loop/β-strand region. The observation that functionally relevant rare codons are located in both loop and secondary structure regions is confirmed by our analysis of rare codon rich regions, which predicts about 50% of RCRRs each in loop regions and in secondary structure regions, both for the two experimentally examined genes and for the 16 α/β hydrolase families. However, because our prediction of RCRRs is restricted to regions with a sufficient conservation of amino acids, highly diverse regions are excluded from the analysis. Therefore, functionally relevant rare codons could not be predicted if they were located in highly variable loop regions.
In the two experimentally investigated genes, RCRRs were predicted in regions linking the two halves of the β-barrel in the fatty acid binding protein and the α and β layer in the chloramphenicol acetyltransferase. Thus it is tempting to associate RCRRs with regions that link two separate folding domains. However, our systematic analysis of 16 α/β hydrolase families provides a more complex picture. Although all families are of the same fold and thus are expected to have a similar folding pathway, the RCRRs are nearly equally distributed over the representative α/β hydrolase fold.
This holds true even when a more stringent cutoff is applied and RCRRs close to the minimal score requirement are eliminated. Taking all RCRRs into account, only two areas with an increased density of RCRRs are found: the region encompassing helix D, with 4 RCRRs from 6 different families, and the loop region connecting β-strand 3 to helix B, with 3 RCRRs from 6 different families. However, the region encompassing helix D is highly variable among the α/β hydrolase families and consists of a varying number of strands and helices. The loop region connecting β-strand 3 to helix B connects the two halves of the β-sheet, each consisting of 4 β-strands. Thus, there seems to be no common region in which RCRRs are located in all α/β hydrolases. In addition, 50% of all α/β hydrolase families contain no RCRRs at all. This observation can be explained by one of three possibilities: (1) There are no rare codons which are structurally conserved in all α/β hydrolases and are essential to control folding. However, RCRRs were found in individual homologous families. (2) α/β hydrolases do not have a common folding pathway. While there is evidence that proteins sharing the same fold also share a common folding pathway [19, 20], this observation was based on a small set of proteins and therefore cannot be generalized. Indeed, there are some studies showing that proteins sharing a common structure follow different folding pathways in vitro [21, 22]. (3) The level of translational selection might differ among species. In most organisms highly expressed genes seem to contain a higher percentage of frequently used codons, while in 30% no such codon bias was found [23, 24]. However, this type of analysis averages over the whole gene and therefore does not take local conservation of rare codons into account.
As it has been shown experimentally that replacing rare codons by more frequent codons in proximity to a RCRR can lead to a decrease in protein expression, the analysis of RCRRs could be helpful in predicting those critical rare codons which are probably beneficial to expression and should not be a target for codon replacement.
However, it seems that a prediction of RCRRs has to be restricted to single homologous families.
In most cases the substitution of rare codons with more frequent codons leads to increased protein yields in heterologous gene expression. To predict functionally relevant rare codons, multisequence alignments were analyzed to identify conserved rare codon rich regions. The prediction was validated by experimental data on silent mutations of two proteins. Therefore, we suggest that an approach to improving protein expression by exchanging rare codons for more frequent codons should exclude rare codons located in highly conserved rare codon rich regions. A systematic analysis of 16 α/β hydrolase families predicts that most genes contain at least one putative rare codon rich region. These regions are, however, not restricted to loop regions but also occur in secondary structure elements. In addition, no preferred location of rare codon rich regions was found with respect to the common α/β hydrolase fold.
Two proteins were analysed which show decreased activity upon replacement of rare by frequent codons: fatty acid binding protein from Echinococcus granulosus and chloramphenicol acetyltransferase III from Escherichia coli.
The protein and DNA sequences of proteins homologous to fatty acid binding protein and chloramphenicol acetyltransferase III were retrieved from GenBank by a BLAST search starting with the entries GenBank:Q02970 and GenBank:NP_073222, respectively. Only proteins from different organisms and with a sequence identity between 35% and 80% were selected for the subsequent multisequence alignment.
Protein and DNA sequences of 16 protein families (Tab. 1) with 7 or more proteins per family were extracted from the Lipase Engineering Database. The family classification scheme of the Lipase Engineering Database was used, which led to some families with overall sequence identities of only 20%. For 14 families representative structures were available in the PDB. Families with more than 10 members were reduced in size by excluding proteins from the same organism if possible; otherwise, sequences with the lowest sequence identity were removed.
A multisequence alignment of the protein sequences of each protein family was constructed using ClustalW with a Gonnet matrix and gap opening and extension penalties of 10 and 0.2, respectively. For each protein sequence, the DNA sequence was retrieved and codons were assigned to the respective amino acids in the multisequence alignment.
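The codon assignment step can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function name `codons_for_alignment`, the use of `-` as the gap character, and a coding sequence that starts in frame at the first aligned residue are all hypothetical choices, not details given in the text.

```python
def codons_for_alignment(aligned_protein, cds):
    """Return one codon per alignment column, or None where the row has a gap.

    aligned_protein: one row of the protein alignment, gaps written as '-'.
    cds: the matching coding DNA sequence, assumed to start in frame.
    """
    # Split the DNA into complete triplets; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(codons) < n_residues:
        raise ValueError("coding sequence is shorter than the protein sequence")

    assigned = []
    pos = 0  # index of the next unused codon
    for ch in aligned_protein:
        if ch == "-":
            assigned.append(None)  # gap column: no codon for this sequence
        else:
            assigned.append(codons[pos])
            pos += 1
    return assigned


# Toy example: three residues with one gap after the first residue.
print(codons_for_alignment("M-KT", "ATGAAAACC"))  # ['ATG', None, 'AAA', 'ACC']
```

The sketch does not check that each codon actually translates to the aligned amino acid; a real pipeline would probably want that consistency check as well.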
For each column of the multisequence alignment, a codon score S was evaluated. For every amino acid, the usage frequency of its codon was taken from the Codon Usage Database. These frequencies were multiplied, resulting in the column frequency α. Then all possible combinations of synonymous codons were determined and their respective frequencies multiplied, resulting in a frequency βi for each combination i (i = 1, ..., N). Each combination frequency βi was then compared to the column frequency α, and the number n of all βi ≤ α was determined. The score S of each column was evaluated by normalizing the number n by the number of all possible codon combinations N: S = n/N.
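The column score can be sketched in a few lines of Python. This is a hedged illustration rather than the original implementation: the function name `column_score`, the input layout (one pair of observed codon and frequency table per sequence) and the toy frequencies are assumptions, and a small relative tolerance is added so that floating-point rounding does not affect the comparison βi ≤ α.

```python
from itertools import product
from math import prod


def column_score(column):
    """Score S = n/N for one alignment column.

    column: list of (observed_codon, table) pairs, one per sequence, where
    table maps every synonymous codon of that amino acid to its usage
    frequency in the respective organism (hypothetical input layout).
    """
    # Column frequency alpha: product of the observed codon frequencies.
    alpha = prod(table[codon] for codon, table in column)

    n = 0      # combinations with frequency beta_i <= alpha
    total = 0  # number N of all possible synonymous combinations
    for freqs in product(*(table.values() for _, table in column)):
        total += 1
        if prod(freqs) <= alpha * (1 + 1e-12):
            n += 1
    return n / total


# Toy table with made-up frequencies for two synonymous codons.
toy = {"AGG": 0.1, "CGT": 0.9}
print(column_score([("AGG", toy), ("AGG", toy)]))  # 0.25, both codons rare
print(column_score([("CGT", toy), ("CGT", toy)]))  # 1.0, both codons frequent
```

Because every combination is enumerated, N grows exponentially with the number of sequences in the column, so this direct approach is only practical for small families.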
Small values of S correspond to a high percentage of rare codons. Thus, five groups were defined: group 1 of highly conserved rare codons with 0 ≤ S < 0.2, group 2 of conserved rare codons with 0.2 ≤ S < 0.4, group 3 with 0.4 ≤ S < 0.6, group 4 with 0.6 ≤ S < 0.8, and group 5 with 0.8 ≤ S ≤ 1. The number of columns belonging to each group was counted for each protein family and the total sum for each column group was determined (Tab. A3 in Additional file 3). From the total sums, the probability of each column group as well as the ratio between the groups was determined. To predict rare codon rich regions (RCRRs), a window of nine columns was analyzed by counting the numbers S1 and S2 of all columns belonging to group 1 and group 2, respectively. The columns of group 1 and group 2 account for 2.7% and 4.5%, respectively, of all columns, corresponding to a ratio of 1.7. A window score W was evaluated as a weighted sum of S1 and S2. Because group 2 columns were 1.7-fold more frequent than group 1 columns, they were weighted with a factor of 0.6:
W = S1 + 0.6 × S2
Thus each column of group 1 inside the window contributes a score of 1, while a column of group 2 contributes a smaller score of 0.6. Areas with a window score W ≥ 1.8 are designated as putative RCRRs, extending from the first contributing column to the last one (columns of group 1 or group 2). This threshold was chosen in order to avoid the detection of single columns from group 1 as a putative RCRR. Thus, a putative RCRR is predicted if at least 2 columns of group 1, 1 column of group 1 and 2 columns of group 2, or 3 columns of group 2 are found. For both cases, the probability of a random occurrence was estimated using a binomial distribution: the probability of finding 2 columns of group 1 in a window of 9 columns is 2%, and the probability of finding three or more columns of either group 1 or group 2 is 2%. Therefore, the probability of randomly finding a putative RCRR is 4%. Neighbouring RCRRs with a distance of less than 9 columns are merged, so merged RCRRs can exceed the initial window length of 9 columns. Each putative RCRR was evaluated for the quality of the local multisequence alignment by PLOTCON from the EMBOSS suite with the EBLOSUM62 matrix. To be accepted as an RCRR, the average PLOTCON score of a detected putative RCRR has to be at least 1.0. Thus, putative RCRRs that are located in highly variable regions were rejected.
RCRR: rare codon rich region.
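The window scan and the random-occurrence estimate can be sketched as follows. This is a minimal sketch under assumptions, not the original code: the function names are hypothetical, "distance" between neighbouring regions is taken as the gap between the end of one region and the start of the next, and the PLOTCON conservation filter is left out. Weights are stored in tenths (10 and 6) so that the threshold W ≥ 1.8 is tested with exact integer arithmetic.

```python
from math import comb


def find_putative_rcrrs(scores, window=9):
    """Return merged (first, last) column indices of putative RCRRs.

    scores: column scores S in alignment order; None marks unscored columns.
    """
    # Group 1 (S < 0.2) weighs 10 tenths, group 2 (0.2 <= S < 0.4) 6 tenths.
    weights = []
    for s in scores:
        if s is None or s >= 0.4:
            weights.append(0)
        elif s < 0.2:
            weights.append(10)
        else:
            weights.append(6)

    regions = []
    for start in range(max(len(weights) - window, 0) + 1):
        stop = min(start + window, len(weights))
        hits = [i for i in range(start, stop) if weights[i]]
        if sum(weights[i] for i in hits) >= 18:  # W >= 1.8
            regions.append((hits[0], hits[-1]))

    merged = []
    for first, last in sorted(regions):
        # Assumption: merge when the gap to the previous region is < window.
        if merged and first - merged[-1][1] < window:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def prob_at_least(k, n, p):
    """Binomial probability of at least k hits in n columns with hit rate p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


scores = [0.9] * 20
scores[3] = scores[5] = 0.1          # two group 1 columns in one window
print(find_putative_rcrrs(scores))   # [(3, 5)]
print(prob_at_least(2, 9, 0.027))          # two or more group 1 columns
print(prob_at_least(3, 9, 0.027 + 0.045))  # three or more group 1 or 2 columns
```

With the group frequencies of 2.7% and 4.5% quoted above, both probabilities should come out close to the 2% stated in the text, although the exact way the original estimate was set up is not specified.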
We thank the Federal Ministry of Education and Research (PTJ 0313434D) for financial support.
1. Ikemura T: Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol. 1981, 151 (3): 389-409. 10.1016/0022-2836(81)90003-6.
2. Varenne S, Buc J, Lloubes R, Lazdunski C: Translation is a non-uniform process. Effect of tRNA availability on the rate of elongation of nascent polypeptide chains. J Mol Biol. 1984, 180 (3): 549-576. 10.1016/0022-2836(84)90027-5.
3. Pedersen S: Escherichia coli ribosomes translate in vivo with variable rate. Embo J. 1984, 3 (12): 2895-2898.
4. Makoff AJ, Oxer MD, Romanos MA, Fairweather NF, Ballantine S: Expression of tetanus toxin fragment C in E. coli: high level expression by removing rare codons. Nucleic Acids Res. 1989, 17 (24): 10191-10202. 10.1093/nar/17.24.10191.
5. Zhou Z, Schnake P, Xiao L, Lal AA: Enhanced expression of a recombinant malaria candidate vaccine in Escherichia coli by codon optimization. Protein Expr Purif. 2004, 34 (1): 87-94. 10.1016/j.pep.2003.11.006.
6. Cortazzo P, Cervenansky C, Marin M, Reiss C, Ehrlich R, Deana A: Silent mutations affect in vivo protein folding in Escherichia coli. Biochem Biophys Res Commun. 2002, 293 (1): 537-541. 10.1016/S0006-291X(02)00226-7.
7. Crombie T, Swaffield JC, Brown AJ: Protein folding within the cell is influenced by controlled rates of polypeptide elongation. J Mol Biol. 1992, 228 (1): 7-12. 10.1016/0022-2836(92)90486-4.
8. Komar AA, Lesnik T, Reiss C: Synonymous codon substitutions affect ribosome traffic and protein folding during in vitro translation. FEBS Lett. 1999, 462 (3): 387-391. 10.1016/S0014-5793(99)01566-5.
9. Thanaraj TA, Argos P: Protein secondary structural types are differentially coded on messenger RNA. Protein Sci. 1996, 5 (10): 1973-1983.
10. Makhoul CH, Trifonov EN: Distribution of rare triplets along mRNA and their relation to protein folding. J Biomol Struct Dyn. 2002, 20 (3): 413-420.
11. Zhang S, Goldman E, Zubay G: Clustering of low usage codons and ribosome movement. J Theor Biol. 1994, 170 (4): 339-354. 10.1006/jtbi.1994.1196.
12. Fischer M, Pleiss J: The Lipase Engineering Database: a navigation and analysis tool for protein families. Nucleic Acids Res. 2003, 31 (1): 319-321. 10.1093/nar/gkg015.
13. Rangwala SH, Finn RF, Smith CE, Berberich SA, Salsgiver WJ, Stallings WC, Glover GI, Olins PO: High-level production of active HIV-1 protease in Escherichia coli. Gene. 1992, 122 (2): 263-269. 10.1016/0378-1119(92)90214-A.
14. Slimko EM, Lester HA: Codon optimization of Caenorhabditis elegans GluCl ion channel genes for mammalian cells dramatically improves expression levels. J Neurosci Methods. 2003, 124 (1): 75-81. 10.1016/S0165-0270(02)00362-X.
15. Yadava A, Ockenhouse CF: Effect of codon optimization on expression levels of a functionally folded malaria vaccine candidate in prokaryotic and eukaryotic expression systems. Infect Immun. 2003, 71 (9): 4961-4969. 10.1128/IAI.71.9.4961-4969.2003.
16. Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 2000, 28 (1): 292. 10.1093/nar/28.1.292.
17. Chou T, Lakatos G: Clustered bottlenecks in mRNA translation and protein synthesis. Phys Rev Lett. 2004, 93 (19): 198101. 10.1103/PhysRevLett.93.198101.
18. Thanaraj TA, Argos P: Ribosome-mediated translational pause and protein domain organization. Protein Sci. 1996, 5 (8): 1594-1612.
19. Clarke J, Cota E, Fowler SB, Hamill SJ: Folding studies of immunoglobulin-like beta-sandwich proteins suggest that they share a common folding pathway. Structure. 1999, 7 (9): 1145-1153. 10.1016/S0969-2126(99)80181-6.
20. Kragelund BB, Hojrup P, Jensen MS, Schjerling CK, Juul E, Knudsen J, Poulsen FM: Fast and one-step folding of closely and distantly related homologous proteins of a four-helix bundle family. J Mol Biol. 1996, 256 (1): 187-200. 10.1006/jmbi.1996.0076.
21. Ropson IJ, Yowler BC, Dalessio PM, Banaszak L, Thompson J: Properties and crystal structure of a beta-barrel folding mutant. Biophys J. 2000, 78 (3): 1551-1560.
22. Widmann M, Christen P: Differential effects of molecular chaperones on refolding of homologous proteins. FEBS Lett. 1995, 377 (3): 481-484. 10.1016/0014-5793(95)01406-3.
23. Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE: Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res. 2005, 33 (4): 1141-1153. 10.1093/nar/gki242.
24. dos Reis M, Savva R, Wernisch L: Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 2004, 32 (17): 5036-5044. 10.1093/nar/gkh834.
25. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
26. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
27. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science. 1992, 256 (5062): 1443-1445. 10.1126/science.1604319.
28. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.