Highly expressed proteins have an increased frequency of alanine in the second amino acid position
BMC Genomics volume 7, Article number: 28 (2006)
Although the sequence requirements for translation initiation regions have been frequently analysed, usually the highly expressed genes are not treated as a separate dataset.
To investigate this, we analysed the mRNA regions downstream of initiation codons in nine bacteria, three archaea and three unicellular eukaryotes, comparing the dataset of highly expressed genes to the dataset of all genes. In addition to the detailed analysis of the nucleotide and codon frequencies we compared the N-termini of highly expressed proteins to the N-termini of all proteins coded in the genome.
The most conserved pattern was observed at the amino acid level: strong alanine over-representation was observed at the second amino acid position of highly expressed proteins. This pattern is well conserved in all three domains of life.
Initiation of translation is the basic determinant of translation efficiency. In bacteria the small ribosomal subunit, in complex with several initiation factors, directly recognizes the translation initiation region (TIR) in mRNA. Determinants important for recognition of the TIR are located between positions -20 and +15, and include mRNA secondary structure, the purine-rich Shine-Dalgarno region (SD) (AGGAGG in Escherichia coli) [2–4], the S1 protein binding A/U-rich enhancer [4–6], the spacing between the SD and the start codon [7, 8], the base immediately preceding the initiation codon and the identity of the start codon. These sequence motifs are directly involved in recruiting the initiating ribosomes. In addition, it has been found that codon usage at the beginning of open reading frames is non-random due to selective pressure for efficient gene expression [11, 12], although the precise nature of this pressure remains obscure. A 15–20-fold effect on the level of gene expression can be obtained by varying the codon following the initiation codon in the mRNA coding sequence; in E. coli AAA is the most common and most expression-promoting codon in position +2. The overall preference for G-starting codons is also positively correlated with gene expression level in E. coli. On the other hand, NGG codons give strongly reduced gene expression. The preference for A exists in about 20–30 nucleotide positions at the beginning of E. coli genes. Suggestions that the downstream region influences translation initiation by mRNA-rRNA complementary base pairing failed to gain experimental support [17, 18]. It has been shown that all single-stranded regions of 16S rRNAs have very high A content [19, 20] despite differences in genomic GC%. Therefore it has been suggested that mRNA rich in A residues is unstructured, and thus favourable for translation initiation [16, 21, 22].
In eukaryotes the small ribosomal subunit, in complex with several initiation factors and initiator tRNA, first recognizes the 5' end of mRNA and then scans to the initiation codon [23, 24]. The efficiency of translation initiation is reduced if the sequence surrounding the AUG codon deviates significantly from certain preferred nucleotides. For example, in Saccharomyces cerevisiae the nucleotide context after the initiation codon in highly expressed genes has been shown to be AUG UC(U/C) [25–27].
The translation initiation mechanism of archaea is not clearly understood. Archaeal translation has both bacterial and eukaryotic characteristics [28–30]. Archaeal translation initiation factors are homologous to those of eukaryotes [31, 32]. On the other hand, calculations of the free energy of base pairing between the 3' end of 16S rRNA and the 5' UTR of mRNA in Archaeoglobus fulgidus, Methanococcus jannaschii and Methanobacterium thermoautotrophicum have shown a reduction in free energy before the start codon; the patterns are similar to those of bacteria, but not to that of Saccharomyces cerevisiae, indicating the presence of a possible Shine-Dalgarno sequence in archaea. Some archaea, such as Sulfolobus solfataricus, use two distinct mechanisms for translational initiation: SD-dependent initiation operates on distal cistrons of polycistronic mRNAs, whereas 'leaderless' initiation operates on monocistronic mRNAs and on opening cistrons of polycistronic mRNAs, which start directly with the initiation codon.
Currently the genome sequences of many bacteria, archaea and eukaryotes are available. This provides a powerful tool for reconsidering the role of mRNA sequences in initiation of translation. As described above, there is evidence that the mRNA sequence immediately following the initiation codon can influence the efficiency of translation. We analysed the nucleotide preference downstream from the initiation codon in the genomes of 9 bacteria, 3 archaea and 3 unicellular eukaryotes. In addition to the detailed analysis of the nucleotide and codon frequencies we compared the N-termini of highly expressed proteins to the N-termini of all proteins coded in the genome. In contrast to many previous studies we have analysed the highly expressed genes as a separate dataset. This analysis identified sequence patterns in highly expressed genes, universal in all three domains of life.
Adenosine frequencies at the beginning of E. coli ORFs
To study the sequence preferences at the beginnings of Escherichia coli ORFs we counted the nucleotide frequencies per codon, because the codon is the functional unit in translation. In the first analysis we studied the A content in all genes in the genome. Our results showed that the beginnings of ORFs had an increased frequency of adenosine (A) (Fig. 1A). This is in agreement with previous observations that E. coli has a tendency towards A-rich sequences at the 3'-side of the initiation codon. It has been suggested that this phenomenon is explained by the need to decrease the stability of mRNA secondary structure in the initiation site [13, 16].
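The per-codon counting described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function name `nucleotide_fraction_by_codon`, the toy sequences and the decision to skip ORFs that are too short for a given codon are all assumptions made here.

```python
def nucleotide_fraction_by_codon(orfs, nucleotide="A", n_codons=5):
    """Return, for each of the first n_codons codons, the fraction of
    nucleotides equal to `nucleotide`, pooled over all ORFs.

    Codon 1 is the initiation codon, codon 2 covers nucleotides 4-6, etc.
    RNA letters are mapped to DNA letters (U -> T) before counting.
    """
    target = nucleotide.upper().replace("U", "T")
    fractions = []
    for codon_index in range(n_codons):
        start = 3 * codon_index
        hits = 0
        total = 0
        for orf in orfs:
            codon = orf[start:start + 3].upper().replace("U", "T")
            if len(codon) < 3:
                continue  # ORF too short for this codon (assumed policy)
            hits += codon.count(target)
            total += 3
        fractions.append(hits / total if total else 0.0)
    return fractions


# Toy sequences, invented for illustration only.
toy_orfs = ["ATGAAAGCA", "ATGGCTAAA", "ATGGCAACC"]
toy_fractions = nucleotide_fraction_by_codon(toy_orfs, "A", n_codons=3)
```

Running the same function on the HEG dataset and on the all genes dataset would give the two per-codon profiles that are compared in the text.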
The nucleotide preference pattern is expected to be even more pronounced in the most highly expressed genes. The codon adaptation index (CAI) characterizes how similar the synonymous codon usage of a given gene is to that of highly expressed genes. CAI values vary between 0 and 1. A CAI value of 1 is achieved when all amino acids in a given gene are coded by the best codon in each synonymous codon family. The correlation between codon adaptation index and expression level is well documented. Therefore, A frequencies in 80 highly expressed genes (HEG), defined by the highest CAI values, were analysed. The frequency of A was 1.3 times higher in codons 3–5 of HEG than in the dataset of all genes (P = 2.2E-06). In contrast, there was no increase in the frequency of A in the second codon. Rather, the frequency of A was 1.3 times lower than in the dataset of all genes, although the statistical significance of the decrease was low (P = 0.079) (Fig. 1A).
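To make the CAI definition concrete, the sketch below computes relative adaptiveness values from a reference set and then the CAI of a gene as the geometric mean of those values. It is an illustration under stated assumptions rather than a reimplementation of CodonW: the function names, the toy reference set and the pseudocount of 0.5 for unobserved codons are choices made here, and the standard genetic code is assumed.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(gene):
    """Split a coding sequence into complete codons (U is mapped to T)."""
    gene = gene.upper().replace("U", "T")
    return [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]


def relative_adaptiveness(reference_genes):
    """Weight of each codon = its count / count of the most used synonym."""
    counts = Counter()
    for gene in reference_genes:
        counts.update(split_codons(gene))
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)
    weights = {}
    for amino_acid, codons in families.items():
        if amino_acid == "*" or len(codons) == 1:
            continue  # stop codons and single-codon families carry no bias
        best = max(counts[c] for c in codons)
        if best == 0:
            continue  # amino acid absent from the reference set
        for c in codons:
            # Pseudocount of 0.5 for unobserved codons (assumption).
            weights[c] = max(counts[c], 0.5) / best
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of the scorable codons in `gene`."""
    logs = [math.log(weights[c]) for c in split_codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


# Toy reference set, invented for illustration only.
toy_weights = relative_adaptiveness(["GCTGCTGCTGCA"])
cai_best = cai("ATGGCTGCT", toy_weights)
cai_mixed = cai("ATGGCTGCA", toy_weights)
```

In the toy example the gene that uses only the most frequent alanine codon of the reference set reaches a CAI of 1, while the gene that mixes in a rarer synonym scores lower.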
To ensure that the difference in A nucleotide frequencies between codon 2 and codons 3–5 is related to the expression level of genes, the following analysis was performed: E. coli genes were subdivided into seven groups based on their CAI values and A usage was compared among those groups. This analysis indicated that the preference for A in the second codon decreased only in the group of most highly expressed genes. In contrast, there was a positive trend between CAI value and the frequency of A in codons 3–5 (Fig. 1B).
Nucleotide usage at the beginning of ORFs in different organisms
This pattern of the A nucleotide frequency can be specific to E. coli or it can be a more general phenomenon. In addition, the decrease of A in the second codon of HEG may be the result of regular changes in the frequencies of other nucleotides. To answer these questions, we analysed the nucleotide preference downstream from the initiation codon in the genomes of 9 bacteria, 3 archaea and 3 eukaryotes using the datasets of all genes and the most highly expressed genes.
The studied bacteria cover a wide range of genome sizes (from 0.5 Mbp in M. genitalium to 4.6 Mbp in E. coli), genomic GC% (from 28.6 % in B. burgdorferi to 65.6 % in M. tuberculosis), natural living environments (from parasites to free-living organisms) and maximal growth rates. The HEG datasets for bacteria other than E. coli were compiled on the assumption that functional conservation implies conservation of relative gene expression level, a method used successfully in previous work. Accordingly, the HEG datasets consisted of orthologues to 80 HEG of E. coli (Additional file 1: Orthologues). The genomes of 3 archaea were also studied. The HEG datasets of archaea were compiled from orthologues to both 80 HEG of E. coli and 80 HEG of S. cerevisiae (Additional file 1: Orthologues). In addition to prokaryotes, we analysed the genomes of three eukaryotes. In multicellular organisms the codon usage pattern could differ between tissues, possibly creating complexities that we could not treat in an appropriate manner. Therefore we confined our study to unicellular organisms: the yeasts S. cerevisiae and S. pombe and the malaria parasite P. falciparum. The HEG datasets of S. pombe and P. falciparum consisted of orthologues to 80 HEG of S. cerevisiae (Additional file 1: Orthologues).
Analysing the changes in nucleotide content of HEGs, we observed that the frequency of C in the fifth nucleotide of HEG (C5, corresponding to the second nucleotide of the second codon), was increased when compared to the all genes dataset (Fig. 2). In 14 of the 15 analysed genomes the increase was significant (P < 0.01). Only M. genitalium had no significant increase in the frequency of C5.
We also observed that the frequency of G in the fourth nucleotide of HEG (G4, corresponding to the first nucleotide of the second codon) tended to be increased in all studied genomes, although the increase was less pronounced. In 11 of the 15 genomes the increase was significant (P < 0.01). The increase of G4 and C5 was mostly accompanied by a decrease of A, but in some cases also by a decrease in the frequencies of other nucleotides (Fig. 2, Additional file 2: Pvalues).
In contrast to the increased frequency of G and C in the second codon, a significant increase of A in codons 3–5 (nucleotides 7–15) of HEG occurred in M. tuberculosis (P = 2.7E-06) as well as in E. coli (P = 2.2E-06) (Fig. 3). This phenomenon might be related to the need to decrease the stability of mRNA secondary structure in the initiation site [13, 16]. Although this tendency is strong in E. coli (Fig. 1A) and M. tuberculosis, it is poorly conserved in the other studied genomes (Fig. 3).
The observed nucleotide usage pattern suggests a preference for GCN as the second codon in HEG. Therefore, we compared the codon and amino acid usage at the beginnings of HEG with the beginnings of all genes (Table 1). Indeed, 11 of 15 organisms had a significantly (P < 0.01) increased frequency of one of the GCN codons in the second codon position. The sequence following the second codon (codons 3–5) showed no common preference for particular codons across organisms. None of the codons was significantly avoided at the beginning of HEG.
The increase in the frequency of GCN codons in the second codon position could be the result of the increased frequency of G4 and C5 (Additional file 2: Pvalues). In this case the overrepresentation of G4 and C5 would be independent of each other. Alternatively, the preference for GCN codons would create a nucleotide usage pattern in which overrepresentation of G4 is correlated with the increased frequency of C5. To distinguish between these possibilities, we removed the genes with GNN codons in the second position from our analysis and tested for an increased frequency of C5 in the remaining datasets by comparing HEG to all genes in the genome. Similarly, we removed the genes with NCN codons in the second position and tested for an increased frequency of G4 in the remaining datasets. A slight overrepresentation of G4 and C5 was observed (Table 2), although this was much weaker than the overrepresentation of GCN codons in the second codon position of HEG (Table 1). The weak preference for C5-codons or G4-codons other than GCN can also be seen in the codon usage analysis (Table 1). Nevertheless, the preference for other codons is generally weaker than, and not as conserved between species as, the preference for GCN codons.
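The filtering step described above can be sketched as follows. The function `conditional_frequency` and the toy sequences are hypothetical and only illustrate the logic: genes carrying a given base at one nucleotide position are removed, and the frequency of another base at a second position is computed in what remains.

```python
def conditional_frequency(orfs, exclude_pos, exclude_base, test_pos, test_base):
    """Frequency of `test_base` at `test_pos` among ORFs that do NOT carry
    `exclude_base` at `exclude_pos`. Positions are 1-based nucleotide
    positions counted from the first base of the initiation codon.

    Returns (frequency, number of ORFs kept).
    """
    needed = max(exclude_pos, test_pos)
    kept = []
    for orf in orfs:
        seq = orf.upper().replace("U", "T")
        if len(seq) < needed:
            continue  # too short to evaluate (assumed policy)
        if seq[exclude_pos - 1] != exclude_base:
            kept.append(seq)
    if not kept:
        return 0.0, 0
    hits = sum(1 for seq in kept if seq[test_pos - 1] == test_base)
    return hits / len(kept), len(kept)


# Toy sequences, invented for illustration only.
toy_heg = ["ATGGCA", "ATGTCT", "ATGACC", "ATGAAA"]

# C5 frequency after removing genes with a GNN second codon (G at position 4).
c5_freq, c5_n = conditional_frequency(toy_heg, 4, "G", 5, "C")

# G4 frequency after removing genes with an NCN second codon (C at position 5).
g4_freq, g4_n = conditional_frequency(toy_heg, 5, "C", 4, "G")
```

Applying the function to both the HEG and the all genes datasets yields the two frequencies, together with the counts behind them, that enter the significance test.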
In most cases, the usage of different GCN codons did not differ significantly between the second codon and the other positions in HEG (Table 3), indicating that there is selection pressure to have the amino acid alanine, rather than any specific alanine codon, at that position. In some genomes the frequency of GCA codons was significantly increased in the second position. This might relate to the increased frequency of A in codons 3–5 that is observed in at least some bacteria (Fig. 3).
Amino acid preferences
In all studied organisms except M. genitalium the frequency of alanine as the second residue in highly expressed proteins was increased (Table 4). The overall preference for other amino acids in this position was also similar in different organisms: preferred amino acids in addition to alanine were glycine and serine. In positions 3, 4, 5 the amino acid preference pattern was not conserved. Still, in four genomes an increased frequency of positively charged amino acids was observed in at least one of these positions. This might be caused by the increased A frequency, as the lysine codons (AAG and AAA) and the overrepresented arginine codons (AGA) are A-rich. It is still interesting that codons AAG and AAA have been chosen from the set of all A-rich codons (Table 1). For example, the codons for asparagine (AAU and AAC) are not overrepresented in positions 3, 4 or 5.
We have found that HEG, when compared to the all genes dataset, contain an increased frequency of G in the 4th position and C in the 5th position of the ORFs (Fig. 2). This tendency is correlated with the codon usage in the second position of HEG, where an increased frequency of codons with G in the first and C in the second position is observed (Table 1). The amino acid usage pattern of the proteins coded by the HEG was even stronger: strong alanine (coded by the GCN codon family) overrepresentation was observed at the second amino acid position of highly expressed proteins (Table 4). Moreover, the increased frequency of alanine is observed in all genomes analysed except M. genitalium, suggesting a universal feature of highly expressed genes. Additional information about the selection of genomes for the study and the identification of HEGs is presented in Additional files 3 and 4.
The influence of the nucleotides downstream from the initiation codon on the level of gene expression has been previously recognized in both bacteria and eukaryotes. On the other hand, no general characteristic of HEG has been previously recognized. As bacteria and eukaryotes use different mechanisms for initiation of translation, it has been anticipated that the initiation context effects are different. Our studies reveal a general pattern present in all three domains of life. What could be the reason for the observed nucleotide and amino acid usage pattern?
Effects of the second codon
It has been previously shown that the second codon can influence the expression level of a gene. In S. cerevisiae the UCU codon is associated with increased expression level. This is consistent with our observation that the frequency of the UCU codon is increased in HEG, although the increase of the GCN codons is most prominent (Table 1). In E. coli it has been shown that NGG codons cause a low expression level. This is reflected in the strongly decreased frequency of G as the third nucleotide of the second codon of HEG (Fig. 2). In E. coli the AAA codon has been associated with a high expression level. Therefore it is rather surprising that the frequency of AAA is not increased in the second codon position of HEG. Moreover, the frequency of A as the first nucleotide of the second codon is even decreased (Fig. 2). Similarly, a decreased frequency of A in the first or second position of the second codon is present in the HEGs of four other bacteria and all three eukaryotes analysed (Fig. 2).
One possible explanation might be related to the drop-off frequency of peptidyl-tRNA from the ribosome. It has been observed that during protein synthesis peptidyl-tRNA can sometimes dissociate from the ribosome instead of being separated into the protein and deaminoacylated tRNA in the termination reaction [46, 47]. If this drop-off reaction is very efficient, the enzyme responsible for recycling of peptidyl-tRNA, peptidyl-tRNA hydrolase, will be saturated. The tRNAs will therefore accumulate in the peptidyl-tRNA form, and the resulting shortage of deaminoacylated tRNA will not allow efficient translation [47–50]. The rate of this drop-off reaction depends on the length of the nascent peptide chain and on the codon. The shorter the peptide chain, the more efficient the drop-off is. The peptidyl-tRNAs reading codons with A nucleotides in the first or second position are most prone to drop-off. Therefore it is expected that A-rich codons at the beginning of ORFs could cause a high frequency of peptidyl-tRNA drop-off. As translation of the HEG accounts for most of the protein synthesis activity of the cell, A-rich codons might be avoided at the beginning of the ORFs to decrease the amount of drop-off products. Similarly, the GCN codons might be important for stabilizing the dipeptidyl-tRNA on the ribosome.
It is important to note that when the influence of the second codon on gene expression has been studied, the amount of the protein product has been measured. Another important parameter for understanding the role of different sequence elements might be the influence of protein overexpression on cell growth. If some of the analysed sequences cause a high level of peptidyl-tRNA drop-off, growth would be inhibited. Even a small inhibition of growth might be selected against on an evolutionary scale and therefore influence the choice of sequences in HEG.
Effects of the second amino acid
The first few N-terminal amino acid residues modulate the stability of proteins and determine the cleavage of N-terminal formyl-methionine (or methionine in eukaryotes) [53–55]. It is possible that the observed nucleotide and codon preferences in highly expressed genes are caused by a preference for these cleavage-promoting amino acids. The rules for formyl-methionine (or methionine) cleavage are similar in bacteria and eukaryotes [56, 57]: the initiating amino acid is cleaved if the second residue is alanine, glycine, proline, serine, threonine or valine. According to the N-end rule, all six of these amino acid residues are stabilizing in bacteria and also in S. cerevisiae. Alanine, glycine and serine occurred as favourable amino acids in the second position, at least in some organisms (Table 4). We counted the number of proteins containing Ala, Gly, Pro, Ser, Thr or Val residues in the second position in the HEG and all genes datasets (Table 5). This analysis illustrates that the genes coding for proteins with cleavage-determining and stabilizing residues in the second position are enriched within HEG. This is consistent with the high number of housekeeping genes in the HEG dataset. If the observed sequence trends are caused by selection for cleavage-promoting and stabilizing amino acids at the beginning of HEG-coded proteins, then it is interesting to note that alanine has been chosen from the set of six amino acids with similar properties. It is possible that the other amino acids are not as efficient as alanine in directing removal of the initiating amino acid and/or promoting protein stability.
In addition, it is also possible that alanine in the second position of the protein assists the entrance of the nascent peptide chain into the ribosomal tunnel. In this context it is interesting to note that in addition to the increased frequency of alanine in the second position, the highly expressed proteins of several organisms contain positively charged amino acids in positions 3, 4 or 5 (Table 4). Therefore, N-termini of highly expressed proteins tend to have special characteristics that might influence their interaction with the ribosome.
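The count of proteins with Ala, Gly, Pro, Ser, Thr or Val in the second position can be illustrated with a short sketch. The function name and the toy protein sequences below are hypothetical; sequences are assumed to be given in one-letter code starting with the initiating methionine.

```python
# Second-position residues listed in the text as promoting removal of the
# initiating (formyl-)methionine: Ala, Gly, Pro, Ser, Thr, Val.
CLEAVAGE_PROMOTING = frozenset("AGPSTV")


def count_cleavage_promoting(proteins):
    """Return (number of proteins whose second residue is in the set,
    number of proteins long enough to have a second residue)."""
    eligible = [p for p in proteins if len(p) >= 2]
    hits = sum(1 for p in eligible if p[1].upper() in CLEAVAGE_PROMOTING)
    return hits, len(eligible)


# Toy protein sequences, invented for illustration only.
toy_proteins = ["MAKK", "MSLL", "MKRT", "MGA", "M"]
toy_hits, toy_total = count_cleavage_promoting(toy_proteins)
```

The pair of counts obtained for the HEG dataset and for the all genes dataset is the kind of input that a comparison such as Table 5 summarizes.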
Strong alanine over-representation was observed at the second amino acid position of highly expressed proteins. This pattern is well conserved in all three domains of life.
The protein coding sequences of the following 9 bacteria, 3 archaea and 3 eukaryotes were retrieved from GenBank: Escherichia coli K12 [GenBank:NC_000913], Bacillus subtilis [GenBank:NC_000964], Haemophilus influenzae [GenBank:NC_000907], Helicobacter pylori 26695 [GenBank:NC_000915], Mycobacterium tuberculosis H37Rv [GenBank:NC_000962], Treponema pallidum [GenBank:NC_000919], Rickettsia prowazekii [GenBank:NC_000963], Mycoplasma genitalium [GenBank:NC_000908], Borrelia burgdorferi [GenBank:NC_001318], Methanococcus jannaschii [GenBank:NC_000909], Archaeoglobus fulgidus [GenBank:NC_000917], Pyrococcus horikoshii [GenBank:NC_000961], Saccharomyces cerevisiae [GenBank:NC_001133-GenBank:NC_001148], Schizosaccharomyces pombe [GenBank:NC_003421, GenBank:NC_003423-GenBank:NC_003424], Plasmodium falciparum [GenBank:NC_000521, GenBank:NC_000910, GenBank:NC_004314-GenBank:NC_004318, GenBank:NC_004325-GenBank:NC_004331]. Two different datasets were compiled from each genome: one containing highly expressed genes (HEG) and the other consisting of all genes of the corresponding organism.
Highly expressed genes
For the datasets of 80 HEG of E. coli and S. cerevisiae we chose the 80 genes with the highest codon adaptation index (CAI), which was calculated using the program CodonW. Calculation of CAI is based on a reference dataset of highly expressed genes that includes genes coding for ribosomal proteins, outer membrane proteins, elongation factors, heat shock proteins and RNA polymerase subunits. The HEG datasets for the rest of the studied organisms were compiled on the assumption that functional conservation implies conservation of relative gene expression level, a method used successfully in previous work. Therefore, the HEG dataset for the rest of the studied bacteria consisted of orthologues to those 80 HEG of E. coli; the HEG datasets of S. pombe and P. falciparum consisted of orthologues to 80 HEG of S. cerevisiae; and the HEG datasets of archaea were compiled from orthologues to both 80 HEG of E. coli and 80 HEG of S. cerevisiae (Additional file 1: Orthologues). Orthologues were found by comparing two genomes with a reciprocal BLAST search and selecting mutually best hits using the program INPARANOID.
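The idea of selecting mutually best hits from a reciprocal search can be illustrated with the simplified sketch below. It is not INPARANOID, which applies additional steps beyond plain mutual best hits; the input format (a dictionary mapping query and subject identifiers to a similarity score), the function names and the gene identifiers are assumptions made for illustration.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score, for example
    a bit score taken from a search report (assumed input format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )


# Toy scores with hypothetical gene identifiers.
toy_a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 120, ("a2", "b2"): 700, ("a3", "b2"): 90}
toy_b_vs_a = {("b1", "a1"): 480, ("b2", "a2"): 690, ("b2", "a1"): 100}
toy_pairs = mutual_best_hits(toy_a_vs_b, toy_b_vs_a)
```

In this toy case `a3` is left without an orthologue because its best hit `b2` points back to `a2`, which is the behaviour expected of a mutual best hit criterion.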
We used a two-tailed Fisher's exact test (FET) to compare the observed frequencies of a nucleotide, codon or amino acid in the HEG dataset and the all genes dataset (the all genes dataset contains HEG as a subset). FET examines whether the frequencies in the two datasets are different enough to reject the null hypothesis (the exact meaning of the null hypothesis for each analysis is described in the figure legends). In all figures and tables P-values of 0.01 or less were considered significant. No correction for multiple testing was applied to any of the analyses.
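For readers who wish to reproduce this kind of comparison, a two-tailed Fisher's exact test on a 2x2 table can be written in a few lines of Python. The sketch below is an illustration and not the software used in this study; the two-tailed P-value is taken as the sum of the probabilities of all tables with the same margins that are no more probable than the observed one, which is one common definition. The example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two datasets (e.g. HEG and all genes); columns are counts
    with and without the feature (e.g. alanine in the second position).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denominator = comb(total, col1)

    def probability(x):
        # Hypergeometric probability of x feature-carrying items in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = sum(
        p
        for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Small worked table: 3 of 4 versus 1 of 4.
p_small = fisher_exact_two_tailed(3, 1, 1, 3)

# Hypothetical counts, for illustration only: 30 of 80 genes in one dataset
# and 400 of 4000 in the other carry the feature.
p_example = fisher_exact_two_tailed(30, 50, 400, 3600)
is_significant = p_example <= 0.01
```

With the threshold used here, a comparison would be reported as significant when the returned P-value is 0.01 or less.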
Stormo GD, Schneider TD, Gold LM: Characterization of translational initiation sites in E. coli. Nucleic Acids Res. 1982, 10 (9): 2971-2996.
Shine J, Dalgarno L: The 3'-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc Natl Acad Sci U S A. 1974, 71 (4): 1342-1346.
Sakai H, Imamura C, Osada Y, Saito R, Washio T, Tomita M: Correlation between Shine-Dalgarno sequence conservation and codon usage of bacterial genes. J Mol Evol. 2001, 52 (2): 164-170.
Komarova AV, Tchufistova LS, Supina EV, Boni IV: Protein S1 counteracts the inhibitory effect of the extended Shine-Dalgarno sequence on translation. RNA. 2002, 8 (9): 1137-1147. 10.1017/S1355838202029990.
Boni IV, Isaeva DM, Musychenko ML, Tzareva NV: Ribosome-messenger recognition: mRNA target sites for ribosomal protein S1. Nucleic Acids Res. 1991, 19 (1): 155-162.
Zhang J, Deutscher MP: A uridine-rich sequence required for translation of prokaryotic mRNA. Proc Natl Acad Sci U S A. 1992, 89 (7): 2605-2609.
Chen H, Bjerknes M, Kumar R, Jay E: Determination of the optimal aligned spacing between the Shine-Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucleic Acids Res. 1994, 22 (23): 4953-4957.
Shultzaberger RK, Bucheimer RE, Rudd KE, Schneider TD: Anatomy of Escherichia coli ribosome binding sites. J Mol Biol. 2001, 313 (1): 215-228. 10.1006/jmbi.2001.5040.
Esposito D, Fey JP, Eberhard S, Hicks AJ, Stern DB: In vivo evidence for the prokaryotic model of extended codon-anticodon interaction in translation initiation. EMBO J. 2003, 22 (3): 651-656. 10.1093/emboj/cdg072.
Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information content of binding sites on nucleotide sequences. J Mol Biol. 1986, 188 (3): 415-431. 10.1016/0022-2836(86)90165-8.
Chen GF, Inouye M: Suppression of the negative effect of minor arginine codons on gene expression; preferential usage of minor codons within the first 25 codons of the Escherichia coli genes. Nucleic Acids Res. 1990, 18 (6): 1465-1473.
Ohno H, Sakai H, Washio T, Tomita M: Preferential usage of some minor codons in bacteria. Gene. 2001, 276 (1-2): 107-115. 10.1016/S0378-1119(01)00670-9.
Stenstrom CM, Jin H, Major LL, Tate WP, Isaksson LA: Codon bias at the 3'-side of the initiation codon is correlated with translation initiation efficiency in Escherichia coli. Gene. 2001, 263 (1-2): 273-284. 10.1016/S0378-1119(00)00550-3.
Gutierrez G, Marquez L, Marin A: Preference for guanosine at first codon position in highly expressed Escherichia coli genes. A relationship with translational efficiency. Nucleic Acids Res. 1996, 24 (13): 2525-2527. 10.1093/nar/24.13.2525.
Gonzalez de Valdivia EI, Isaksson LA: A codon window in mRNA downstream of the initiation codon where NGG codons give strongly reduced gene expression in Escherichia coli. Nucleic Acids Res. 2004, 32 (17): 5198-5205. 10.1093/nar/gkh857.
Rocha EP, Danchin A, Viari A: Translation in Bacillus subtilis: roles and trends of initiation and termination, insights from a genome analysis. Nucleic Acids Res. 1999, 27 (17): 3567-3576. 10.1093/nar/27.17.3567.
Firpo MA, Dahlberg AE: The importance of base pairing in the penultimate stem of Escherichia coli 16S rRNA for ribosomal subunit association. Nucleic Acids Res. 1998, 26 (9): 2156-2160. 10.1093/nar/26.9.2156.
O'Connor M, Asai T, Squires CL, Dahlberg AE: Enhancement of translation by the downstream box does not involve base pairing of mRNA with the penultimate stem sequence of 16S rRNA. Proc Natl Acad Sci U S A. 1999, 96 (16): 8973-8978. 10.1073/pnas.96.16.8973.
Wang HC, Hickey DA: Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res. 2002, 30 (11): 2501-2507. 10.1093/nar/30.11.2501.
Gutell RR, Weiser B, Woese CR, Noller HF: Comparative anatomy of 16-S-like ribosomal RNA. Prog Nucleic Acid Res Mol Biol. 1985, 32: 155-216.
Stenstrom CM, Isaksson LA: Influences on translation initiation and early elongation by the messenger RNA region flanking the initiation codon at the 3' side. Gene. 2002, 288 (1-2): 1-8. 10.1016/S0378-1119(02)00501-2.
Eyre-Walker A, Bulmer M: Reduced synonymous substitution rate at the start of enterobacterial genes. Nucleic Acids Res. 1993, 21 (19): 4599-4603.
Kozak M: Comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles. Microbiol Rev. 1983, 47 (1): 1-45.
Kozak M: The scanning model for translation: an update. J Cell Biol. 1989, 108 (2): 229-241. 10.1083/jcb.108.2.229.
Hamilton R, Watanabe CK, de Boer HA: Compilation and comparison of the sequence context around the AUG startcodons in Saccharomyces cerevisiae mRNAs. Nucleic Acids Res. 1987, 15 (8): 3581-3593.
Miyasaka H: The positive relationship between codon usage bias and translation initiation AUG context in Saccharomyces cerevisiae. Yeast. 1999, 15 (8): 633-637. 10.1002/(SICI)1097-0061(19990615)15:8<633::AID-YEA407>3.0.CO;2-O.
Fuglsang A: Bioinformatic analysis of the link between gene composition and expressivity in Saccharomyces cerevisiae and Schizosaccharomyces pombe. Antonie Van Leeuwenhoek. 2004, 86 (2): 135-147. 10.1023/B:ANTO.0000036119.00001.3b.
Bell SD, Jackson SP: Transcription and translation in Archaea: a mosaic of eukaryal and bacterial features. Trends Microbiol. 1998, 6 (6): 222-228. 10.1016/S0966-842X(98)01281-5.
Kapp LD, Lorsch JR: The molecular mechanics of eukaryotic translation. Annu Rev Biochem. 2004, 73: 657-704. 10.1146/annurev.biochem.73.030403.080419.
Dennis PP: Ancient ciphers: translation in Archaea. Cell. 1997, 89 (7): 1007-1010. 10.1016/S0092-8674(00)80288-3.
Kyrpides NC, Woese CR: Archaeal translation initiation revisited: the initiation factor 2 and eukaryotic initiation factor 2B alpha-beta-delta subunit families. Proc Natl Acad Sci U S A. 1998, 95 (7): 3726-3730. 10.1073/pnas.95.7.3726.
Kyrpides NC, Woese CR: Universally conserved translation initiation factors. Proc Natl Acad Sci U S A. 1998, 95 (1): 224-228. 10.1073/pnas.95.1.224.
Osada Y, Saito R, Tomita M: Analysis of base-pairing potentials between 16S rRNA and 5' UTR for translation initiation in various prokaryotes. Bioinformatics. 1999, 15 (7-8): 578-581. 10.1093/bioinformatics/15.7.578.
Benelli D, Maone E, Londei P: Two different mechanisms for ribosome/mRNA interaction in archaeal translation initiation. Mol Microbiol. 2003, 50 (2): 635-643. 10.1046/j.1365-2958.2003.03721.x.
Sharp PM, Li WH: The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15 (3): 1281-1295.
Jansen R, Bussemaker HJ, Gerstein M: Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res. 2003, 31 (8): 2242-2251. 10.1093/nar/gkg306.
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, Fritchman RD, Weidman JF, Small KV, Sandusky M, Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Hutchison CA, Venter JC: The minimal gene complement of Mycoplasma genitalium. Science. 1995, 270 (5235): 397-403.
Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of Escherichia coli K-12. Science. 1997, 277 (5331): 1453-1474. 10.1126/science.277.5331.1453.
Fraser CM, Casjens S, Huang WM, Sutton GG, Clayton R, Lathigra R, White O, Ketchum KA, Dodson R, Hickey EK, Gwinn M, Dougherty B, Tomb JF, Fleischmann RD, Richardson D, Peterson J, Kerlavage AR, Quackenbush J, Salzberg S, Hanson M, van Vugt R, Palmer N, Adams MD, Gocayne J, Weidman J, Utterback T, Watthey L, McDonald L, Artiach P, Bowman C, Garland S, Fuji C, Cotton MD, Horst K, Roberts K, Hatch B, Smith HO, Venter JC: Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature. 1997, 390 (6660): 580-586. 10.1038/37551.
Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry CE, Tekaia F, Badcock K, Basham D, Brown D, Chillingworth T, Connor R, Davies R, Devlin K, Feltwell T, Gentles S, Hamlin N, Holroyd S, Hornsby T, Jagels K, Krogh A, McLean J, Moule S, Murphy L, Oliver K, Osborne J, Quail MA, Rajandream MA, Rogers J, Rutter S, Seeger K, Skelton J, Squares R, Squares S, Sulston JE, Taylor K, Whitehead S, Barrell BG: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998, 393 (6685): 537-544. 10.1038/31159.
McVean GA, Hurst GD: Evolutionary lability of context-dependent codon bias in bacteria. J Mol Evol. 2000, 50 (3): 264-275.
Perriere G, Thioulouse J: Use and misuse of correspondence analysis in codon usage studies. Nucleic Acids Res. 2002, 30 (20): 4548-4555. 10.1093/nar/gkf565.
Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG: Life with 6000 genes. Science. 1996, 274 (5287): 546, 563-567. 10.1126/science.274.5287.546.
Wood V, Gwilliam R, Rajandream MA, Lyne M, Lyne R, Stewart A, Sgouros J, Peat N, Hayles J, Baker S, Basham D, Bowman S, Brooks K, Brown D, Brown S, Chillingworth T, Churcher C, Collins M, Connor R, Cronin A, Davis P, Feltwell T, Fraser A, Gentles S, Goble A, Hamlin N, Harris D, Hidalgo J, Hodgson G, Holroyd S, Hornsby T, Howarth S, Huckle EJ, Hunt S, Jagels K, James K, Jones L, Jones M, Leather S, McDonald S, McLean J, Mooney P, Moule S, Mungall K, Murphy L, Niblett D, Odell C, Oliver K, O'Neil S, Pearson D, Quail MA, Rabbinowitsch E, Rutherford K, Rutter S, Saunders D, Seeger K, Sharp S, Skelton J, Simmonds M, Squares R, Squares S, Stevens K, Taylor K, Taylor RG, Tivey A, Walsh S, Warren T, Whitehead S, Woodward J, Volckaert G, Aert R, Robben J, Grymonprez B, Weltjens I, Vanstreels E, Rieger M, Schafer M, Muller-Auer S, Gabel C, Fuchs M, Dusterhoft A, Fritzc C, Holzer E, Moestl D, Hilbert H, Borzym K, Langer I, Beck A, Lehrach H, Reinhardt R, Pohl TM, Eger P, Zimmermann W, Wedler H, Wambutt R, Purnelle B, Goffeau A, Cadieu E, Dreano S, Gloux S, Lelaure V, Mottier S, Galibert F, Aves SJ, Xiang Z, Hunt C, Moore K, Hurst SM, Lucas M, Rochet M, Gaillardin C, Tallada VA, Garzon A, Thode G, Daga RR, Cruzado L, Jimenez J, Sanchez M, del Rey F, Benito J, Dominguez A, Revuelta JL, Moreno S, Armstrong J, Forsburg SL, Cerutti L, Lowe T, McCombie WR, Paulsen I, Potashkin J, Shpakovski GV, Ussery D, Barrell BG, Nurse P: The genome sequence of Schizosaccharomyces pombe. Nature. 2002, 415 (6874): 871-880. 10.1038/nature724.
Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002, 419 (6906): 498-511. 10.1038/nature01097.
Menninger JR: The accumulation as peptidyl-transfer RNA of isoaccepting transfer RNA families in Escherichia coli with temperature-sensitive peptidyl-transfer RNA hydrolase. J Biol Chem. 1978, 253 (19): 6808-6813.
Menez J, Heurgue-Hamard V, Buckingham RH: Sequestration of specific tRNA species cognate to the last sense codon of an overproduced gratuitous protein. Nucleic Acids Res. 2000, 28 (23): 4725-4732. 10.1093/nar/28.23.4725.
Hernandez-Sanchez J, Valadez JG, Herrera JV, Ontiveros C, Guarneros G: lambda bar minigene-mediated inhibition of protein synthesis involves accumulation of peptidyl-tRNA and starvation for tRNA. EMBO J. 1998, 17 (13): 3758-3765. 10.1093/emboj/17.13.3758.
Heurgue-Hamard V, Dincbas V, Buckingham RH, Ehrenberg M: Origins of minigene-dependent growth inhibition in bacterial cells. EMBO J. 2000, 19 (11): 2701-2709. 10.1093/emboj/19.11.2701.
Tenson T, Herrera JV, Kloss P, Guarneros G, Mankin AS: Inhibition of translation and cell growth by minigene expression. J Bacteriol. 1999, 181 (5): 1617-1622.
Cruz-Vera LR, Hernandez-Ramon E, Perez-Zamorano B, Guarneros G: The rate of peptidyl-tRNA dissociation from the ribosome during minigene expression depends on the nature of the last decoding interaction. J Biol Chem. 2003, 278 (28): 26065-26070. 10.1074/jbc.M301129200.
Varshavsky A: The N-end rule: functions, mysteries, uses. Proc Natl Acad Sci U S A. 1996, 93 (22): 12142-12149. 10.1073/pnas.93.22.12142.
Tsunasawa S, Stewart JW, Sherman F: Amino-terminal processing of mutant forms of yeast iso-1-cytochrome c. The specificities of methionine aminopeptidase and acetyltransferase. J Biol Chem. 1985, 260 (9): 5382-5391.
Ben-Bassat A, Bauer K, Chang SY, Myambo K, Boosman A, Chang S: Processing of the initiation methionine from proteins: properties of the Escherichia coli methionine aminopeptidase and its gene structure. J Bacteriol. 1987, 169 (2): 751-757.
Solbiati J, Chapman-Smith A, Miller JL, Miller CG, Cronan JE Jr: Processing of the N termini of nascent polypeptide chains requires deformylation prior to methionine removal. J Mol Biol. 1999, 290 (3): 607-614. 10.1006/jmbi.1999.2913.
Hirel PH, Schmitter MJ, Dessen P, Fayat G, Blanquet S: Extent of N-terminal methionine excision from Escherichia coli proteins is governed by the side-chain length of the penultimate amino acid. Proc Natl Acad Sci U S A. 1989, 86 (21): 8247-8251.
Moerschell RP, Hosokawa Y, Tsunasawa S, Sherman F: The specificities of yeast methionine aminopeptidase and acetylation of amino-terminal methionine in vivo. Processing of altered iso-1-cytochromes c created by oligonucleotide transformation. J Biol Chem. 1990, 265 (32): 19638-19643.
Tenson T, Ehrenberg M: Regulatory nascent peptides in the ribosomal tunnel. Cell. 2002, 108 (5): 591-594. 10.1016/S0092-8674(02)00669-4.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314 (5): 1041-1052. 10.1006/jmbi.2000.5197.
We thank Ülo Maiväli, Niilo Kaldalu, Arvi Jõers and Jonathan Ouellet for valuable comments on the manuscript, Tõnu Möls for advice on the use of statistical methods, and Katre Palm for proofreading the manuscript. This work was supported by a Wellcome Trust International Senior Fellowship (070210/Z/03/Z) and Estonian Science Foundation grant 5311. Maido Remm was supported by Estonian Ministry of Education and Research grant no. 0182649s04.
AT performed the data analysis. MR participated in the study's design and choice of methods. TT conceived the study and participated in its design and coordination. All authors read and approved the final manuscript.
Electronic supplementary material
Additional File 2: P-values for U, C, A and G in nucleotide positions 4–30. H0: there is no difference in nucleotide frequency between all genes and HEG. (ecoli – E. coli, bsubt – B. subtilis, hpylo – H. pylori, hinfl – H. influenzae, mtube – M. tuberculosis, tpall – T. pallidum, rprow – R. prowazekii, mgeni – M. genitalium, bburg – B. burgdorferi, afulg – A. fulgidus, mjann – M. jannaschii, phori – P. horikoshii, scere – S. cerevisiae, spomb – S. pombe, pfalc – P. falciparum). ↑: frequency is increased in HEG compared to all genes. ↓: frequency is decreased in HEG compared to all genes. -: no difference between HEG and all genes datasets. (PDF 42 KB)
Cite this article
Tats, A., Remm, M. & Tenson, T. Highly expressed proteins have an increased frequency of alanine in the second amino acid position. BMC Genomics 7, 28 (2006). https://doi.org/10.1186/1471-2164-7-28