In many microbial genomes, a strong preference for a small number of codons can be observed in genes whose products are needed by the cell in large quantities. This codon usage bias (CUB) improves translational accuracy and speed and is one of several factors optimizing cell growth. Whereas CUB and the overrepresentation of individual proteins have been studied in detail, it is still unclear which high-level metabolic categories are subject to translational optimization in different habitats.
In a systematic study of 388 microbial species, we have identified for each genome a specific subset of genes characterized by a marked CUB, which we named the effectome. As expected, gene products related to protein synthesis are abundant in both archaeal and bacterial effectomes. In addition, enzymes contributing to energy production and gene products involved in protein folding and stabilization are overrepresented. The comparison of genomes from eleven habitats shows that the environment has only a minor effect on the composition of the effectomes. As a paradigmatic example, we detailed the effectome content of 37 bacterial genomes that are most likely exposed to strongest selective pressure towards translational optimization. These effectomes accommodate a broad range of protein functions like enzymes related to glycolysis/gluconeogenesis and the TCA cycle, ATP synthases, aminoacyl-tRNA synthetases, chaperones, proteases that degrade misfolded proteins, protectants against oxidative damage, as well as cold shock and outer membrane proteins.
We show that effectomes consist of specific subsets of the proteome that are involved in several cellular functions. As expected, some functions are related to cell growth and affect the speed and quality of protein synthesis. Additionally, the effectomes contain enzymes of central metabolic pathways and cellular functions sustaining microbial life under stress situations. These findings indicate that cell growth is an important but not the only factor modulating translational accuracy and speed by means of CUB.
The composition of genes coding for ribosomal proteins and translation elongation factors is highly biased in many genomes. This codon usage bias (CUB) is due to a preference for a species-specific set of codons, which are named major codons. Their particular choice depends on the genomic GC-content and can be explained by amino acid specific rules. Beginning with pioneering work in the 1980s, it has been demonstrated convincingly that major codons are more accurately and more efficiently recognized by the most abundant tRNA species [3–10]. These findings support the hypothesis that major codons are used preferentially in genes coding for proteins required by the cell in large quantities (see  and references therein). A further analysis of microbial genomes made clear that CUB is one of several factors that optimize cell growth: Species exposed to selection for rapid growth possess more rRNA operons, more tRNA genes and use major codons more frequently [11, 12]. Additionally, it turned out that CUB is the best determinant of minimum generation time.
Based on different measures of CUB, the occurrence and function of translationally optimized gene products have been studied (see e.g. [14, 15] and references therein) and compiled, e.g., for Escherichia coli, Frankia, or yeast; however, most reports lack a statistical assessment. Broad multi-species analyses of 27 and 461 microbial genomes aimed at identifying preferred functional categories among codon-optimized genes.
In the following, we report a phylogeny- and habitat-specific analysis of a particular set of 388 microbial genomes. We found that gene products optimized for translational efficiency in the course of evolution contribute to protein synthesis, energy production, and protein folding. Compared to Bacteria, translational efficiency is less pronounced in Archaea and restricted to a smaller number of gene functions. In most cases, the function of translationally optimized gene products is only marginally affected by the habitat.
Results and Discussion
The GCB-approach constitutes a quantitative measure of translational efficiency for a broad range of genomes
A number of measures for CUB have been used to predict translationally efficient genes in microbial genomes (e.g. [1, 21–28]). Generally, H1-methods are more suitable to determine the bias associated with translational efficiency than other approaches. A recent comparison of several measures has shown that two H1-methods, namely the MELP algorithm and the GCB-approach, have the most consistent behaviour for predicting the expression level of individual genes. The GCB-approach, which we utilize in the following, is based on CB-scores determined species-specifically for each codon; see Methods. We first computed these scores for the 912 microbial entries of the NCBI RefSeq database, which consists of a curated and non-redundant collection of reference genomes. We have implemented a web-server (accessible via http://www-bioinf.uni-regensburg.de) that calculates the GCB-score of a gene sequence for a wide variety of microbial species. GCB-scores take positive and negative values; the more positive a score is, the higher is the fraction of major codons in the considered gene.
The numbers of rRNA genes, of tRNA genes, and the genome-wide strength of CUB are highly correlated [11, 13]. In order to confirm that GCB-values quantify the strength of CUB on the genome level as well, we correlated mean GCB-values and the number of tRNA genes in analogy to . The mean GCB-value was determined for each genome (see Methods) and taken as a measure for the species-specific strength of CUB. To minimize the risk of false positive classification when identifying translationally optimized genes, we selected those 388 genomes showing a marked CUB (see Additional file 1, Table S1 for a listing and Methods for the selection procedure). A plot of these 388 values versus the species-specific number of tRNA genes is shown in Figure 1A. A Spearman rank correlation confirmed a statistically significant correspondence between the mean GCB-values and the number of tRNA genes (rs = 0.71, p < 0.001), which is stronger than the one deduced from S values , an alternative measure of CUB. For 113 of these 388 species the minimum generation time is listed in . A plot of these numbers versus the mean GCB-values is shown in Figure 1B. Again, a Spearman rank correlation confirmed a statistically significant correspondence (rs = -0.75, p < 0.001), which is stronger than the one published elsewhere. Most likely, the stronger correlation is in both cases due to our focusing on genomes showing a marked CUB.
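The rank correlation used in this comparison can be illustrated with a minimal, dependency-free sketch. The function names and the toy numbers are hypothetical and not taken from the study; the sketch only shows how Spearman's rs can be obtained as the Pearson correlation of average ranks.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rs computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical toy data: mean GCB-value and number of tRNA genes per genome.
mean_gcb = [0.05, 0.08, 0.10, 0.15, 0.20]
trna_genes = [40, 55, 50, 90, 110]
rs = spearman_rs(mean_gcb, trna_genes)
print(round(rs, 3))
```

A significance estimate would additionally require a permutation test or a statistics library; that step is omitted here.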
We conclude from these findings and previous results [30, 31] that the GCB-approach allows us to quantify strength of CUB in a consistent manner for a broad range of genes and genomes and to identify translationally optimized genes. Note that we use the term "translationally optimized" for genes showing a marked CUB. As we did not correlate CUB values and mRNA concentrations for a larger set of species, the term optimization as used here is not necessarily related to the expression level of a gene.
In the following, we name the subset of an individual proteome constituted by genes with a GCB-value ≥ 0.0 the effectome and the above-introduced set of 388 microbial genomes showing a pronounced codon usage bias MG_CUB. Subsets selected for a specific habitat are named MG_CUB(subset); e.g., MG_CUB(Bacteria_TH) comprises the genomes of 13 thermophilic Bacteria belonging to MG_CUB.
In a previous study  M. Carbone has aimed at characterizing the set of genes deemed to be essential for any given bacterial species. In this context, the set of species-specific genes possessing a marked CUB has been named functional genomic core. Although the approach of identifying translationally optimized genes is similar to ours, we did not utilize this term for two reasons: 1) The concept of a genomic core has been coined to address the set of intrinsically conserved genes of a phylogenetic group like Archaea (see e.g.  and references therein). Thus, the above term might be misinterpreted. 2) Irrespective of the strength of CUB within an individual genome, those 200 genes showing strongest CUB have been analyzed in . In contrast, a species-specific effectome consists of a gene set whose size and composition is exclusively determined by CUB and a well-defined cut off.
Figure 1 shows for Bacteria that the strength of CUB varies markedly among species. A comparison of mean GCB-values makes clear that taxonomical position and lifestyle affect the bias: The mean GCB-values of mesophilic and thermo-/hyperthermophilic Archaea are similar; the means are 0.070 and 0.074, respectively. In contrast, CUB of psychrophilic/mesophilic Bacteria is higher than that of thermophilic species. The means are 0.10 and 0.06, respectively, and a Mann-Whitney rank sum test indicated a statistically significant difference between the two distributions of mean GCB-values (p = 0.003). These findings indicate that CUB is less pronounced in Archaea and strongest in mesophilic and psychrophilic Bacteria.
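A rank sum comparison of two groups of genomes can be sketched as follows. This is a simplified illustration under stated assumptions: the p-value comes from the normal approximation without tie or continuity correction (which is rough for small samples), and the function name and the toy values are hypothetical, not data from the study.

```python
from math import sqrt, erfc

def mann_whitney_u(a, b):
    """Return (U statistic of sample a, two-sided p-value).

    Assumption: normal approximation, no tie or continuity correction.
    """
    # U counts the pairs (x in a, y in b) with x > y; ties count one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = erfc(abs(z) / sqrt(2))  # two-sided tail probability of a standard normal
    return u, p

# Hypothetical mean GCB-values for two groups of genomes.
mesophilic = [0.09, 0.11, 0.12, 0.10, 0.13]
thermophilic = [0.05, 0.06, 0.07, 0.04, 0.08]
u, p = mann_whitney_u(mesophilic, thermophilic)
print(u, p < 0.05)
```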
As pointed out in , these differences are most likely due to the dependence of enzyme activity on temperature and might explain CUB in Bacteria. At higher temperatures, diffusion increases, and viscosity and activation energy decrease; both effects facilitate rapid reactions. Therefore, selective pressure on CUB is presumably weaker for thermophilic species. By analogy, stronger CUB might be necessary for psychrophilic species to reach a tolerable growth rate.
Species for which speed and efficiency of growth and replication were strong selective forces during evolution are characterized by a high number of tRNA genes. As we expected the widest range of protein functions in the related effectomes, we selected for further analysis those 37 bacterial genomes possessing more than 90 tRNA genes. The composition of the respective subset MG_CUB(Bacteria_HITR) is listed in Additional file 1, Table S2. The mean GCB-value of this set is 0.15 and indicates a strong selective pressure. Concordantly, the mean of minimum generation times for those 14 species of MG_CUB(Bacteria_HITR) listed in  is 48 min, which is significantly lower than the mean (more than 8 hours) deduced from the whole list.
Depending on the methods used to assess CUB, different fractions of CUB genes have been identified. It has been reported that CUB can be detected in ~28%, ~50%, ~70%, or ~100% of microbial genomes. In the light of these findings, our choice of ~42% of the genomes was a rather conservative approach. Here we decided in agreement with  and suggest that the lifestyle of a microbe determines the strength of CUB. For species which we did not consider due to a small mean GCB-value, we assumed the relative unimportance of exponential growth.
The effectomes encode a broad and specific range of gene functions
Each analysis of a single proteome reveals a small number of translationally optimized gene products. However, to identify general trends that can be subjected to statistical analyses, one has to explore several genomes and to link the contribution of individual gene products to a more general description of cellular functions. To achieve a multi-level categorization of gene products, we utilized Gene Ontology terms  in combination with the classification system of FunCat .
Gene Ontology (GO) terms allow the description of gene products by means of a strict vocabulary organized in a hierarchical way. However, assessing the most granular GO-terms used to annotate genes is inappropriate for our purposes: E.g. in E. coli, the GO-term "DNA binding" (GO:0004803) is an attribute of transposases, the DnaK suppressor protein, subunits of the DNA polymerase III, elements of prophages, transcription activators, and helicases. Therefore, it is difficult to interpret the overrepresentation of this term in a biologically meaningful way. An overrepresentation of the GO-term "RNA binding" in the effectomes is most probably related to the abundance of ribosomal proteins. These examples demonstrate that higher-level descriptions of gene functions have to be exploited to deduce biologically meaningful results. As an alternative to the analysis of a GO slim (a set of higher level GO-terms) we decided to utilize FunCat categories. FunCat  is a functional annotation scheme for the systematic classification of proteins from whole genomes. Utilizing FunCat has an important advantage over GO-terms: As the number of categories needed to classify effectomes is low, we could compare the full composition of the effectomes and the whole genomes by means of robust statistical tests.
For an analysis, we deduced for each gene-product GO-terms, mapped them onto high-level FunCat categories and assessed their abundance. Relative frequencies of each category were determined both for a complete dataset MG_CUB(subset) and the respective effectomes. To quantify the abundance of a category Cat within a set of effectomes, we computed the term AbundEff(Cat) which is the log-odds ratio of relative frequencies (see Methods). An AbundEff(Cat) value above zero indicates that Cat is overrepresented in the effectomes, a value below zero signals an underrepresentation. To this end, we determined for the set of all archaeal and all bacterial genomes AbundEff-scores for FunCat categories of level 1, which is the most abstract level of describing protein functions. Results are plotted in Figure 2 and listed in Table 1. These differences in the composition of the effectomes and the underlying whole-genome datasets are statistically significant as confirmed by a chi-square test (p < 0.001).
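The comparison of category compositions can be illustrated with a short sketch. It assumes a goodness-of-fit formulation in which the whole-dataset frequencies define the expected effectome counts; the exact test variant is not specified in the text, and the function name and the counts are hypothetical.

```python
def chi_square_gof(observed, reference):
    """Pearson chi-square statistic and degrees of freedom.

    observed:  category -> count in the effectomes
    reference: category -> count in the whole dataset (all counts > 0)
    """
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    stat = 0.0
    for cat, ref_count in reference.items():
        # Expected effectome count if the effectome mirrored the whole dataset.
        expected = n_obs * ref_count / n_ref
        stat += (observed.get(cat, 0) - expected) ** 2 / expected
    df = len(reference) - 1
    return stat, df

# Hypothetical counts per FunCat level 1 category.
all_counts = {"energy": 100, "protein synthesis": 100, "transcription": 200}
eff_counts = {"energy": 20, "protein synthesis": 50, "transcription": 10}
stat, df = chi_square_gof(eff_counts, all_counts)
print(stat, df)
```

A p-value follows from the chi-square distribution with `df` degrees of freedom, for example via `scipy.stats.chi2.sf(stat, df)` if SciPy is available.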
The comparison of AbundEff-values indicates a trend towards the translational optimization of several systemic functions. In the following, the number of FunCat categories is given in brackets after their name. As expected, proteins contributing to "protein synthesis" (12) are a major element of the effectomes. In addition, the category "energy" (2) is overrepresented. These findings show that effectomes are to a great extent composed of proteins being related to cell growth and energy production. However, the underrepresentation of "metabolism" (1) and of "transcription" (11) indicates that there is no general trend to optimize translational efficiency of all functions related to cell growth. The categories "cellular communication/signal transduction mechanism" (30), "transposable elements, viral and plasmid proteins" (38) and "regulation of metabolism and protein function" (18) have lowest AbundEff-values. Most likely, due to their uncritical cellular concentration, elements of regulatory processes (categories 18 and 30) do not undergo optimization of translational efficiency. The codon usage of transposable elements (category 38) and of alien genes is frequently not optimized for their host  which explains most likely their underrepresentation in the effectomes. Alternatively, we utilized COG-categories  for high level classification because of their different approach of grouping genes. Results are listed in Additional file 1, Table S3 and confirm the general trends. In summary, the analysis reveals a consistent tendency, which is at the systemic level independent of taxonomical position: Both in Archaea and in Bacteria, translationally optimized genes are involved in protein synthesis; additionally they contribute to various cellular functions as e.g. to energy production.
The habitat has a minor effect on the composition of the effectomes
To study the impact of the habitat on the composition of effectomes, we determined AbundEff-values for the set of all archaeal and all bacterial effectomes, for MG_CUB(Bacteria_HITR) and for subsets of hyperthermophilic, thermophilic, mesophilic, psychrophilic, aquatic, terrestrial, host-associated, aerobic, anaerobic, non-halophilic, and moderately halophilic Archaea or Bacteria contributing to MG_CUB, if the subset contained at least seven genomes; see Additional file 1, Table S4. For a more detailed analysis of the effectomes and to corroborate the overrepresentation of specific functions not detectable at FunCat level 1, we determined AbundEff-values for FunCat categories of level 2 and compiled them in Additional file 2. Table 2 lists for 12 habitats categories overrepresented in at least one subset of archaeal or bacterial effectomes. In agreement with the above findings, sub-categories related to "protein synthesis" (12.01, 12.04, 12.07) are overrepresented in archaeal and bacterial effectomes. Additionally, specific functions belonging to "protein folding and stabilization" (14.01) are overrepresented both in bacterial and archaeal effectomes. Compared to Bacteria, archaeal effectomes contain a smaller number of gene products related to energy production. In bacterial effectomes enzymes being parts of "glycolysis and gluconeogenesis" (2.01) and of the "tricarboxylic-acid pathway (citrate cycle, Krebs cycle, TCA cycle)" (2.10) are the dominating elements of energy production. All other protein functions are less overrepresented in bacterial effectomes. As expected, AbundEff-values of bacterial genomes being most optimized for cell growth [represented by MG_CUB(Bacteria_HITR)] are in many cases most extreme (compare Table 2) and deviate in some cases from general tendencies. In summary, a comparison of the AbundEff-values indicates two general trends: 1) The composition of archaeal effectomes is focused on a smaller number of systemic gene functions. 
2) The habitat has only a minor effect on effectome composition. Figure 3 illustrates the latter finding for nine bacterial habitats: In nearly all cases, the strength of over- or underrepresentation is similarly high.
A paradigmatic case: The effectome composition of bacterial genomes being strongly optimized for cell growth
In order to analyze effectome composition on the level of individual gene products, we used the eggNOG database , which consists of functionally annotated clusters of orthologous genes (COGs) . Additionally, we mapped enzymes onto reference pathways of the KEGG database . To study a prominent example, we analyzed the effectomes of those Bacteria which show strongest signals of translational optimization [the set MG_CUB(Bacteria_HITR)]. The composition of these effectomes is compiled in Additional file 3; respective identifiers for the eggNOG and KEGG database are listed in Additional file 4. Some examples that substantiate the broad range of gene functions contributing to these effectomes are given in the following list, which is sorted according to FunCat categories and annotated according to eggNOG.
Glycolysis and gluconeogenesis (2.01)
Enolases, which are essential for the degradation of carbohydrates via glycolysis; other enzymes of central pathways like glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase; fructose/tagatose bisphosphate aldolase; the pyruvate/2-oxoglutarate dehydrogenase complex; triosephosphate isomerase; 3-phosphoglycerate kinase.
Electron transport and membrane-associated energy conservation (2.11)
Elements of the F0F1-type ATP synthase.
Ribosome biogenesis (12.01)
All ribosomal proteins of both subunits.
Translation initiation factors 1, 2, 3; translation elongation factors Tu, Ts, P; the ribosome recycling factor; aminoacyl-tRNA synthetases (see 12.1); ribosomal proteins (see 12.01).
Translational control (12.07)
Bacterial nucleoid DNA-binding protein.
Aminoacyl-tRNA synthetases (12.1)
Synthetases transferring 16 different amino acids occur in the effectomes. The missing tRNA synthetases are related to Gln, His, Cys and Trp.
Protein folding and stabilization (14.01)
Several proteins involved in protein folding and stabilization like chaperones; the peptidyl-prolyl cis-trans isomerase (rotamase), which accelerates the folding of proteins; the parvulin-like peptidyl-prolyl isomerase, which plays a major role in protein secretion; the protease subunit of ATP-dependent Clp proteases, which are important for the degradation of misfolded proteins; the cell division GTPase, which is essential for the cell-division process.
Superoxide dismutase, destroying radicals which are normally produced within the cells and which are toxic to biological systems.
The DNA-binding ferritin-like protein, which protects DNA from oxidative damage.
Cold shock proteins, inhibiting DNA replication at both initiation and elongation steps; the pleiotropic transcriptional repressor, which represses the expression of many genes that are induced as cells make the transition from rapid exponential growth to stationary phase; elements of the glycine cleavage system, which catalyzes the degradation of glycine; glycine/serine hydroxymethyltransferase, which supports the interconversion of serine and glycine; nucleoside diphosphate kinase, which is involved in the synthesis of nucleoside triphosphates other than ATP; adenylosuccinate synthase, which belongs to the de novo pathway of purine nucleotide biosynthesis; several outer membrane proteins.
The mapping of enzymes belonging to FunCat categories 2.01, 2.10, and 2.11 onto KEGG reference pathways makes clear that all enzymes constituting the core of the glycolysis/gluconeogenesis pathway and the TCA cycle are elements of these effectomes; see Additional file 5, Figure S1 and Additional file 6, Figure S2.
The analysis of multiple genomes allows a fine grained correlation of CUB and gene functions
Due to the small number of CUB genes being identified in a single genome, former analyses of individual genomes or small sets of related species (see e.g. ) could identify only a small set of individual gene functions being translationally optimized. These results have been confirmed by [19, 20] and our findings. These three multi-species analyses agree in detecting an overrepresentation of translationally optimized genes in central metabolic functions like in protein synthesis or energy production. However, for other high level functions, some findings presented here and in  or  differ.
Considering individual genes, many of our results coincide with the outcome of , which is based on a smaller set of genomes. This is also true for less pronounced gene functions like the elements of the photosynthesis system of Synechocystis, the role of ferredoxin in Pyrococcus abyssi and the central enzymes of methane metabolism in Methanosarcina acetivorans. In contrast, all proteins involved in acetoclastic methanogenesis  do not belong to the effectome of M. acetivorans, as their GCB-value is ≤ -0.03. The conclusions drawn on the level of metabolic pathways are contrary in some cases, too. For example, in the effectomes of Archaea and Bacteria elements of the transcription apparatus (FunCat category 11) and of transmembrane signal transduction (FunCat category 30.05) are significantly underrepresented, which is in contrast to the postulated composition of functional genomic cores . Our approach regards a metabolic function as translationally optimized only if more than the expected number of related genes shows a marked CUB. It is a matter of debate whether CUB in a small number of related genes is sufficient to declare a whole metabolic process as translationally optimized.
A recently published study  has been based on a machine learning approach for the identification of genes possessing an optimized codon usage (OCU). On average, the considered genomes contained 13.2% OCU genes; in extreme cases, 33% of the genomic content was OCU. These genes have been utilized to corroborate the enrichment or depletion of metabolic functions which have been characterized by means of GO-terms. In contrast, the effectomes analyzed here are much smaller: 86% of the effectomes are constituted by at most 5% of the respective genomic content; only four Borrelia species possess effectomes containing more than 25% of their genes. Despite these differences in the number of CUB genes, the outcomes of both studies overlap to a great extent considering high-level metabolic functions. For example, "electron transport and membrane associated energy conservation" (FunCat category 2.11) and the respective GO-term "ATP synthesis coupled proton transport" were reported as overrepresented. The same is true for functions related to protein folding and elements of energy production like the TCA cycle. Both studies identify an underrepresentation of functions related to "DNA repair" and "inorganic ion transport" (see Additional file 1, Table S3). On the other hand, an enrichment of functions related to antibiotic biosynthesis, nitrogen fixation and iron-sulfur cluster assembly has only been observed among OCU genes.
Most interestingly, both analyses made clear that the habitat has only little effect on the set of translationally optimized genes. The habitat-specific analyses did not identify an additional translationally optimized high-level metabolic function. However, considering more specific functions, some habitat-specific findings differ. For example, the overrepresentation of aminoacyl-tRNA synthetases was only identified for MG_CUB(Bacteria_HITR). Most plausibly, these disparities as well as those of enrichment/depletion factors are due to the approach-specific choice of analysed gene sets: Effectomes contain exclusively genes showing a marked CUB found in a small set of genomes, whereas OCU genes are larger subsets of genomes and have been recruited from a larger set of species. This might, e.g., explain why the overrepresentation of genes related to bacterial chromatin is much lower in the effectomes than among OCU genes. The ratios of enrichment factors are 1.34/6.43 for Fis, 1.27/6.21 for IHF, and 1.64/3.82 for Dps, respectively. On the other hand, the maximal enrichment factor for GO-terms among bacterial OCU genes is 8.3. In bacterial effectomes "ribosomal biogenesis" is overrepresented more than 10-fold and "cellular communication" and "transposable elements, viral and plasmid proteins" are depleted more than 10-fold. These differences suggest as future work a more detailed analysis of translationally optimized genes categorized according to the individual strength of CUB.
The analysis of effectomes contributes to a more detailed understanding of critical conditions in microbial life
Most of our knowledge about molecular biology and the physiology of microorganisms has been deduced from batch culture, chemostats, and turbidostats. However, this state of balanced growth is completely unnatural for practically all microbes . In many natural habitats nutrients and energy supplies are limited most of the time. This is why microbes exist in a continuous state of starvation and are in addition competing with other microorganisms for survival. It is difficult to simulate such situations in wet-lab experiments.
In contrast, CUB is the result of selection that shapes individual genomes on an evolutionary timescale. Thus, analysing CUB allows the identification of cellular functions requiring the optimization of translational efficiency in the natural environment. This is why the composition of the effectomes indicates critical elements of metabolic functions and identifies proteins whose translational accuracy and speed is crucial in situations occurring frequently in the typical microbial habitat.
Knowing these critical functions is an important value in itself, but this knowledge might also be relevant for the tailoring of productive strains. For example, our analysis of bacterial genomes being strongly optimized for cell growth made clear that aminoacyl-tRNA synthetases are overrepresented in the respective effectomes. If related strains are used for protein production, it is plausible to assess codon usage and the in vivo concentration of these enzymes in order to maximize the yield.
A comparison of mean GCB-values and the composition of the effectomes highlights a consistent trend: Generally and independent of the strength of CUB, several central functions involved in protein synthesis, energy production, and protein folding are translationally optimized. Additionally, in certain habitats and due to the prevalent selective forces, both the strength of CUB and the palette of translationally optimized gene products increase. This hypothesis is supported by the above-mentioned overrepresentation of aminoacyl-tRNA synthetases. The effectomes of MG_CUB(Bacteria_HITR) contain tRNA synthetases that load 16 different amino acids. Most plausibly, three synthetases do not occur in the effectomes because they are related to amino acids (Trp, Cys, His) which are rare in microbial proteins. The fourth and last aminoacyl-tRNA synthetase missing in the effectomes is the glutaminyl-tRNA synthetase. In several Bacteria, Gln-tRNAGln is produced through the transamidation of misacylated Glu-tRNAGln by a Glu-tRNAGln amidotransferase (consisting of subunits A, B, C). Due to the small number of genes, a statistically sound analysis is not possible in this case. However, in the genomes of Bacillus cereus, Bacillus anthracis, and Bacillus thuringiensis, which lack a glutaminyl-tRNA synthetase, the large subunits A and B of the aspartyl/glutamyl-tRNA(Asn/Gln) amidotransferase have only slightly negative GCB-values (-0.05 and -0.08, respectively). This finding is a further indicator for the fine-tuned composition of microbial effectomes.
In competitive environments nature has found many ways of improving cell growth and response times. A stunning example is the distinctive codirectionality of replication and transcription as e.g. seen in Clostridium tetani. 82% of the genes are transcribed in the same direction as DNA replication . Along these lines, our findings highlight a further facet of the complexity of microbial genomes, their composition, and regulation by confirming the importance of translational efficiency for a large number of protein functions.
Cell growth is an important but not the only factor modulating translational efficiency
Clearly, the optimization of protein synthesis is the strongest selective factor dominating the composition of effectomes. This statement is supported by the finding that aminoacyl-tRNA synthetases loading abundant amino acids have been optimized by evolution for translational accuracy and speed. However, the underrepresentation of protein functions involved in transcription and metabolism makes clear that only a specific subset of functions related to cell growth is subject to translational optimization. Our results show that several selective forces modulate the level of translational efficiency. This hypothesis is supported by the overrepresentation in the effectomes of chaperones, which assist protein folding, and of proteases, which degrade misfolded proteins. Minimizing damage due to radicals and oxygen as well as the rapid control of DNA replication and gene expression are additional and crucial tasks supported by translationally optimized gene products.
MG_CUB, a non-redundant set of microbial genomes with a marked CUB
We used the microbial genomes section of the Reference Sequence database (RefSeq, version as of Feb. 2009, 912 replicons) to access a non-redundant collection of richly annotated chromosomes. To concentrate on species with a marked CUB that indicates translational efficiency, we selected datasets containing at least five ribosomal genes with a GCB-value ≥ 0.0 and at least one gene with a GCB-value ≥ 0.1. After eliminating entries belonging to the same taxonomical genus, the complete dataset MG_CUB contained 388 microbial genomes (see Additional file 1, Table S1). The subset of bacterial genomes MG_CUB(Bacteria) subsumes 370 entries with 1 175 058 genes. The subset MG_CUB(Archaea) contains 18 genomes and 39 092 genes. Analogously, subsets containing habitat- or taxon-specific groups HS were named MG_CUB(taxon_HS); e.g., MG_CUB(Bacteria_TH) is the subset of genomes from thermophilic Bacteria; see Additional file 1, Table S4 for details. All subsets analyzed here contain at least seven genomes. The habitat of the microbes was taken from the file ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/lproks_0.txt. Minimum generation times are from .
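The selection criteria described above can be written down as a small filter. This is a sketch under assumptions: the data layout (tuples of GCB-value and a ribosomal flag) and the function names are hypothetical, and keeping the first qualifying genome per genus is an arbitrary choice, since the text does not state which entry of a genus was retained.

```python
def has_marked_cub(genes):
    """genes: list of (gcb_value, is_ribosomal) tuples for one genome."""
    ribosomal_hits = sum(1 for gcb, is_ribo in genes if is_ribo and gcb >= 0.0)
    strong_hit = any(gcb >= 0.1 for gcb, _ in genes)
    # At least five ribosomal genes with GCB >= 0.0 and one gene with GCB >= 0.1.
    return ribosomal_hits >= 5 and strong_hit

def select_mg_cub(genomes):
    """genomes: iterable of (name, genus, genes).

    Assumption: the first qualifying genome of each genus is kept.
    """
    selected, seen_genera = [], set()
    for name, genus, genes in genomes:
        if genus in seen_genera or not has_marked_cub(genes):
            continue
        seen_genera.add(genus)
        selected.append(name)
    return selected

# Hypothetical toy genomes.
marked = [(0.02, True)] * 5 + [(0.15, False)]
too_few_ribosomal = [(0.02, True)] * 4 + [(0.15, False)]
no_strong_gene = [(0.02, True)] * 5 + [(0.05, False)]
genomes = [
    ("A1", "A", marked),
    ("A2", "A", marked),             # same genus as A1, dropped
    ("B1", "B", too_few_ribosomal),  # fails the ribosomal criterion
    ("C1", "C", no_strong_gene),     # fails the GCB >= 0.1 criterion
    ("D1", "D", marked),
]
print(select_mg_cub(genomes))
```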
Determination of GCB-values
The GCB-approach follows the classical and proven concept of scoring functions, as utilized e.g. for sequence comparison [46, 47] or the identification of horizontally transferred genes. It is based on a species-specific set of codon bias (CB) scores. A CB-score is defined as
$$CB^{j}(cdn_i) = \log \frac{f^{j}_{ref}(cdn_i)}{f^{j}_{mean}(cdn_i)}$$
Here $f^{j}_{mean}(cdn_i)$ is the mean frequency of codon $cdn_i$ in the genome of species j and $f^{j}_{ref}(cdn_i)$ is the codon usage in a set of reference genes from genome j. As has been shown, ribosomal genes are a valid starting point to determine reference frequencies [30, 49]. In order to deduce CB-scores from the genes with the strongest bias, the GCB-approach starts with the codon frequencies of ribosomal genes and utilizes a steepest gradient method to iteratively improve the CB-scores similarly to the concept of . Using CB-values, the GCB-score of an individual gene from species j is determined as
$$GCB(gene_{k,j}) = \frac{1}{n} \sum_{l=1}^{n} CB^{j}(cdn_l)$$
where the sum runs over the n codons of the gene. As a measure for the strength of CUB in a genome j, we utilize the mean
$$\overline{GCB}(j) = \frac{1}{m} \sum_{k=1}^{m} GCB(gene_{k,j})$$
deduced from the m genes of genome j with $GCB(gene_{k,j}) \geq 0.0$, i.e. the species-specific effectome. By limiting the calculation to the content of the effectome, we avoid the likely distortion of the mean value caused by horizontally acquired genes. Due to their origin, most alien genes possess an unrelated codon usage. Thus, a mean GCB-value inferred from the whole genome depends on the origin and the fraction of alien genes, which might render this indicator of translational efficiency useless.
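To make the scoring scheme concrete, the following Python sketch computes CB-scores, per-gene GCB-values and the effectome mean for a toy set of coding sequences. It rests on several assumptions that are not taken from the text: a CB-score is treated as the logarithm of the ratio of reference to genome-wide codon frequency, the GCB-value of a gene as the mean CB-score of its codons, frequencies as relative frequencies over all codons, and a pseudocount is added to avoid taking the logarithm of zero. The iterative steepest gradient refinement of the reference set is not implemented, and all function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def codons(seq: str) -> List[str]:
    """Split a coding sequence into complete, non-overlapping codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(genes: List[str], alphabet: List[str],
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Relative codon frequencies with a pseudocount (an assumption of this sketch)."""
    counts = Counter(c for g in genes for c in codons(g))
    total = sum(counts[c] for c in alphabet) + pseudocount * len(alphabet)
    return {c: (counts[c] + pseudocount) / total for c in alphabet}


def cb_scores(all_genes: List[str], reference_genes: List[str]) -> Dict[str, float]:
    """Assumed CB-score: log of reference frequency over genome-wide mean frequency."""
    alphabet = sorted({c for g in all_genes for c in codons(g)})
    f_mean = codon_frequencies(all_genes, alphabet)
    f_ref = codon_frequencies(reference_genes, alphabet)
    return {c: math.log(f_ref[c] / f_mean[c]) for c in alphabet}


def gcb(gene: str, cb: Dict[str, float]) -> float:
    """Assumed GCB-value: mean CB-score over the scored codons of one gene."""
    scored = [cb[c] for c in codons(gene) if c in cb]
    return sum(scored) / len(scored) if scored else 0.0


def mean_gcb(all_genes: List[str], cb: Dict[str, float],
             cutoff: float = 0.0) -> float:
    """Mean GCB-value restricted to the effectome (genes with GCB >= cutoff)."""
    effectome = [s for s in (gcb(g, cb) for g in all_genes) if s >= cutoff]
    return sum(effectome) / len(effectome) if effectome else float("nan")
```

In a real analysis, `reference_genes` would initially hold the ribosomal genes of the genome, and the CB-scores would then be refined iteratively; restricting `mean_gcb` to genes at or above the cutoff mirrors the effectome-based mean described above.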
Mapping genes to FunCat and scoring the abundance of categories
For each gene product, GO-terms were used to relate the product to FunCat categories. The FunCat system is a one-to-many mapping of individual gene products to functional categories. For that reason, we have not reported on categories like "protein with binding function or cofactor requirement (structural or catalytic)" (16), which are in the case of effectomes dominated by ribosomal proteins. For a taxon- or habitat-specific set of genomes MG_CUB(taxon_HS), the number of gene products contributing to each FunCat category Cat was determined both for the whole dataset (#All(Cat)) and for those genes belonging to the related effectomes (#Eff(Cat)). A log-odds score AbundEff was deduced from the resulting frequencies fAll(Cat) and fEff(Cat) as
$$AbundEff(Cat) = \log_{10} \frac{f_{Eff}(Cat)}{f_{All}(Cat)}$$
A log-odds score above zero indicates that the effectome contains more genes of a category than expected from the distribution of categories in the whole dataset. AbundEff-values quantify over- and underrepresentation of categories symmetrically about zero according to log(a) = -log(1/a). For example, a two-fold overrepresentation gives log(2) = 0.30 and the corresponding two-fold underrepresentation gives log(1/2) = -0.30. In order to avoid outliers caused by too small a number of samples, we only analyzed subsets that contained at least seven genomes.
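The abundance score can be illustrated with a short Python sketch. It assumes that the frequencies fAll(Cat) and fEff(Cat) are obtained by dividing the category counts by the total number of category assignments in the respective set, and that the logarithm is taken to base 10, which is consistent with log(2) = 0.30. The function name and the handling of empty categories are choices made for this example only.

```python
import math
from typing import Dict


def abund_eff(n_all: Dict[str, int], n_eff: Dict[str, int]) -> Dict[str, float]:
    """Log-odds abundance of each category in the effectome versus the whole dataset.

    n_all: category -> number of gene products in the whole dataset (#All)
    n_eff: category -> number of gene products in the effectomes (#Eff)
    """
    total_all = sum(n_all.values())
    total_eff = sum(n_eff.values())
    scores: Dict[str, float] = {}
    for cat, count_all in n_all.items():
        count_eff = n_eff.get(cat, 0)
        if count_all == 0 or count_eff == 0:
            continue  # the log-odds score is undefined for empty categories
        f_all = count_all / total_all
        f_eff = count_eff / total_eff
        scores[cat] = math.log10(f_eff / f_all)
    return scores


# Toy counts (invented for illustration, not data from the study)
example = abund_eff({"protein synthesis": 100, "metabolism": 900},
                    {"protein synthesis": 20, "metabolism": 80})
```

With these toy counts, the category "protein synthesis" is twice as frequent in the effectome as in the whole dataset and therefore scores log10(2), whereas "metabolism" receives a negative score.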
Determining the function of individual proteins
The UniProt interface was used to map RefSeq identifiers of individual genes onto UniProtKB accession numbers, which were fed into the eggNOG database. Thus, we deduced for a set of genes from different genomes a categorized description of protein function in terms of COG classes. Based on RefSeq identifiers, KO-numbers were determined, mapped onto KEGG reference pathways, and plotted color-coded.
References
Sharp PM, Li WH: The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15 (3): 1281-1295. 10.1093/nar/15.3.1281.
Ikemura T: Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J Mol Biol. 1981, 146 (1): 1-21. 10.1016/0022-2836(81)90363-6.
Karlin S, Brocchieri L, Campbell A, Cyert M, Mrazek J: Genomic and proteomic comparisons between bacterial and archaeal genomes and related comparisons with the yeast and fly genomes. Proc Natl Acad Sci USA. 2005, 102 (20): 7309-7314. 10.1073/pnas.0502314102.
Sen A, Sur S, Bothra AK, Benson DR, Normand P, Tisa LS: The implication of life style on codon usage patterns and predicted highly expressed genes for three Frankia genomes. Antonie Van Leeuwenhoek. 2008, 93 (4): 335-346. 10.1007/s10482-007-9211-1.
Das S, Roymondal U, Sahoo S: Analyzing gene expression from relative codon usage bias in Yeast genome: a statistical significance and biological relevance. Gene. 2009, 443 (1-2): 121-131. 10.1016/j.gene.2009.04.022.
Coghlan A, Wolfe KH: Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast. 2000, 16 (12): 1131-1145. 10.1002/1097-0061(20000915)16:12<1131::AID-YEA609>3.0.CO;2-F.
Pruitt KD, Tatusova T, Klimke W, Maglott DR: NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009, 37 (Database issue): D32-36. 10.1093/nar/gkn721.
Makarova KS, Sorokin AV, Novichkov PS, Wolf YI, Koonin EV: Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biol Direct. 2007, 2: 33. 10.1186/1745-6150-2-33.
Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M, et al: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004, 32 (18): 5539-5545. 10.1093/nar/gkh894.
Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, von Mering C, Doerks T, Jensen LJ: eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res. 2010, 38 (Database issue): D190-195. 10.1093/nar/gkp951.
Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010, 38 (Database issue): D355-360. 10.1093/nar/gkp896.
Galagan JE, Nusbaum C, Roy A, Endrizzi MG, Macdonald P, FitzHugh W, Calvo S, Engels R, Smirnov S, Atnoor D, et al: The Genome of M. acetivorans Reveals Extensive Metabolic and Physiological Diversity. Genome Res. 2002, 12 (4): 532-542. 10.1101/gr.223902.
Brüggemann H, Bäumer S, Fricke WF, Wiezer A, Liesegang H, Decker I, Herzberg C, Martinez-Arias R, Merkl R, Henne A, et al: The genome sequence of Clostridium tetani, the causative agent of tetanus disease. Proc Natl Acad Sci USA. 2003, 100 (3): 1316-1321. 10.1073/pnas.0335853100.
Acknowledgements
We thank M. Münsterkötter for providing us with a mapping of GO ID-values to FunCat categories and R. Sterner and F. Supek for useful comments on the manuscript. The results are from CvM's master thesis project, carried out at the University of Hagen without extra funding; we utilized the compute cluster of the University of Regensburg.
Authors and Affiliations
Faculty of Mathematics and Computer Science, University of Hagen, D-58084, Hagen, Germany
Conrad von Mandach
Institute of Biophysics and Physical Biochemistry, University of Regensburg, D-93040, Regensburg, Germany
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
von Mandach, C., Merkl, R. Genes optimized by evolution for accurate and fast translation encode in Archaea and Bacteria a broad and characteristic spectrum of protein functions.
BMC Genomics 11, 617 (2010). https://doi.org/10.1186/1471-2164-11-617