Quantification of codon selection for comparative bacterial genomics
© Retchless and Lawrence; licensee BioMed Central Ltd. 2011
Received: 23 March 2011
Accepted: 25 July 2011
Published: 25 July 2011
Skip to main content
© Retchless and Lawrence; licensee BioMed Central Ltd. 2011
Received: 23 March 2011
Accepted: 25 July 2011
Published: 25 July 2011
Statistics measuring codon selection seek to compare genes by their sensitivity to selection for translational efficiency, but existing statistics lack a model for testing the significance of differences between genes. Here, we introduce a new statistic for measuring codon selection, the Adaptive Codon Enrichment (ACE).
This statistic represents codon usage bias in terms of a probabilistic distribution, quantifying the extent that preferred codons are over-represented in the gene of interest relative to the mean and variance that would result from stochastic sampling of codons. Expected codon frequencies are derived from the observed codon usage frequencies of a broad set of genes, such that they are likely to reflect nonselective, genome wide influences on codon usage (e.g. mutational biases). The relative adaptiveness of synonymous codons is deduced from the frequency of codon usage in a pre-selected set of genes relative to the expected frequency. The ACE can predict both transcript abundance during rapid growth and the rate of synonymous substitutions, with accuracy comparable to or greater than existing metrics. We further examine how the composition of reference gene sets affects the accuracy of the statistic, and suggest methods for selecting appropriate reference sets for any genome, including bacteriophages. Finally, we demonstrate that the ACE may naturally be extended to quantify the genome-wide influence of codon selection in a manner that is sensitive to a large fraction of codons in the genome. This reveals substantial variation among genomes, correlated with the tRNA gene number, even among groups of bacteria where previously proposed whole-genome measures show little variation.
The statistical framework of the ACE allows rigorous comparison of the level of codon selection acting on genes, both within a genome and between genomes.
It has long been recognized that protein-coding sequences show nonrandom, organism-specific patterns of codon usage . Codon usage bias is most pronounced in highly expressed genes , where codon preferences are associated with the tRNA abundance within the cytoplasm . Measurement of codon selection is of interest because the extent to which different genes use the preferred codons is predictive of their expression levels and rates of evolutionary change [4–6], and thus their relative importance (in terms of transcript abundance and degree of conservation) to the organism. Comparative studies of codon selection have provided insight into the population structure and lifestyle of organisms [7–14]. Numerous statistics have been devised to measure variation in codon selection among Open Reading Frames (ORFs) within a genome, yet none fully account for the evolutionary dynamics that shape codon usage bias, including compositional differences among genes and genomes. The simplest metrics evaluate how much codon usage frequencies of a gene deviate from expected frequencies. These methods, such as the Effective Numbers of Codons (ENC) and the ENC' [15, 16], incorporate no information about the fitness differences among synonymous codons. This limitation has been addressed by Karlin and Mrazek  and Supek and Vlahovicek [17, 18], whose algorithms simultaneously compare each gene's codon usage both to genome-wide codon frequencies (representing mutational tendencies) and to codon frequencies in a defined set of genes believed to experience strong codon selection. However, this design has been criticized for failing to assign the most extreme values to the genes with the most extreme biases in terms of preferred or non-preferred codons . This artifact results from the maximum possible value being assigned to genes with codon composition identical to pre-selected set of "optimized" genes, even though other genes may show more extreme enrichment of the optimal codons.
This irregularity is absent from statistics that are proportional to the frequency at which preferred codons occur within an ORF. At their simplest, these statistics summarize the optimal codon frequency for each amino acid (e.g. Fop  and CBI ) while more complicated methods construct a scoring table for all codons, quantifying the relative importance of non-optimal codons and weighting the statistic so that it is influenced more by those amino acids for which the synonymous codons have a greater perceived fitness difference (e.g. CAI , tAI , GCB ). One method for normalizing across amino acids is to compare the score of the observed codons against the maximum possible score for an ORF with the same amino acid composition (e.g. CAI, tAI), producing a uniform maximum score for all ORFs regardless of amino acid composition. However, this does not account for the fact that the probability of observing the optimal codon will vary according to amino acid composition, and the values assigned to non-optimal codons can vary greatly among amino acids . Despite the power of these methods for detecting codon selection, none of them quantify the stochastic variation that is expected to arise from mutation-selection balance, which is the primary explanation for the occurrence of non-optimal codons [25, 26]. The selection-mutation-drift theory of synonymous codon usage describes an equilibrium condition where preferred and non-preferred codons occur in proportions determined by mutational biases, selection, and effective population size. Recent studies have calculated the parameters of this model explicitly [8, 24, 27], but only include codons for two-fold degenerate amino acids, limiting the information available to make inferences about individual genes. To date, no analytical method accounts for the variation in the codon usage statistic that arises from the stochastic nature of the selection-mutation-drift model.
Here, we expand upon the scoring-table class of methods by introducing a new statistic that incorporates a table of expected codon frequencies, which amounts to a null hypothesis for codon usage. We present a stochastic model of codon usage, thereby allowing ORFs to be evaluated in terms of their deviation from an expected codon composition. This not only allows us to measure the impact of selection against the background of genome-wide biases, but to normalize the values assigned to non-preferred codons of different amino acids so that amino acid composition does not affect the score under the null model. We also examine different algorithms for systematically assessing codon frequencies - either in the presence or absence of selection - using only the genome sequence of the organism being examined. By deriving the expected distributions of the statistic under a null hypothesis about codon frequencies, our statistical framework provides a means to compare the strength of codon selection within and between genomes.
Below, we describe a statistic for summarizing the codon usage of an ORF. The raw statistic is the sum of values assigned to each of the codons in the sequence and may be normalized according to its expected distribution. Normalized scores for individual genes can be combined to summarize the magnitude of codon selection operating on the entire genome. We compare our measure to previously described codon usage statistics, both conceptually and empirically.
where C ij is the count of that codon within the gene being analyzed and N i is the number of synonyms for its encoded residue. Merkl proposed a similar statistic, the GCB (where his constituent CB is equal to our δ) arguing that this form of statistic is optimal for distinguishing between two populations. Here, we use the sum because it has convenient properties in a stochastic model, described below, which we will use to normalize this continuous statistic. Notably, the SCB expresses codon usage bias as a function of the difference (δ) between unselected (f n ) and selected (f o ) codon frequencies, rather than as a distance from them, thus avoiding shortcomings of other metrics .
The SCB is related to other scoring-table statistics by different normalization routines. Merkl's GCB  is the length-normalized form of the SCB. The logarithm of the CAI  can be derived from the SCB by calculating δ ij with a non-optimized table (f n ) showing no bias among synonymous codons, then calculating the difference between SCB and the maximum possible value given its amino acid composition, and finally dividing by the number of codons in the ORF, ignoring methionine and tryptophan.
Crucially, scoring tables created from δ ij reveal which codons increase in frequency among the most optimized proteins, and to what degree. This is different from the Relative Synonymous Codon Usage values that are used to calculate the CAI , which reflect simply the abundance of codons in optimized genes without reference to their abundance in non-optimized genes. Codons with greatest abundance in optimized genes may not have experienced the strongest selection for enrichment and, in the worst cases, may actually be disfavored. This adjustment to the estimate of codon adaptiveness should have the greatest effect in genomes where nucleotide composition shows the greatest deviation from equal usage.
Normalized Synonymous Codon Usage as a function of alternative codon scoring tables.
Escherichia coli MG1865
Bacillus subtilis 168
Pseudomonas putida KT2440
f n 1
f o 2
Rigorous interpretation of any codon bias statistic depends upon knowledge of its distribution given expected synonymous codon usage frequencies. Issues as simple as discerning if one ORF is more enriched for optimal codons than another cannot be resolved unless we know the values that are expected to arise from ORFs that vary in amino acid composition but not synonymous codon frequencies. Likewise, unless the variance of the summary statistic is known, variation between genes cannot be inferred to result from differences in the strength of selection between those genes rather than being due to the stochastic nature of mutation and drift.
Because amino acids differ in their sensitivity to codon selection, they each contribute different amounts of variance to the final score, so normalization takes into account the variance contributed by each amino acid rather than simply dividing by the length of the encoded protein. The equation for ACE u is equivalent to the average of the Z-value for each individual codon. Notably, the ACE is indifferent to the inclusion or exclusion of methionine and tryptophan codons because, having only single codons, they influence the observed and expected values identically and thus contribute no variability. This is in contrast to statistics that are sensitive to the frequency with which the most preferred codon occurs, such as the CAI, where methionine and tryptophan are explicitly ignored .
To validate that ACE statistics can be treated as random normal variables, we used Monte Carlo simulations to examine the properties of genes for which the SCB fit this assumption. Distributions were constructed from 2000 Monte Carlo samples for each ORF of E. coli and P. putida, using the expected codon distribution of the respective genome. The predicted mean and variance were universally accurate, while deviations from normality were only detectable within the GC-biased P. putida genome. D'Agostino's K-squared test  identified an excess of genes having non-normal SCB null distributions (P < 0.05 for 340 of 5350 ORFs; 6.3%), although the skewness and kurtosis values were universally small (-1 × 10-3 to 8 × 10-4 and -6 × 10-4 to 6 × 10-4, respectively) and the worst approximations were concentrated among genes with less than 100 degenerate codons (67 of 503 small ORFs being non-normal at P < 0.05).
Using existing gene expression data, we examined the predictive power of several codon selection statistics and their robustness in the face of uncertainty regarding optimal parameterization. Here we considered those methods that rely on information about the frequency with which each codon is used within a set of ORFs optimized for translation (f o ). A robust method will generate a consistently high level of performance when the f o table is constructed with any set of ORFs for which the codon usage has been biased by codon selection. We selected six datasets of transcript abundance data for evaluation: E. coli , Pseudomonas aeruginosa , Bacteroides thetaiotaomicron , Bacillus anthracis , Saccharomyces cerevisiae , and Schizosaccharomyces pombe . These include both eukaryotes and bacteria from three phyla, with genomic nucleotide compositions ranging from strongly AT-biased to strongly GC-biased.
For each dataset, we examined the correlation of the transcript abundance data relative to each codon optimization statistic (CAI , GCB , ACEu [this study], Karlin's E , and MELP [17, 18]) when the codon statistic was calibrated against the most abundant transcripts from the same dataset. Here, our intention is not to actually predict the transcript abundance data, but to evaluate the behavior of each method under optimal conditions. By calibrating with the dataset that the statistics are tested against, we avoid arbitrary decisions in parameterization that may inadvertently favor one method over another. To examine how each statistic responds to decreased precision in identifying the optimal genes, the number of genes contributing codons to f o was gradually increased, 20 at a time, until it included half of all genes (far more than would be used to construct f o for typical analyses). For the statistics that require an estimate of codon usage in the absence of codon selection (f n ), we used the codon composition of the entire genome.
The ability of the ACEu to predict gene expression levels in P. aeruginosa with such high accuracy (R = 0.65, 5543 genes, using the 100 most highly expressed genes to construct f o ) is surprising in light of previous studies suggesting that there is little codon selection acting in this genome . Grocock and Sharp  suggested that codon variation in P. aeruginosa was primarily due to the presence of genes with atypical nucleotide composition (presumably recently acquired), with a secondary trend due to codon selection. Recently acquired genes tend to be expressed weakly during growth in rich media, so that even in the absence of codon selection, a statistic that simply discriminated between native and foreign genes would be expected to correlate with expression levels. We tested whether this factor contributed to the high correlation by limiting the analyses to the 1678 genes that are likely to be native to P. aeruginosa because orthologs were detected in each of four other diverse Pseudomonas species: P. mendocina, P. stutzeri, P. entomophila, and P. putida (mean dS > 1.25 for each of the 10 pairs, where dS is synonymous divergence estimated by the method of ). For the 1677 genes in this set that also had transcript abundance values, the correlation coefficient actually increased to R = 0.75 using the same f o , indicating that most of this correlation is indeed due to codon selection.
Correlation of codon statistics with rates of sequence divergence.
Pearson correlation of codon statistic with dS
Reference Genome 1
Mean dS 2
B. subtilis 168
B. subtilis W23
P. aeruginosa PA01
P. aeruginosa PA7
E. coli K12
E. coli K12
B. subtilis W23
P. aeruginosa PA01
ACE statistics rely on an expectation of the codon frequencies that would be observed in the absence of codon selection. Other methods for quantifying codon selection share this requirement, and these frequencies are often estimated from the codon composition of the entire genome under the premise that the majority of genes experience little codon selection. Yet genome-wide codon usage tables will be influenced both by genes experiencing strong codon selection and by genes recently introduced by lateral transfer whose codon usage patterns do not reflect the mutational history of their current genome. Eliminating both of these gene sets from this reference table should produce better predictions of gene expression data from codon frequency data. To exclude these classes of genes, we removed compositionally atypical genes, identified as those with dinucleotide or codon usage patterns that were maximally different from genome-wide averages [40, 41]; as expected, this process excluded genes with extreme CAI values or atypical GC compositions at third codon positions (Additional file 2, Figure S2). Using systematically smaller subsets of E. coli genes to estimate f n , we saw improvement in the correlation between mRNA expression levels and ACE u , MELP, GCB and E (Additional file 3, Figure S3). The optimal reference table was reached when the most atypical ~30% of genes were excluded; additional reduction in the size of the set did not result in significant improvement.
For genomes lacking expression data for calibration, we developed an algorithm to identify a reasonable set of typical genes. Based on the assumption that removal of the most extreme 1% of genes produces more accurate δvalues, the algorithm continues to decrease the gene set by 1% increments as long as a significant majority of codons' δ values shift in the same direction as initially observed (P < 0.05; binomial test with expectation of 0.5). For E. coli, this resulted in a reference table constructed from 77% of the genes (Additional file 4, Figure S4), which is among the largest sets of compositionally typical, native genes that produced stronger correlations to expression data (Additional file 3, Figure S3). Therefore, this method provides a robust approach to selecting a less-biased and less noisy set of genes to approximate codon usage patterns produced by genome-wide processes alone.
The ACE, like other methods, compares each gene's codon usage to the codon usage of a reference set of genes believed to have experienced strong codon selection (f o above). This set can be assembled by choosing genes that are known empirically to be highly expressed during rapid growth. However, these data are both biased to the laboratory conditions under which the organism is cultured and unavailable for many organisms. To eliminate these constraints, we used a two-step method to create this reference set from genomic data alone. To create an initial f o set, we selected a set of genes that could reasonably be inferred to have experienced codon selection; we examined such criteria as strong tAI , high χ2 of codon usage , high values of the P2 metric , low values of ENC  or ENC' , atypical codon composition , homology to genes encoding the translation apparatus (i.e. Translation40), or strong conservation of amino-acid sequence in one or more genomes. Second, we iteratively selected an optimized gene set for each genome. Genes with the highest ACE u values were selected to create the f o table for the next round of the iteration. The processes began by selecting the most biased 40% of genes and reduced this set over 15 iterations until the final table size (10000 codons) was reached and the f o tables stabilized; this approach is similar to those used elsewhere, but based on different statistics [23, 44].
Throughout this iteration process, codon scores are adjusted for the genome-wide tendencies, so the iteration algorithm identifies those genes that most exemplify the broad trend revealed when the initial parameterization set was compared to the whole genome set. Consequently, selection of the initial set is of utmost importance. Initial data sets, each generated using a different criterion, led to identical or nearly-identical reference sets after iteration for the 30 genomes that we examined (Additional file 5, Table S1). The most robust results came from initial reference sets comprised of genes encoding ribosomal proteins (as found elsewhere ), or genes which were most strongly conserved in the largest number of target genomes as determined by BLAST analysis (Additional file 5, Table S1); in both cases, biologically plausible reference sets - as determined by the genes' likely functions in the cell - were reached in >95% of genomes tested. In addition, the other methods, especially the tAI and P2 metrics, also converged on this same set of genes in most cases (Additional file 5, Table S1). The common iteration endpoint reached by multiple initial gene sets lends confidence that the final, iterated f o table is accurately reporting codon selection. As expected, the iterated f o table was similar to the Translation40 set of translation genes in most bacterial genomes.
Our ability to reach this endpoint without specifying particular genes sets (e.g., ribosomal proteins) allows the method to be extended to genomes of bacteriophages and other entities wherein highly-expressed genes are more difficult to identify a priori. For example, a similar analysis of the bacteriophage λ genome identified genes encoding structural proteins as those under strongest codon selection, and dispensable genes of the Nin region as those under the least selection (Additional file 6, Table S2). We examined the optimal size for the f o table by comparing the ACE u obtained with f o tables containing different numbers of codons against the mRNA transcript levels of those genes in E. coli and P. aeruginosa [29, 30, 45]. The optimal table size (i.e. the table generating the highest correlation between ACE u and transcript abundance) was found to be between 5,000 and 10,000 codons (Additional file 7, Figure S5).
The intensity of codon selection varies between genomes and several approaches have been implemented to measure these differences [8, 10, 13, 22, 46, 47]. These studies have found that codon selection -- along with the number of tRNA and rRNA genes -- increases in bacteria with faster growth rates, suggesting that codon adaptation is one of several genomic structures that minimize generation time under optimal growth conditions [9, 24].
In the absence of codon selection, values should approach 1.0, where the codon usage of each gene is a random sample . The Monte Carlo simulations described above confirmed that when all ORFs share the same codon composition, the ACEz distribution for the genome has a mean of zero and a variance of one, resulting in a normalized χ2 of 1.0.
To validate the behavior of the ACEχ2 on real genomes, we examined two genomes (P. aeruginosa and E. coli) that are known to exhibit substantial codon selection, and one (Buchnera aphidicola) that is believed to experience negligible codon selection . P. aeruginosa is of special interest because Grocock and Sharp  demonstrated that highly expressed genes exhibit distinctive codon usage in this genome, but Sharp et al.'s  attempt to estimate the strength of codon selection on 40 translational proteins revealed no selection (S = -0.019). This was attributed to the fact that S is based on the codons for only four amino acids, which did not include codons that were enriched in the highly expressed genes of P. aeruginosa . Because ACE incorporates information from all synonymous codons, this limitation should be avoided.
ACEχ2 of three genomes calculated for different sets of genes.
Genus Core Genes
Enteric Core Genes
Correlation coefficients of genome-wide codon selection measures with the number of tRNA genes in each genome.
Pearson Correlation Coefficient of tRNA Gene Number and Genome-wide Codon Selection Statistica
Number of Genomes
Number of Orthologues
ACEχ 2 core b
ACEχ 2 40 b
0.628 (p = 0.008)c
To evaluate the Enterobacteriaceae without the influence of endosymbionts, we recalculated ACEχ2 without the four endosymbionts mentioned above, but added three more genomes from free-living bacteria (Additional file 8, Table S3). Using the 1060 genes shared among these 14 genomes, we still detected a substantial correlation between tRNA gene number and ACEχ2 (R2 = 0.34; p = 0.01, one tailed test using Fisher's z-transformed correlation; Figure 3A). Noticeably, the relationship of ACEχ2 among genomes is robust to the set of genes used to assign the f n frequencies; for the 11 genomes that were included in both analyses, the correlation of their values was 0.96, despite being calculated with 201 vs. 1060 genes. A significant correlation between ACEχ2 and tRNA count is also seen for the Bacilliaceae (R2 = 0.31; 12 genomes; p = 0.03), while the correlation in the Mycobacteriaceae (R2 = 0.16; 9 genomes; p = 0.15) does not reach the standard significance cutoff of (p = 0.05) due to both a smaller correlation and smaller sample size (Table 4, Additional file 10 Table S5).
In contrast to the ACEχ2, other measures of codon selection detected little variation among genomes of non-endosymbiotic bacteria within the same family and none indicated a significant correlation with tRNA gene count (Table 4). We examined Dethlefsens and Schmidt's ΔN'c , Rocha's ENCdiff , and Sharp's S , but excluded von Mandachs and Merkl's GCBeff because it cannot be meaningfully calculated for many genomes. The Enterobacteriaceae contain the greatest number of sequenced genomes with a substantial amount of synonymous divergence among them (mean dS > 1.0 for all pairwise comparisons), so we focused on them for further investigation of the differences among the statistics (Figure 3B). A fundamental difference between the ACEχ2 and other statistics is that ACEχ2 examines the codon usage variation for a large portion of the genome (the core of shared genes in these examples), whereas the previously described methods examine how the codon usage frequencies of a specified subset of genes differs from the genome-wide average [8, 10, 46].
To test if this focus on the pre-selected set of optimized genes accounted for the differences between the statistics, we calculated ACEχ2 for just the 40 genes that were used for the f o table, while continuing to use the 1060 shared genes for f n . This statistic (ACEχ2 40) had only a moderate correlation to the ACEχ2 for the shared genes (Additional file 10, Table S5), indicating that focusing on this smaller set of genes can substantially distort estimates of genome-wide codon selection. However, this is not the only explanation for the difference between the ACEχ2 and other statistics, as the ACEχ2 40 has an even weaker correlation to ENCdiff and S, while being strongly correlated to ΔN'c (Additional file 10, Table S5).
The ACE measures the effect of codon selection on codon usage, which is a slightly different concept than the magnitude of selection (s) described in population genetic theory. We have taken care to remove the influence of amino-acid composition from the ACE to provide a better prediction of physiological parameters such as gene expression levels. In contrast, an estimate of s should be sensitive to the amino acid composition, and a direct estimate of codon selection will likely provide better estimates of population diversity parameters such as the patterns of polymorphism . Moreover, the ACE is a linear function of codon frequency; for an amino acid encoded by two codons, the contribution to ACE is directly proportional to the frequency of the preferred codon (P). In contrast, selection is a non-linear function of P (i.e. Nes = log[(kP)/(1-P)] where k represents the mutational balance), according to Sharp et al. .
The ACE uses an estimate of the codon composition specified as arising from genome-wide processes alone (e.g. mutation). We constructed a single table to reflect these codon frequencies, implicitly assuming that a uniform process is acting upon all genes in the genome. This assumption of mutational uniformity is less valid in some eukaryotic genomes that harbor isochores, but it is reasonable for bacteria once recently introduced genes are excluded. Slight violations of this assumption arise from subtle strand variation and origin-to-terminus gradients . However, this variability does not generally affect calculation of ACE values; for example, while replication of Firmicutes involves distinct forms DNA polymerase III on each strand, leading to strong strand bias , the correlation between ACE values and dS does not improve when separate f n tables are created from leading and lagging strand genes (data not shown).
One final concern is that codons are not necessarily independent of their neighbors, or that synonymous sites may be constrained by functional demands aside from codon optimization for efficient translation. Among these constraints are determinants of chromosome architecture , mRNA structure , avoidance of ribosome-binding sites  or homopolymeric tracts , or even selection for the more slowly translated codon due to the kinetics of and protein synthesis .
We modeled the stochastic distribution of the ACE as though each gene had a constant amino acid composition and each amino acid could be encoded by any of its cognate codons with a probability given by genome-wide substitution parameters. Of course, amino acids will vary stochastically in a constant regime of mutation and selection, and modeling such variation may increase the expected variance of the ACE, though the normalization across amino acids should minimize any variance introduced by amino acid substitutions. Regardless of that correction, amino acid composition can only crudely be modeled as a simple random variable because selective pressures acting on amino acid substitutions clearly are not uniform across the length of the protein.
Selection acting on synonymous substitutions varies among sites within ORFs [38, 58–62]. The ACE is robust to this complication layered on top of the mutation-selection-drift model, and can be interpreted as being proportional to the number of sites under strong selection for use of the globally preferred codon. Such variation in the strength of selection among sites would reduce the stochastic variance in the ACE and other codon bias statistics.
The construction of two different codon frequency tables (f o and f n ) allows us to separate the genome-wide influences on codon usage from codon selection, which has the greatest effect on highly expressed genes, causing f o to deviate from f n . The use of f n to normalize the iteration process avoids identifying a set of genes that represent the "dominating codon bias" , instead identifying a set representing codon selection . The accuracy of f n has an important role in any analysis of codon selection. We demonstrated a method for identifying a set of genes that best represents the patterns of codon usage that would exist in the absence of codon selection. The codon usage in such genes is generally assumed to reflect mutational biases, but they may in fact be influenced by genome-wide selection for nucleotide composition or biased gene conversion [63–65]. Such complications would not affect the suitability of these genes to represent codon usage in the presence of minimal codon selection.
The statistical framework of the ACE facilitates comparisons among genes within a genome. First, by accounting for the codon frequencies that are expected to be observed in the absence of codon selection, the ACE avoids spurious differences that can result from variation in amino acid composition among genes. A more fundamental difference between ACE and other statistics in this context is that the stochastic variation (i.e. sampling error) in ACE is approximately normally distributed, and its variance can be calculated despite each amino acid contributing a different amount of variation. For a given gene, the error variance of the ACE can be estimated as the variance that would occur in the SCB (Equation 7) when the expected codon frequencies are equal to the observed codon frequencies in that gene. The error variance of the ACE u is then the error variance of the ACE divided by the square of the denominator in Equation 10.
Having an estimate of the error variance, and knowing that the variance is approximately normally distributed, we are able to compare genes using the t-test for two independent samples . This provides a statistical test for whether genes experience different degrees of codon selection. For example, the ACE u values for the independently transcribed methionine biosynthetic genes in E. coli range from -0.094 to -0.018 for the metABC genes to 0.10-0.14 for the metEH genes. While the ACE u of the metABC genes are not significantly different from each other (P > 0.2), all three are significantly different from those of the metE or metH genes (P < 0.01). This difference is not surprising as the metABC genes are only expressed during methionine starvation whereas the metEH genes also function to recycle S-adenosyl-homocysteine, a function that is required even during periods of methionine excess. Therefore, the significant difference in ACE u values supports the hypothesis that the metEH genes would be expressed under a larger number of growth conditions and, as a result, experience greater codon selection.
The ACEχ2 is fundamentally different from previous attempts to quantify variation in the strength of codon selection between genomes. Three recently proposed methods have focused on a small fraction of the ORFs in each genome (e.g. ribosomal proteins) and used the deviation of their codon usage from the genome-wide average as an estimate of the efficacy of selection in each genome [8, 10, 46]. They interpret the strength of selection on a particular subset of ORFs as being representative of, or proportional to, the strength of selection acting on all ORFs in the genome. In contrast, ACEχ2 can be calculated from all genes believed to be long-term residents of the genome. This greater inclusiveness may account for the fact that the ACEχ2 generally correlates more strongly with the tRNA gene number than the other measures. The biological basis for the correlation between tRNA gene number and measures of codon selection remains unclear [66–68], and the ability of the ACEχ2 to quantify codon selection across large sets of genes may facilitate investigations of this relationship.
A strength of the ACE framework is its null model, which allows rigorous statistical tests to be applied to ACE z , ACE u and ACEχ2. This framework can be extended to other metrics. For example, the CAI has inspired other measures of codon usage bias, such as the tAI  and the eAI . These statistics rely on scoring table values (i.e., δij) that are derived from theories of how selection acts on the translational process, rather than being inferred from observed gene sequences. Despite this difference, these statistics are still amenable to the statistical tools developed for ACE, which may provide greater precision to the estimates of codon selection when investigating the molecular nature of codon selection (e.g. ).
We have presented a statistical framework for the interpretation of codon usage biases in microbial genomes, both within and between genomes. The proposed summary statistic for quantifying variation within a genome incorporates the strengths that were previously only found in separate statistics, furthermore this work incorporates an analytical description of the sampling variance for the statistic. The methods presented here can also be applied to genomes for which we do not have prior information about gene expression and codon selection.
Pre-selected sets of highly expressed were taken from previous literature. The set of 40 ribosomal proteins and translation elongation factors (Translation40, ) included the genes tufA, tsf, fusA, rplA-rplF, rplI-rplT and rpsB-rpsT. The codon count for each gene ignores the start codon. The values of f o are the count of each codon divided by the total count of codons for the same amino acid. If any codon is absent in the set of highly expressed gene, it is assigned a count of 0.5.
All genome sequences were downloaded from the NCBI; genes were extracted from the primary (i.e. largest) replicon using the annotations provided by the RefSeq project. For genomes mentioned in the text, accession numbers appear in Additional file 9, Table S4.
Counts of tRNA genes can vary substantially among closely related genomes, so an average value was estimated for each species. Counts from each genome were made from the list of structural RNAs between 60 and 100 bp long, excluding the Sec tRNA. A species average was calculated using weights proportional to branch lengths on a tree constructed with MrBayes using 16S rRNA genes. The resulting values are close to the unweighted average of all genomes from that species in the NCBI database.
Annotated open reading frames were translated and used as BLASTP queries to search databases composed of ORFs from each of the other genomes (e < 1) followed by semiglobal alignment. Sets of putative orthologs were assembled from those ORFs where each was a reciprocal best match with the others. For each analysis, a minimum amino acid similarity was enforced, with decreasing stringency for groups bearing more-divergent taxa: Escherichia 85%; Pseudomonas, Lactococcus, Mycobacterium, Staphylococcus 70%; Enterobacteriaceae, Bacilliaceae 60%.
Slight modifications were made to the described codon statistics to make them comparable with each other. The GCB, like the other statistics, was calculated without consideration of the stop codon. For the calculation of Pearson's correlation coefficient, a logarithmic transformation was applied to the transcript abundance, E , MELP [17, 18], RF P-value  and CAI . This generally increased the correlation between the transcript abundance and the codon bias statistics. Furthermore, it made the statistics conceptually comparable because the GCB  and ACEu are intrinsically calculated with logarithms. Spearman (rank) correlation coefficients were typically weaker and were not used.
All analyses were performed with DNA Master, available at http://cobamide2.bio.pitt.edu, except where other software packages are explicitly mentioned in the text.
This research was supported by NIH grant GM078092 to JGL and a Mellon Fellowship to ACR.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.