Quantitative analysis of mutation and selection pressures on base composition skews in bacterial chromosomes
© Chen and Chen. 2007
Received: 24 April 2007
Accepted: 21 August 2007
Published: 21 August 2007
Skip to main content
© Chen and Chen. 2007
Received: 24 April 2007
Accepted: 21 August 2007
Published: 21 August 2007
Most bacterial chromosomes exhibit asymmetry of base composition with respect to leading vs. lagging strands (GC and AT skews). These skews reflect mainly those in protein coding sequences, which are driven by asymmetric mutation pressures during replication and transcription (notably asymmetric cytosine deamination) plus subsequent selection for preferred structures, signals, amino acid or codons. The transcription-associated effects but not the replication-associated effects contribute to the overall skews through the uneven distribution of the coding sequences on the leading and lagging strands.
Analysis of 185 representative bacterial chromosomes showed diverse and characteristic patterns of skews among different clades. The base composition skews in the coding sequences were used to derive quantitatively the effect of replication-driven mutation plus subsequent selection ('replication-associated pressure', RAP), and the effect of transcription-driven mutation plus subsequent selection at translation level ('transcription-associate pressure', TAP). While different clades exhibit distinct patterns of RAP and TAP, RAP is absent or nearly absent in some bacteria, but TAP is present in all. The selection pressure at the translation level is evident in all bacteria based on the analysis of the skews at the three codon positions. Contribution of asymmetric cytosine deamination was found to be weak to TAP in most phyla, and strong to RAP in all the Proteobacteria but weak in most of the Firmicutes. This possibly reflects the differences in their chromosomal replication machineries. A strong negative correlation between TAP and G+C content and between TAP and chromosomal size were also revealed.
The study reveals the diverse mutation and selection forces associated with replication and transcription in various groups of bacteria that shape the distinct patterns of base composition skews in the chromosomes during evolution. Some closely relative species with distinct base composition parameters are uncovered in this study, which also provides opportunities for comparative bioinformatic and genetic investigations to uncover the underlying principles for mutation and selection.
A genome contains coding information that specifies protein and RNA sequences and structural information that specifies local DNA conformation involved in interactions with proteins. On top of these is the subtle global tendency of a genome to move toward a preferred nucleotide composition and distribution that are characteristic for each clade. Most notable is the G+C content, which vary widely (between 25% to 72%) among the prokaryotes. The preferred G+C content is conserved among closely related species, as are the relative abundance of dinucleotides, trinucleotides, and tetranucleotides [for review, [1–4]].
In addition, in most bacterial chromosomes, mononucleotides exhibit a biased distribution between the two replicating (leading vs. lagging) strands. GC skew, as expressed by (G-C)/(G+C), and AT skew, expressed by (A-T)/(A+T), of bacterial chromosomes were first noticed by Lobry  in Escherichia coli, Bacillus subtilis, and Haemophillus influenzae, and later by Mrazek and Karlin  in Mycoplasma genitalium, Mycoplasma pneumoniae, and Helicobacter pylori. It was noticed that GC skews (and, to lesser extent, AT skews) exhibit a striking sign switch at the replication origin (oriC) and another one at the termination region in many bacterial chromosomes. From an analysis of a limited number (9 to 36) of bacterial chromosomes, it has been proposed that there is an overall excess of purines ('purine excess') or keto bases G and T ('keto excess') in the protein coding sequences (CDS) [7–9]. These base composition skews have been recently reviewed [10–12].
The composition skew may be extended to include a number of oligomer sequences, which are known to be or are likely to be implicated in replication, recombination, and/or repair process of genomes [13, 14]. A classical example is the octameric Chi sequence (CGTGGTGG) in E. coli, which serves as a signal for recombinational repair of double strand breaks, and is important to the rescuing of broken replication forks [ for a concise review]. Another example is the Rag motif (RGNAGGGS) in the E. coli chromosome, the skew of which shift abruptly at the terminus of replication [13, 16]. Chi and Rag motifs together account for about 7% of the global GC skew of the E. coli chromosome .
Base composition skews are shaped by asymmetric accumulation of specific mutations, which are determined at two levels, namely strand-biased mutation and subsequent selection [reviewed in [11, 17]]. These strand-biased mutation forces may be further classified into two basic categories: replication-driven mutation and transcription-driven mutation. Several mechanisms of replication-driven mutation have been proposed based on the asymmetrical structures of the replication forks [reviewed in ], including higher abundance of single-stranded gaps and nicks on the lagging strands that are prone to mismatch repair and cytosine deamination (leading to C-T transition) [10, 18] and asymmetrical enzyme machineries that replicate the leading and lagging strands. Transcription-driven mutation has been proposed to include mutations associated with exposed non-transcribed strands during transcription and transcription-coupled repair . The non-coding sequence (non-CDS) is under replication-related mutation pressure, and free from selection at the translation level. However, the transcribed non-CDS (upstream or downstream from the CDS) is still under transcription-driven mutation. The CDS, on the other hand, is affected by replication-driven mutation and transcription-driven mutation plus selection at the translation level.
These mutations undergo various kinds of selection, including the shaping of the signal sequences on the chromosomes  (see above). A universal and powerful selection is at the translation level, in which adverse mutations are eliminated or selected against. In addition, codon usage and amino acid usage preferences in combination also select optimal mutations at this level. The facts that codons usage in bacteria shows a preference for G over C (a translational selection) and that more genes (up to about 80% in some Gram-positive bacteria) are located on the leading strands than on the lagging strands of most bacterial chromosomes  automatically lead to G excess in the leading strands [21, 22]. Moreover, selection pressure at the translation level may also produce biases in the usage of nucleotides, codons, and amino acids [23–27]. It has been noted that orthologs on the leading strands show lower rates of divergence than those on the lagging strands among various bacteria; this is a reflection of lower mutation pressure on the leading strand . In many cases, these strand biases are considerable, and may be used to predict the replicating strand location of particular CDS with surprising accuracy .
The effect of mutations and subsequent selections on base composition skews cannot be readily separated in analysis. In general, the combined effect of replication-driven mutation plus subsequent selection is treated collectively as 'replication-associated pressure' (RAP), and the combined effect of transcription-driven mutation plus subsequent selection at the translation level as 'transcription-associate pressure' (TAP). While RAP is directly reflected in the overall skew, the effect of TAP depends on relative distribution of the CDS on the two replicating strands. If CDS are equally distributed between the leading and lagging strands, the effect of TAP on base composition skews is nil, and if CDS are present exclusively on one replicating strand, the TAP effect is total.
The TAP effect exerted on CDS on either replicating strand is equal, whereas the RAP effect has an opposite directionality on CDS on two replicating strands. Thus, the base composition skews of CDS on the two replicating strands may be used to extract the effects of RAP and TAP. This general principle has been applied by Lobry and Sueoka  to detect and assess RAP and TAP in 43 bacterial chromosomes using a graphic approach. These graphically deduced RAP and TAP were for GC and AT skews combined together. It was concluded that these two forces were most evident in the weakly selected third codon position and in intergenic regions. The authors noted that the directions of the two effects are almost universal (with some exceptions), resulting in G and T excess in the leading strands, which was compatible with the hypothesis of excess of cytosine deamination in the single-stranded state during DNA replication . In fact the authors modeled their analysis based on C-T transitions and attributed any non-conformity to the effect of TAP.
In this study, using the same general principle but with a more comprehensive mathematical approach, we evaluated the RAP and TAP for GC skews and AT skews in 185 bacterial chromosomes from 11 phyla. The results show diverse and distinct RAP and TAP patterns among different families of the bacterial chromosomes, and each GC and AT skew-shaping force may be very different. While all the chromosomes are under significant TAP, a portion of them is under no or little RAP. Some bacteria (e.g., Firmicutes and proteobacteria) exhibit high RAP and high TAP, some (e.g., Chlamydiae) exhibit only significant RAP and little or no TAP, and a few (e.g., Cyanobacteria) exhibit none of either. Analysis of the RAP and TAP shows that the cytosine deamination may be important for RAP in some bacteria such as proteobacteria, but not in TAP. Instead, there appears to be significant involvement of transversion in the generation of base composition skews.
Our study shows that chromosomes that exhibit high base composition skews generally possess high TAP and RAP. Moreover, the trends and magnitudes of the skews can be correlated to the size and G+C contents of the chromosomes. This is in line with the notion that the base composition skews and their underlying mechanisms are important to the shaping of the bacterial chromosomes during evolution.
The overall base composition skews with respect to leading strands vs. lagging strands over the whole bacterial chromosome are designated χG (for GC skew) and χA (for AT skew). χG is defined as the total number of G minus the total number of C divided by the total number of G and C on the leading strands, and χA is defined as the total number of A minus the total number of T divided by the total number of A and T on the leading strands.
In order to assign the leading and lagging strands, the replication origin (oriC) and termination (ter) must be defined. oriC of only a few bacterial chromosomes has been experimentally determined. For the remaining majority, prediction of oriC has been based on several different parameters. For examples, Worning et al.  predicted the location of oriC using biased distribution of all oligonucleotides up to 8 bp, and Mackiewicz et al.  use three criteria – composition skew, location of dnaA gene, and distribution of DnaA box-like sequences – for oriC prediction. Here we have followed the basic method of Mackiewicz et al.  to predict oriC. Ninety-nine bacterial chromosomes with a predicted oriC were taken from Mackiewicz et al. . From the available complete bacterial sequences, oriC was predicted for another 86 chromosomes. For circular chromosomes, the ter site was assigned to be directly opposite to oriC. For linear chromosomes, the ends are where replication terminates. In total, a total of 185 chromosomes representing 11 phyla [see Additional file 1] were included in this study. Of these bacteria, the largest Phyla are Firmicutes and Proteobacteria, and in many of the analyses, they were further subdivided into Classes. The sizes of the chromosomes ranged from 0.6 to 9.1 Mb (mean 3.4 Mb), and their G+C contents from 24 to 72% (mean 49 %).
χG and χA were calculated from the sequence of the 185 bacterial chromosomes. χG is statistically significant (p < 10-3, χ2 test) for all except 12 bacteria, and χA are significant statistically (p < 10-2, χ2 test) for all except 17 bacteria [see Additional file 1]. These exceptions include four of the five Cyanobacteria tested.
In the plot, related bacterial chromosomes tend to cluster together. For example, the Firmicute chromosomes (green symbols) are essentially all distributed in Quadrant I. Within the Firmicutes, members of the same Class also cluster together. Most other bacterial chromosomes are distributed in Quadrant II. Within Quadrant II, clustering is also seen for proteobacteria (red symbols) and its Classes. This is in accordance with the notion that the base distribution skews are evolutionally conserved.
The χG vs. χA plot also shows a general trend for the absolute values of these two values to increase in proportion (r = 0.72). From the denser central area, the chromosomes diverge in two general directions, one toward simultaneously increasing χG and χA, and the other toward increasing χG but decreasing χA. The former corresponds to 'purine excess' in the leading strand as noted by Freeman et al.  for nine bacterial chromosomes. Most of the chromosomes in this trend lie in Quadrant I, and belong to Firmicutes and also F. nucleatum. Of these, the clostridia chromosomes (green inverted triangles) have the highest χA and χG values. The other trend, in which χG varies in inverse proportion with χA, corresponds to 'keto excess' trend also noted by Freeman et al. . Most of these chromosomes lie mainly in Quadrant II, but a few are in Quadrant IV. That related bacteria have the similar strengths of keto and purine excesses has also been noted by Song et al.  for 36 species examined.
For subsequent analysis, we separate the genome sequences into CDS (protein-coding sequences) and non-CDS (the remaining sequences). CDS constitutes the major portion of the bacterial chromosomes. In the 185 chromosomes investigated, the fractions of CDS range from 50.9% (Sodalis glossinidius) to 95.5% (Candidatus Pelagibacter ubique) with a mean of 86.2%.
CDS is susceptible to both RAP and TAP. Non-CDS is more complicated in that it contains both non-transcribed and transcribed regions (stable RNA genes and transcribed regions upstream and downstream of genes). The non-transcribed part is susceptible to RAP only, and the transcribed part is susceptible to both RAP and TAP (but without translation pressure). Unless transcription maps in non-CDS is available, it is impossible to investigate the components that shape the skews in non-CDS. In contrast, CDSs provide a simpler model for extraction of information regarding the operations of RAP and TAP in this study.
In contrast, χG nc and χA nc deviate noticeably more widely from χG and χA, respectively, for most bacterial chromosomes (Figure 2, open circles). Most (87%) of the χG nc values are higher than the corresponding χG values with a mean difference of 2 × 10-2. In contrast, χA nc is higher than the corresponding χA in only about 37% of the bacteria regardless of their phylogenetic groups. The deviations of skews in the non-CDS and the CDS presumably reflect the difference in the mutation pressures and selection pressures exerted on these sequences, which are expected to be lower in non-CDS.
The base composition skews in the CDS may be used to estimate RAP and TAP under the assumption that the effects of the two forces are independent of each other. This assumption is reasonable, because, considering the relatively low magnitude of the base composition skews, it is very unlikely that any nucleotide position is simultaneously affected by RAP and TAP.
The GC skew in the CDS on the leading strand (designated σG d ) and on the lagging strand (designated σG g ) may be represented, respectively, as:
σG d = σG T + σG R
σG g = σG T - σG R
where σG T and σG R are GC skews shaped by TAP and RAP in CDS, respectively.
From these, σG T and σG R may be derived as:
σG T = (σG d + σG g )/2 (1)
σG R = (σG d - σG g )/2 (2)
Similarly, the AT skews generated by TAP (i.e., σA T ) and RAP (i.e., σA R ) may be derived from AT skews in the CDS on the leading (i.e., σA d ) and lagging strand (i.e., σA g ) as:
σA T = (σA d + σA g )/2 (3)
σA R = (σA d - σA g )/2 (4)
It is noteworthy that σG d - σG g and σA d - σA g correspond to 'ΔGC skew' and 'ΔAT skew', respectively, described by Rocha and Danchin , which are defined as the difference between the average skews of the genes in the leading strand and those in the lagging strand.
Different phyla exhibit distinct patterns of skews in base compositions and CDS (χCDS; see below), and within the same phylum different species tend to exhibit similar patterns. For example, the Firmicute chromosomes (Groups 9–12) have the highest χCDS, χG cd , and χA cd . In contrast, the Cyanobacterial chromosomes (Group 7) have essentially no χCDS, χG cd , or χA cd . Most phyla also display distinct patterns associated with the calculated effects of RAP and TAP on the base composition skews. Most strikingly, the Spirochaete chromosomes (Group 19) have large and approximately equal σG T and σG R values, together with large σA T and σA R values of opposite signs. In contrast, all these values are nearly zero in Actinobacterial chromosomes (Group 2).
Average base composition skews in CDS and RAP and TAP effects in seven phyla of bacteria*
The σA T averages are positive in all the phyla (0.001~0.067) except in Cyanobacteria (-0.011). In contrast, the σA R averages are negative (-0.012~-0.075) in five of the seven major phyla and near zero in the other two (Cyanobacteria and Firmicutes). Therefore, TAP is generally biased toward A excess in all the seven major phyla, while RAP is biased toward T excess in five major phyla and very low or non-existing in the other two.
where CDS d and CDS g are the total lengths of CDS on the leading and lagging strands, respectively, or
χG cd = σG R + χCDS·σG T , (5)
Equations (5) and (6) are based on the assumptions that (i) the TAP effect (σG T and (σA T ) is equal on the leading and lagging strands, and (ii) the RAP and TAP effects are independent. Such assumptions are supported by the results of Tillier and Collins  in their analysis of 12 bacterial species. Moreover, σG cd and χA cd values calculated from (5) and (6) differed from the actual values only slightly. The mean of errors is 3 × 10-4 ± 6 × 10-4 (S. D.) for χG cd , and 2 × 10-4 ± 2 × 10-4 (S. D.) for χA cd .
Equations (5) and (6) bring in the third main factor in determining the base composition skews in the CDS of bacterial chromosomes, χCDS. The χCDS values vary between -0.15 to 0.74 with a mean of 0.23 among the 185 bacterial chromosomes. When χCDS approaches zero (equal distribution of CDS on the replicating strands), the skews are contributed to by RAP only.
The Firmicutes have the highest χCDS (0.11~0.74), which, in combination with moderately high σG R and σA R values, produce the highest χG cd and χA cd among all the bacterial groups (Figure 3). The high χCDS values in the Firmicutes have been noted to be associated with the presence of a polC gene in the genome : Chromosome containing both polC and dnaE have an average χCDS of 0.78, whereas for chromosomes containing only dnaE have an average χCDS of 0.58. The reason for this correlation is not clear.
Contribution of RAP and TAP to the base composition skews in CDS (derived from Figure 4)
Similarly, χA cd is also strongly correlated to σA R (r = 0.85). The linear regression line for χA cd vs. σA R passes near the origin, and has also a slope of 0.60. χA cd is weakly correlated with σA T (r = 0.49), but is strongly correlated with χCDS·σA T (r = 0.72) as expected. The χA cd vs. χCDS·σA T linear regression line also intersect near the origin with a slope of 0.40.
These results indicate that the relative contribution to either χA cd or χG cd by RAP and TAP (after χCDS attenuation) is approximately 60% and 40%, respectively, across the bacterial spectrum. These average values, however, are not good reflections of RAP and TAP in individual species, which may deviate from these average ratios significantly.
Regression analysis of the effect of RAP and TAP on the base composition skews in the non-CDS (Figure 4) shows that χG nc and χA nc are not only strongly correlated with σG R (r = 0.88) and σA R (r = 0.86), respectively, but also moderately correlated to σG T (r = 0.51) and σA T (r = 0.34), respectively, and moderately correlated to χCDS·σG T (r = 0.72) and χCDS·σA T (r = 0.49), respectively. The latter two sets of correlation presumably are due to the presence of transcribed sequences in the set of non-CDS, which would be under transcription-driven pressure, but not translation-driven pressure. The transcribed non-CDS cannot be readily separated from the non-transcribe non-CDS, and therefore the RAP and TAP on the two groups of sequences cannot be separately evaluated from Figure 4.
The RAP and TAP analysis may be used to examine the cytosine deamination model [27, 34, 35], which proposes that base composition skew is generated by preferred deamination of cytosine in single-stranded DNA such as the lagging strands during replication and the sense strands during transcription [18, 36]. If strand-biased cytosine deamination plays a major role in shaping base composition skews during replication and transcription, the result of C to T transition would be reflected as a negative correlation between σG R and σA R and between σG T and σA T , respectively.
A weak positive correlation was seen between weight-adjusted σG T and σA T in the 185 chromosomes (m = 0.42, r = 0.54, p < 0.01; right panel). Examination of individual phyla shows only significant positive correlations (m = 0.97~2.01, r = 0.75~0.94, p ranging from < 0.01 to 0.05) in α-Proteobacteria and ε-Proteobacteria, and Clostridia in Firmicutes. No significant correlation exists in other phyla, which, in some cases, is due to the small sample size. The lack of negative correlation here indicates that the cytosine deamination model does not play a major role in shaping TAP in most if not all bacteria.
The three positions of codons are under different selection pressures at the translation level. The third position is 4-way or 2-way degenerate for most amino acids, and enjoys the largest freedom for synonymous substitutions. The G+C content of a bacterial chromosome is principally shaped by the G+C content at this position [37, 38]. The G+C content of the other two positions correlate only weakly with the G+C content of the chromosome, but still exhibit biased preference for different bases: a significant overrepresentation of G and underrepresentation of T at the first position and overrepresentation of A and T and underrepresentation of G at the second position. In our tabulation of the 185 bacterial chromosomes, frequencies of occurrence are 0.35 and 0.17 of G and T, respectively, at the first position, 0.30, 0.30, and 0.17 of A, T, and G, respectively, at the second position. These biases constitute part of the selection at the level of translation.
In contrast, in the plots of σG T and σA T at the three positions against the overall σG T and σA T , none of the three linear trend lines intersect at the origin (Figure 6, bottom two panels). Even for the bacterial chromosomes that exhibit no or little overall σG T and σA T , their σG T and σA T values at the three codon positions are far from zero. In these chromosomes, TAP is all positive on the first codon position, but is cancelled out by the negative effect on the other two positions. The lack of any all-zero case indicates the presence of considerable TAP in all the bacterial chromosomes.
The σG T correlation plot (bottom left panel) showed the strongest TAP effect at the first and third positions (r = 0.82, 0.82). The strong correlation at the more relaxed third position is expected. The strong and all positive effect at the first codon may be attributed to the selection for overrepresentation (by 60%) of G at that position in all the bacteria. There is essentially no correlation (r = -0.08) at the second position. σG T at the second position is negative regardless of the overall σG T , presumably reflecting the underrepresentation (by 32%) of G at this position.
In the σA T correlation plot (bottom right panel), all three positions exhibit a moderately strong linear correlation (r = 0.55~0.79). The all positive σA T values at the first position presumably reflect the underrepresentation of T (by 32%) at this position (see above).
Examining the skews and other parameters of the 185 chromosomes (Figure 3) revealed a number of exceptional cases that exhibit skew patterns atypical for the particular clade. Interestingly, these atypical skews are accompanied by atypical G+C contents. For example, the six species of Rickettsiales (three Rickettsia species, two Wolbachia species, and Candidatus Pelagibacter ubique; chromosome 101~106) display atypical skews parameters among the α-proteobacteria. They also stand out in this clade as displaying unusually low G+C contents (29.0–35.2% vs. an average of 61.5%). Moreover, two closely related spirochaetes, Treponema denticola and Treponema pallidum (chromosome number 184 and 185) differ greatly in skew parameters as well as G+C contents.
G+C content is strongly and inversely correlated with σA T (r = -0.71) but not with σA R (r = -0.09; p > 0.05). χA cd is only weakly and inversely correlated with G+C content (r = 0.40). G+C content is also loosely correlated with the size of the bacterial chromosomes (r = 0.55; data not shown) as previously noted . Therefore, correlation between the chromosomal size and the skew parameters were also examined (Figure 7B). The analysis shows an insignificant correlation between the chromosomal size and σG R (r = -0.12) and σA R (r = -0.12), but a weak negative correlation between the chromosomal size and σG T (r = -0.33) and σA T (r = -0.35).
There appears to be a general trend toward decreasing magnitudes of the skew parameters with increasing chromosome size. The GC skew parameters converge from positive values toward zero; whereas the AT skew parameters converge from positive and negative values toward zero. This seems to suggest that larger bacterial chromosomes are under lower RAP and TAP.
The present study of the base composition skews reveals widely variable patterns of skews as well as the underlying shaping forces, suggesting extensive diversification of the mutation and selection spectra during evolution of the bacterial chromosomes. Analyzing base substitutions between orthologs in 33 closely related strains in 6 clades, Rocha et al.  have previously reached similar conclusions. Based on this, these authors proposed that the skew shaping process is multifactorial. The same conclusion may be drawn from the current study.
The RAP and TAP analysis at the three codon positions (Figure 6) reveals interesting contrasts between the two. Most remarkable is the omnipresence of TAP, in contrast to the absence of RAP in portions of the bacteria. The analysis also shows that TAP is (at least partly) contributed by selection pressure at the translational level that generates the biased base composition at the first two codon positions. It implies that the apparent lack of TAP in some bacterial chromosomes does not reflect the absence of it, but rather cancellation among the effect on the three positions.
The basic principle of deducing RAP and TAP by comparison of base composition skews in the CDS on the two replicating strands used in this study is similar to that employed by Mackiewicz et al.  and Lobry and Sueoka .
Mackiewicz et al.  performed 'detrended DNA walks' on seven bacterial chromosomes, which displayed base composition skews on a two-dimensional plot. DNA walks on nucleotides in the CDS on two complementary strands of the chromosomes were added or subtracted. Addition of the skews would cancel the effect of RAP, thus revealing other mutation and selection effect (essentially equivalent to RAP). Subtraction, on the other hand, would cancel the effect of TAP, and leaving RAP. The subtraction (RAP) curve would exhibit a sign switch whereas the addition (TAP) curve would exhibit a maximum and minimum at the origin and terminus of replication. The results of such analyses were available for the chromosomes of B. subtilis  and B. burgdorferi .
It is noteworthy that the relative RAP and TAP thus deduced were for both GC and AT skews combined. They may further be broken down into RAP and TAP specific for GC and AT skews by applying vector analysis, i.e., RAP for GC and AT skews corresponding to y 1 - y 2 and x 1 - x 2, respectively (Figure 8A); and 'TAP' for GC and AT skews corresponding to y c - 0.5 and x c - 0.5, respectively (Figure 8B). By applying this method on the original data of Lobry and Sueoka , we extracted relative RAP and TAP values from 32 bacterial chromosomes that are included in our study, and compare them to those derived mathematically in this study. Correlation is very high (r = 0.96 and 0.97, respectively) between their and our RAP values for both GC and AT skews. Correlation is also very high (r = 0.96) between their and our TAP values for AT skews, but slightly lower (r = 0.83) for GC skews.
Our RAP vs. TAP analysis (Figure 5) shows that the cytosine deamination model may be applicable to RAP in a number of clades, but not a major contributor to TAP in all bacteria. This suggests that, if the model is applicable, the exposed single strands during replication and during transcription may differ in length, binding proteins (e.g., single-strand binding proteins), and other characteristics.
Our conclusion appears to be different from previous studies that favor the cytosine deamination model for TAP [for example, [27, 34, 35]]. However, the previous studies are based on investigation of a small number of actively transcribed genes from Proteobacteria. For example, the conclusion of Francino and Ochman  was based a phylogenic comparison of an approximately 1.8-kb actively transcribed non-CDS in 12 E. coli strains. The E. coli chromosomes (No. 80~83) in the present study exhibit very low or statistically insignificant χA and σA T , but moderate χG and σG T , indicating a lack of a major contribution by cytosine deamination.
It is noteworthy that the number of bacterial genes highly expressed during exponential growth appears to be relatively small [42, 43], and, there is evidence for the absence of cytosine deamination effect in the transcription of cryptic genes . Thus, it is possible that either the small sample in the previous studies is not representative for chromosomes as a whole, or that the previously postulated cytosine deamination effect on TAP acts only on a small number of highly expressed genes, and represents only a minor fraction of overall TAP.
On the other hand, our analysis shows that cytosine deamination may play a significant role in RAP in α-, β-, γ-, and (to a lesser degree) ε-Proteobacteria, and Mollicutes. This is in general agreement with the base substitution analysis of Rocha et al. , which shows cytosine deamination is applicable in RAP in Bordetella (β-Proteobacteria), E. coli (γ-Proteobacteria), Neisseria (β-Proteobacteria), and Streptococcus (Firmicutes), but not in, Bacillus and Staphylococcus (two Firmicute clades), and Rickettsia (α-Proteobacteria). The latter, being intracellular parasites, may be considered an exceptional case.
In contrast to the situation in Proteobacteria, cytosine deamination cannot be applicable to RAP in most Firmicutes (except Mollicutes). This suggests a major difference in the state of single-stranded DNA exposed during replication in these two phyla of bacteria. In Proteobacteria, both strands of the chromosomes are replicated by DnaE. In contrast, the Firmicute genomes encode an additional replicase PolC , which is known to replicate the leading strand in B. subtilis . Perhaps the two distinct systems generate single-stranded intermediates of very different states.
The distinctly different cytosine deamination effects between these two phyla of bacteria correspond to the separation of the base composition skews into the 'purine excess' trend in the Firmicutes and 'keto excess' trend in the Proteobacteria (Figure 1). The cytosine deamination model has been found to be the most likely cause of TAP in other studies [12, 32]. However, the sample size (28) was considerably smaller than that in this study, and no clade-based analysis was performed.
Because of the positive (albeit loose) correlation between the G+C content and size of bacterial chromosomes , it is not surprising to find that there is also a weak negative correlation between the chromosomal size and the TAPs (σG T and σA T ; Figure 6B). However, there is an under-representation of larger bacterial chromosomes in the current set of sequenced genomes. It is possible that the convergence towards zero skews, TAPs, and RAPs observed in the larger chromosomes (Figure 6B) may be due to the small sample of large chromosomes. This remains to be investigated when more large bacterial chromosomes are sequenced.
On the other hand, the diminishing skews, TAPs, and RAPs in large chromosomes may be real and reflect an evolutional trend. It is reasonable to assume that the larger chromosomes have generally evolved from smaller ones (except for reductive evolution in parasites) by acquiring extra genes necessary for more complex structures (through differentiation) and contingency functions (e.g., secondary metabolism) that provide adaptability and a competition edge. Under this premise, the increase in gene number (and chromosomal size) is accompanied by an increase in G+C content and decrease in RAP, TAP, and base composition skews. If such an evolution trend is real, it suggests that RAP and TAP were stronger among ancestral bacteria.
The RAP and TAP analysis in this study may provide guidance for further bioinformatic and genetic investigations into the underlying principles for these mutation and selection forces. The best candidates for such investigations are probably closely relative species with distinct base composition parameters, such as the aforementioned Rickettsiales chromosomes (number 101 – 106), which display remarkably low G+C contents and high σG R , σG T , and σA T compared to other α-proteobacterial chromosomes, and the chromosomes of two spirochaetes, T. denticola and T. pallidum (number 184 and 185) that differ greatly in G+C contents and σA T (opposite signs). Moreover, the chromosomes of Moorella thermoacetica and Thermoanaerobacter tengcongensis (number 55 and 56) are also unique among the Clostridia in displaying atypically low χG cd , χA cd , σG T , and σA T . All these provide opportunities for comparative investigation to uncover the underlying genetic elements.
In summary we have analyzed the base composition skews and the underlying mutation/selection forces associated with replication and transcription among 185 bacterial chromosomes in 11 phyla. The diverse patterns that are characteristic for different clades provide clues to the evolution that shape these skews. The correlation among the skews, the G+C content, and the size of the chromosomes also hints at the direction of the trends of the evolution.
The chromosomal sequences were taken from National Center for Biotechnology Information . Prediction of the oriC location followed the basic procedure of Mackiewicz et al.  using two methods: (i) DNA asymmetry (i.e., sign switch site of either GC or AT skew) and (ii) location of dnaA gene. A putative oriC was assigned at the first base of dnaA, when the locations predicted by these two methods were within 7% of the length of the chromosome. Chromosomes with more than one dnaA homologs were excluded. The ter site was assigned to be directly opposite of oriC for circular chromosomes. For linear chromosomes (such as those of Streptomyces and Borrelia), the ends are the ter sites.
A, T, G, and C denote the numbers of these nucleotides in the sequence or replicon under consideration. Their locations on the leading and lagging strands are denoted by subscript d and g, respectively. Overall base composition skews with respect to the leading and lagging strands in a bacterial chromosome are designated with the symbol χ, and defined by and .
A chromosomal sequence is divided into two super sets, CDS and non-CDS. CDS represent all the protein coding sequences, and non-CDS the rest of the sequences (including stable RNA-coding sequences). Quantitative parameters specific for CDS and non-CDS bear a cd and nc subscript, respectively. χG cd and χA cd represent calculated base composition skew with respect to the leading and lagging strand in CDS only; and χG nc , and χA nc , in non-CDS only. For example, , where the superscript cd denotes the bases in CDS.
The base composition skew in CDS is denoted by the symbol σ. Combined with the replicating strand designations, σG d and σG g denote GC skews in the CDS on the leading and lagging strands, respectively. σA d and σA g are similarly defined.
χCDS, a measure of the skew of distribution of CDSs with respect to the replicating strands, is defined as:
The statistical significance of the calculated skews was estimated by binomial distribution probability and a χ 2 test.
The processed data were charted and statistically analyzed using Aabel (version 2.1, Gigawiz) running under Mac OS X (version 10.4.8) on a PowerMac (Apple).
The complete analytical data of the 185 chromosomes are available in Additional file 1.
coding sequence on the leading strand
coding sequence on the lagging strand
We thank Professor Ralph Kirby for critical reading of the manuscript and suggestions for improvement, and an anonymous reviewer for suggesting the analysis of the skew effect at different codon positions, which resulted in new insights. This study is supported by research grants (NSC93-2321-B010-004, NSC94-2321-B010-005) from National Science Council, R. O. C. and a grant (Aim for the Top University Plan) from the Ministry of Education, R. O. C.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.