Asymmetrical pattern of GC-skew within genes of most unicellular fungi
We first examined the average pattern of GC-skew at three model yeast species, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Candida albicans. In each species, we averaged the skews of all genes at each position (base pair) relative to the translation start and stop sites (Figure 1a, b). In all three species, we observe a trivial positive skew at the exact start and stop sites (ATG and stop codons are G-skewed). In addition, we find a tendency for negative GC-skew (preference for C) around the translation start sites and positive GC-skew (preference for G) around translation stop sites. Notably, the magnitude of these skews varies between species, with weak GC skews in S. cerevisiae and S. pombe and a much stronger GC skew in C. albicans. In fact, in C. albicans, simple inspection of the GC skew in relation to genes shows that many, but not all, genes have a pattern of GC skew that increases from 5' to 3' end. (Figure 1c)
To evaluate the scope of this pattern we examined for each gene whether the change in skew between the translation start and stop sites is significant. Genes were shuffled 1000 times, and GC skew in the 5' and 3' hundred base pairs calculated. If the difference in GC skew was in the top 5% of randomized data, we scored the GC skew change as significant. We found a significant change of GC skew along 33% of the genes in C. albicans but only 14% and 7.5% of the genes in S. cerevisiae and S. pombe, respectively. Thus, the observed pattern of GC skew is a common property of C. albicans genes but is much less frequent in the other model yeast species.
The variability among the three model yeast species prompted us to examine the patterns of GC-skews among additional species. We thus downloaded the genomic sequences and gene annotations of all sequenced unicellular fungi and examined the average pattern of GC-skew in each species (Figure 2a). Surprisingly, we found that most species have a considerable average GC skew, as in C. albicans but not as in S. cerevisiae and S. pombe. Furthermore, most species follow the trend described above, with a relative preference for C at the 5'-end and for G at the 3'-end, suggesting that this is a conserved phenomenon among unicellular fungi.
The species can be divided roughly into four groups. At the top of Figure 2a can be seen those with a steep gradient of C-bias from the promoter to the begining of the coding region, followed by the majority of the gene G-biased. These are mainly Candida and Saccharomyces species. The second group is the Schizosaccharomyces species, which are C-biased in the promoter, but not GC-skewed within the gene, except to a very small extent. The third group is the sensu stricto Saccharomyces species, with a few non-Saccharomyces members. This group is not GC-skewed in the promoter, but shows a slight pattern of increasing G-bias in the gene, which does not extend into the 3' UTR. The fourth group, which consists of most of the species analysed, is C-biased for most of the gene, with a small G-biased region at the 3' end, which often extends into the UTR. As expected, closely-related species tend to share the same GC-skew signature, although there are some exceptions. For example, Candida glabrata and Saccharomyces castellii are closely related (Despite its genus name, the Candida glabrata genome is more similar to the genomes of Saccharomyces than to those of Candida species), but S. castellii is clearly in group one and C. glabrata in group three.
GC-skews affect amino-acid choice
Since the pattern of GC skew shifts from negative to positive along genes, we examined the abundance ratio of each nucleotide, between the 5' and 3' ends. Strikingly, A and T showed only minor deviations in abundance between 5' and 3' ends, whilst G and C showed considerable gene-position effects, with G being highest at the 3' end, and C highest at the 5' end. (Figure 3a).
Next we examined if the usage of amino acids also differs between 5'-ends and 3'-ends. Indeed, we find that most amino acids are biased to either 5'-ends or 3'-ends of C. albicans genes, in agreement with the GC-skew of the corresponding codons (Figure 3b). (P < 0.001 for all amino acids except A, C, F, N, and R; Chi-square test). For example, the Tryptophan (W) codon, UGG, is G-biased, and tryptophan is approximately twice as common at the 3'-ends than at the 5'-ends of C. albicans genes.
The correlation between GC-skew and a preference for certain amino-acids could indicate that the choice of amino acids is under selection and that this generates the observed pattern of GC skew. For example, we noted that pairs of similar residues (serine and threonine, leucine and isoleucine, aspartic acid and glutamic acid) often have similar fold-change between the 5' and 3' ends (Figure 3b). However, the pattern of increased GC skew from the 5'-end to the 3'-end continues beyond the coding region, both into promoters and into 3'- UTRs (Figures 1, 2). In fact, for most species, the strongest (negative) GC skew is observed at the promoter and then decreases sharply at the coding region. Thus, the patterns of GC-skew could not be accounted for solely by selection for amino-acid choice, and we propose that it primarily reflects mutational load.
The mutational load hypothesis would predict a greater effect at fourfold-degenerate positions, as these positions are believed to be less constrained, yet we find an effect of similar magnitude at non-degenerate positions (i.e. first and second codon positions) (additional file 1, Figure S1). This could, in principle, indicate a role for selection for amino-acid choice in generating the observed patterns of GC-skew. However, we note that fourfold-degenerate positions are constrained by the choice of codons, as some codons are more efficiently translated than others. Notably, efficient codons are particularly enriched with Cytosines (over Guanines) in their third-base positions (additional file 1, Figure S2), and are depleted the 5' ends of genes [17, 18]. This should generate an opposite bias to the one that we observe, with abundance of Cytosines (negative GC-skew) at the 3' ends of genes. We thus speculate that selection pressure for increased translational efficiency along genes [17, 18] reduces the GC-skew at fourfold-degenerate positions. Accordingly, the effects of constraint for amino acid choice, at non-degenerate positions, and translational efficiency, at degenerate positions, could reduce the impact of mutational load to a similar extent, thereby generating seemingly codon-position independent increase in GC-skews along genes.
Uniform AT-skews are correlated with positive, but not negative, GC-skews
Most species also have considerable AT-skews within genes, although somewhat lower than GC-skews (Figure 2b), and the extent of AT and GC skews are correlated both across genes within the same species (Figure 4a) and across different species (Figure 4b). However, in contrast to GC-skews, AT-skews are largely constant across the length of genes, with a general preference for A over T (positive AT-skew) and correlate primarily with the extent of GC-skew at the middle of genes, but less so at the 5' and 3' ends (Figure 4a). These results suggest that a common mechanism generates a uniform and positive AT- and GC-skew throughout genes, but that additional unknown mechanisms generate the decreased GC skew at 5' ends and increased GC skew at 3'-ends. Consistent with this possibility, we note that for some species such as C. albicans, the pattern of skew fits well with a model of three regimes: a constant GC-skew at most of the gene, a gradual (linear) decrease of GC-skew at the 5'-end and a gradual (linear) increase at the 3'-end (Figure 4c).
GC-skews may be associated with transcription and DNA methylation
The observation that GC-skews extend from coding region to the surrounding UTRs and is in fact strongest at the 5'-UTR supports the possibility that GC-skews are generated by a mechanism that is coupled to transcription. To further examine this possibility we examined the correlation between GC-skew and expression levels across C. albicans genes. We found significant (P < 0.001), yet very weak correlations with both the negative GC-skew at the 5'-UTR (R = -0.096) and the positive GC-skew at the 3'-UTR (R = 0.071) (additional file 1, Figure. S3). These weak correlations support the possibility of transcription-coupled GC-skews but suggest that the extent of GC skew is not determined primarily by the level of transcription but rather largely by other aspects of the transcriptional process. For example, PolII dynamics (e.g. stalling) and the gene-specific binding of various regulatory factors to the transcriptional machinery may have strong effects on the degree of GC-skew.
Methylation of cytosine bases is an evolutionary ancient system that has been lost separately in S. cervisiae and S. pombe (which have weak skews), but retained in N. crassa and C. albicans (which have strong skews) [19, 20], and therefore might be related to the mechanisms that generate the observed GC-skews. However, we did not detect any significant difference in GC skews between the 150 methylated Candida genes listed in [21] and the genome-wide level of GC skew, nor did we find any correlation between the level of methylation in those 150 genes and their GC skews (data not shown).