GC3 biology in corn, rice, sorghum and other grasses
© Tatarinova et al; licensee BioMed Central Ltd. 2010
Received: 16 September 2009
Accepted: 16 May 2010
Published: 16 May 2010
The third, or wobble, position in a codon provides a high degree of possible degeneracy and is an elegant fault-tolerance mechanism. Nucleotide biases between organisms at the wobble position have been documented and correlated with the abundances of the complementary tRNAs. We and others have noticed a bias for cytosine and guanine at the third position in a subset of transcripts within a single organism. The bias is present in some plant species and warm-blooded vertebrates but not in all plants, or in invertebrates or cold-blooded vertebrates.
Here we demonstrate that in certain organisms the amount of GC at the wobble position (GC3) can be used to distinguish two classes of genes. We highlight the following features of genes with high GC3 content: they (1) provide more targets for methylation, (2) exhibit more variable expression, (3) more frequently possess upstream TATA boxes, (4) are predominant in certain classes of genes (e.g., stress responsive genes) and (5) have a GC3 content that increases from 5' to 3'. These observations led us to formulate a hypothesis to explain GC3 bimodality in grasses.
Our findings suggest that high levels of GC3 typify a class of genes whose expression is regulated through DNA methylation or are a legacy of accelerated evolution through gene conversion. We discuss the three most probable explanations for GC3 bimodality: biased gene conversion, transcriptional and translational advantage and gene methylation.
Since bimodality has been detected in only some plant families, we suggest that this feature has developed independently in warm-blooded animals and in certain members of the commelinids clade. The GC3 bias could possibly be explained as a consequence of some larger genomic bias. For example, over three decades ago, Macaya et al.  observed that some genomes contain isochores, megabase-long regions with either high or low GC contents. Isochores have been reported in warm-blooded vertebrates and in some reptiles [7–9]. Compositionally homogenous DNA regions of at least 50-100 kb have been found in several dicot and monocot genomes (pea, sunflower, tobacco, barley, rice, maize, oat and wheat), supporting the existence of isochores in plants [10–12]. It is not yet known whether all eukaryotic genomes are characterised by an isochore structure .
Press and Robins  reported that high GC isochores contain a mixture of GC- and AT-rich genes, whereas high AT (low GC) isochores contain mostly AT-rich genes. Genes found within high and low GC isochores are functionally distinguishable by statistical analysis of their gene ontology categories . The authors suggested that some genes require AT-richness, while others, contained within large coherent blocks, have a strong bias towards mutations to GC.
The neutral theory of evolution states that for a change to come about in the population as a whole, the new characteristic must be as good as or better than the old one. Under the assumption of neutrality, genes would acquire characteristics of the surrounding isochores. Therefore, noncritical elements such as synonymous bases in 3rd codon positions and 5' and 3' UTRs should be GC-rich in high GC isochores. In fact, several groups have found a positive correlation between the GC3 levels of a gene and of its surrounding genomic area [2, 13, 14]. Mouchiroud et al.  found an 8-fold enrichment for high-GC3 genes within the top 3% of the GC-richest isochores in humans. These observations support the neutrality assumption. Elhaik et al. , however, found little correlation between GC3 and isochores within a species and none between species. Furthermore, the correlation with generally GC-rich areas is only modest (R2 = 0.43) , suggesting that a more complex explanation must be sought. Moreover, isochores have been reported in both GC3 unimodal and bimodal organisms and therefore cannot provide an exclusive explanation for GC3 bimodality.
Campbell and Gowri  described differences in codon usage in different plant genomes, algae and cyanobacteria, and showed that bimodality existed only in monocots. In a series of publications [10, 11], GC3 levels were analyzed for five Poaceae and three dicot species. It was found that compositional patterns in the dicot species resembled those of cold-blooded vertebrates, while the grasses resembled warm-blooded vertebrates. Bimodality of GC3 distribution in grasses, and specifically in rice, was reported by Carels and Bernardi , Wang and Hickey  and Salinas et al. . These authors explained the differences in codon usage among some rice genes by a rapid evolutionary increase in GC content. They gave two possible explanations for the observed bi-modality: (1) positive Darwinian selection, acting at the level of translational efficiency; and (2) neutral mutational bias.
Several characteristics related to high GC3 genes have been observed to date. Duret et al.  examined vertebrate sequences and described two properties of high-GC3 genes: the proteins are generally shorter, and introns are either absent or short in comparison to low-GC3 genes. Carels and Bernardi  compared genes in plants with generally high GC content to those with generally lower GC content. Although the differences were most prominent in Gramineae, they observed that other families of plants including dicots (e.g. Brassicaceae and Fabaceae) could be segregated by GC distribution. They also observed the tendency towards short or no introns in GC-rich genes and identified a correlation between GC content, intron size and location among homologs across species. Duret et al.  reported a small correlation between GC3 and the general GC richness of the surrounding >10 kb of genomic sequence. The relationship between gene length and GC3 for many organisms has been analyzed in a number of publications during the last decade [18–21]. Gene lengths in C. elegans, D. melanogaster, A. thaliana and O. sativa are negatively associated with GC3. Shorter genes in bacteria tend to have more variable expression levels, and selective pressure on codon usage is also higher in shorter genes . It was recently demonstrated that corn genes with high GC3 tend to be mono-exonic . It has been reported that shorter and intron-poor genes have either stronger [24–26] or more variable [27, 28] expression levels because introns can delay regulatory responses and are selected against in genes whose transcripts require rapid adjustment for survival of environmental challenges . Ren et al.  showed opposite trends in plant and animal genomes; highly expressed genes tend to be longer in plants and shorter in animals. A recent paper by Jeffares et al.  proposed a reconciliation of these observations: both plants and animals show consistent inverse relationships between intron density (defined as intron number/unspliced transcript length) and rapid regulation (measured as the fastest rate of change of gene expression intensity in a time course experiment).
An influence of translation on codon bias has been proposed on the basis of increased hydrogen bonding and hence strength of G-C pairing in contrast to A-T pairing. This increased pairing may improve transcript stability at the mRNA level or improve the speed or fidelity of translation, thereby improving protein production, as has been shown in a number of species including bacteria and some eukaryotes . This is supported by the analysis of Campbell and Gowri , who studied codon usage in plants and found two groups of genes that had preferences for GC-ending codons in monocots but not dicots. Additionally, Jabbari et al.  found a correlation between high-GC genes and amino acid hydropathy. However, Wang and Hickey  used concordance analysis of synonymous and non-synonymous differences to show that the primary effect is not at the codon or protein level.
Several groups [3, 14, 18, 22, 31] have suggested that the effect of high or low GC3 may be at the level of transcription. The generally shorter introns and coding sequences of high-GC3 genes led Carels and Bernardi  to suggest that selective pressure has driven housekeeping and non-regulated genes to higher GC contents while the longer AT-rich genes have been maintained to provide more opportunity for regulation and alternative splicing. Clay et al.  looked at CpG islands upstream of GC-rich and GC-poor transcripts and found little correlation. Nevertheless, the observation of higher GC within the introns of GC3 transcripts as well as the 5' region, and the weak correlation between general genomic GC content and GC3 level, suggests that the transcriptional machinery may be involved.
Conflicting ideas about codon usage bias and expression levels have been published. Wang and Hickey  reported that codon bias is not correlated with gene expression. Using S. cerevisiae expression and sequence data, Dekker  showed that on average, GC-rich genes are significantly more transcriptionally active than AT-rich genes. A recent paper by Roymondal et al.  presented an expression measure of a gene, devised to predict the level of gene expression from relative codon bias. They suggested that since the bias is caused by the presence of optimal codons that are recognized by the most abundant tRNA species, the high-GC3 peak appears as a manifestation of natural selection acting in grasses and warm-blooded vertebrates. This process shapes the codon usage patterns for selected genes to gain optimal expression levels in response to changing environments. Roymondal et al.  mentioned that within any genome, codon bias tends to be much stronger in highly expressed genes.
Attempts have been made to discover an association between functional classes of genes and GC3. Carels and Bernardi  characterized the high GC-containing transcripts as housekeeping and photosynthetic. D'Onofrio et al.  found GC3 to be higher in genes involved with cellular metabolism and lower in those involved with information storage and processing. These observations are consistent with previous studies of general GC contents of genes in Arabidopsis .
The existence of a codon usage gradient along the coding regions was previously discussed by Hooper et al. , who outlined the possible advantages of a positive GC3 gradient. Based on an analysis of E. coli genes, the authors suggested that G3-containing codons may be translated more quickly and with lower error rate than other codons, thus avoiding congestion at the ribosomes because of a gradual increase of speed of translation along the gene. Wong et al.  discovered that in the plant kingdom, O. sativa genes are richer in GC at the 5' end than at the 3' end. This gradient and imbalance in nucleotide strand composition extends beyond the coding region; transcription start sites are characterized by a pronounced peak in CG-skew [36, 37], and mRNAs tend to be purine-rich (A for low GC organisms and G for high GC organisms) [38, 39]. Avoidance of unnecessary 'kissing interactions' between and within mRNAs was mentioned by Lao et al.  as a possible explanation for purine loading. Species adapted to hotter environments have stronger selection pressure towards purine loading since nucleic acids are "stickier" at high temperatures . This effect is the most pronounced at the wobble position of codons.
Aside from transcriptional and translational influences, it is possible that the driver for differences in GC3 operates at a recombinational level. Gene duplication in the Poaceae has been mentioned as one possible explanation of GC3 bimodality . The authors suggested that duplicated genes in O. sativa can be partitioned into 10 blocks by chromosomal location; these blocks have significantly different synonymous substitution rates (Ks). Wang et al.  found that Ks was negatively correlated with the GC content at the third position of codons (correlation coefficient -0.455) and that the bimodal distribution of Ks was split into two unimodal distributions corresponding to high- and low-GC3 genes. Related to this idea are advances in understanding of the accelerated evolutionary rates of some genes. Holmquist  described a model in which hybridization of similar genes during recombination resulted in a bias toward higher GC content in the recombined areas. Birdsell  demonstrated that recombination significantly increases GC3 in a selectively neutral manner; the GC-biased mismatch repair system evolved in various organisms as a response to AT mutational bias. Birdsell  suggested that unimodal low-GC3 species may have prevailing AT mutational bias, random fixation of the most common types, or mutation or absence of GC-biased gene conversion . The authors hypothesized that recombination is more likely to occur within conserved and regulatory regions of the genome; therefore, introns, intergenic regions and pseudogenes tend to have lower GC contents than ORFs. Galtier et al.  noticed that GC-biased gene conversion, frequently accompanied by an increase in GC3, influenced the evolutionary trajectory of human proteins by promoting the fixation of deleterious AT→GC mutations. These observations raise the possibility that the high-GC3 class of genes might have appeared as a consequence of accelerated evolution.
With the increasing amount of genomic and transcript information available within the public databases as well as the improved understanding of gene conversion and gene regulatory mechanisms, we returned to the puzzle of GC3 bimodality in grasses in an effort to understand the significance of this phenomenon. We concentrate our discussion around Oryza sativa as it is one of the best-studied grass species at the genomic level.
Gene classes in several organisms are readily identified by GC3 plots
We revisited the extent of variation of GC3 found in various species. In Figure 1 we have plotted the distributions of GC3 for 12 plant and animal species. Distributions of GC3 in H. sapiens, O. sativa, C. reinhardtii and Z. mays are clearly bimodal; those in A. cepa, A. thaliana, G. max, S. cerevisiae and C. elegans are unimodal; and B. napus, D. rerio and M. musculus have intermediate distributions. Uni- and bimodality of GC3 distributions in various organisms have been reported previously [3, 18, 46] and our results are consistent with the earlier observations on the species tested.
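To make the quantity behind these distributions concrete, the following is a minimal sketch of how GC3 could be computed for a single coding sequence. The function name `gc3` and the decision to count the stop codon and to drop a trailing partial codon are assumptions made for illustration, not details taken from the analysis described here.

```python
def gc3(cds: str) -> float:
    """Fraction of codons whose third (wobble) base is G or C.

    Assumes `cds` is an in-frame coding sequence starting at the first
    codon position. Any trailing partial codon is ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # length rounded down to whole codons
    third_bases = seq[2:usable:3]         # every third base: positions 3, 6, 9, ...
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


# Toy sequence with codons ATG GCC AAG CTG TAA; third bases are G, C, G, G, A.
toy_cds = "ATGGCCAAGCTGTAA"
toy_value = gc3(toy_cds)
```

Applying such a function to every coding sequence of a genome and plotting a histogram of the results would give a distribution of the kind shown in Figure 1.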
Isochores may not explain the presence of GC3-rich genes in grasses
Previous reports on GC3-rich genes have suggested that these are present in GC-rich regions of the genome, aka isochores [10, 12, 47]. The authors suggested that GC3 bimodality in grasses came about because these genes are located in regions of their respective genomes that differ in G+C content. Two decades of full genome sequencing and annotation of numerous plant genomes make it worthwhile to revisit the issue of codon usage in plants and plant isochore organization. In order to answer the question of isochores in grasses, we analyzed the GC contents of coding and non-coding sequences in O. sativa. Overall, the correlation of GC3 values between adjacent genes is 0.05, indicating that there is no significant clustering of these genes. We separated all mRNA-validated rice genes into two groups on the basis of GC3 content: the "low" group, where GC3 < 0.8, contains 11,608 genes; and the "high" group, where GC3 ≥ 0.8, contains 4,889 genes. The cut-off point between the two groups was placed at the minimum of the GC3 distribution between the two peaks. (This approach is different from the one outlined in  and , where the two classes were distinguished by overall GC content. In those two studies, the average GC3 contents were 0.89 in the high group and 0.69 in the low group.) We analyzed the spatial distribution of genes with high GC3 values. Of the 4,889 genes in the high group, 3,661 are evenly distributed across the genome; 485 of the remaining 1,228 genes occur in 36 clumps of 10 or more genes (Additional file 1: Supplementary Figure SF1 and Additional file 2: Supplementary Table ST1). Five of these clumps are likely to result from relatively recent gene duplication, since they consist of genes with identical PFAM annotations. From the analysis of seven animal species, Elhaik et al.  inferred that GC3 can only explain a very small proportion of the variation in GC content of long genomic sequences flanking the genes, and correlations between GC3 and GC in the flanking region decayed rapidly with distance from the gene. Accordingly, we examined 1,000 nucleotides upstream of the 16,497 rice genes and also found no significant correlation in GC content between the open reading frames and flanking regions. The GC contents of the high and low groups gave nearly identical unimodal bell-shaped frequency distributions centered at GC = 0.4. These results suggest an absence of isochore organization in the rice genome and indicate that the high-GC3 genes are not closely associated with GC-rich regions in rice.
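The two steps described above, splitting genes at the GC3 cut-off of 0.8 and correlating the GC3 values of neighbouring genes, can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, the input is assumed to be GC3 values already ordered by position along one chromosome, and the plain Pearson coefficient is an assumption about how the adjacent-gene correlation was measured.

```python
from math import sqrt

GC3_CUTOFF = 0.8  # cut-off between the "low" and "high" groups, as stated in the text


def split_by_gc3(gc3_by_gene):
    """Return (low, high) lists of gene identifiers using the 0.8 cut-off."""
    low = [gene for gene, value in gc3_by_gene.items() if value < GC3_CUTOFF]
    high = [gene for gene, value in gc3_by_gene.items() if value >= GC3_CUTOFF]
    return low, high


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def adjacent_gene_correlation(ordered_gc3):
    """Correlate each gene's GC3 with that of the next gene along the chromosome."""
    return pearson(ordered_gc3[:-1], ordered_gc3[1:])


# Hypothetical toy data: gene identifiers and values are made up for illustration.
toy = {"geneA": 0.92, "geneB": 0.55, "geneC": 0.80, "geneD": 0.41}
low_group, high_group = split_by_gc3(toy)
```

A value of `adjacent_gene_correlation` near zero, as reported above for rice, would indicate that a gene's GC3 says little about the GC3 of its neighbour.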
GC3 correlates with variability of gene expression
GC3 correlates with the presence of an upstream TATA box
Number of expressed paralogs and orthologs is negatively associated with GC3
GC3 is negatively correlated with gene length and intron density
Genes in the middle of the GC3 spectrum (0.4<GC3<0.7) have a negative correlation with ORF length (Pearson's correlation coefficient = -0.3), whereas for genes in the high GC3 class and for those with GC3<0.4, it is approximately 0. As was previously observed , variability of gene expression is negatively correlated with intron density. We computed Pearson's correlation coefficient between GC3 and intron density for O. sativa and S. bicolor: for both grass species it is approximately -0.3 (Additional File 1: Supplementary Figure SF3). Genes with high GC3 tend to be mono-exonic . This is consistent with our observation of a positive relationship between gene expression variability and GC3. On the basis of this evidence, we suggest that rapidly evolving genes are shorter, have more variable expression and are GC3-rich. More "evolutionarily stable" genes tend to accumulate introns and increase the ORF length.
Gradient of codon usage along the gene
Codon usage and gene classes
The first two nucleotides in a codon are more reflective of gene function than the third one. Using coding sequences of O. sativa, we computed average GC3 and GC12 for GO and PFAM annotations. The coefficient of variation for GC12 is approximately three times smaller than the coefficient of variation for GC3. However, the third position in the codon also affects gene function. Liu et al.  demonstrated that synonymous codon usage and gene function are strongly correlated in O. sativa; they found that genes involved in metabolic processes have a preference for C or G in the third position of a codon. Different PFAM families show affinity for high- or low-GC3 classes. For example, O. sativa genes annotated as "expressed proteins" are more prevalent in the low class (22% vs. 33%) and alpha-expansins are more prevalent in the high group (relative abundance is 46). Details are given in the Supplementary data (Additional File 2: Supplementary Table ST2). It appears that GC3 increase tends to co-evolve in some PFAM families of grass genes across multiple organisms. The distributions of GC3 in histone, ribosomal and chlorophyll a-b binding protein coding genes are very similar for rice and corn. In both organisms, 80% of chlorophyll a-b binding proteins have GC3>0.85, ribosomal proteins are approximately normally distributed around GC3 = 0.65, and 60% of all histones have GC3>0.75. Another way to look at the relationship between gene category and GC3 is by considering GO annotation (see Additional File 2: Supplementary Tables ST3-ST7). D. rerio, M. musculus, H. sapiens, C. reinhardtii, O. sativa and Z. mays have higher GC3 values than A. thaliana and we were curious to see if GC3 is consistent between these organisms and GO categories. The high-GC3 species also have consistently higher GC3 values for genes from the following GO classes: electron transport or energy pathways, response to abiotic or biotic stimuli, response to stress, transcription and signal transduction. Therefore, we conclude that certain classes of genes are characterized by high GC3 values across kingdoms.
GC3 in CDS and GC genomic context are not correlated
High-GC3 genes have more targets for methylation
High-GC3 genes and GC-biased gene conversion
Many previous studies have demonstrated a significant association between GC3 and recombination rate across different plant and animal species [41, 42, 44, 45, 60–62]. The conclusion is that high GC3 content in an organism indicates a recombining genome. Similarly, the presence of two distinct GC classes of genes may suggest the existence of recombining and non-recombining regions within that genome. To support this hypothesis, we computed the mutation rates of rice genes (see Methods). For our curated dataset of 16 K rice genes, we found a positive correlation between the density of SNPs per 1000 nucleotides and GC3 (R2 = 0.71, SNP = 1.114+0.583GC3). Association with overall GC content is much weaker, R2 = 0.32. Therefore, we conclude that high-GC3 genes accumulate more mutations and are located in the highly recombining regions of the rice genome.
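As an illustration of the kind of fit reported above, the sketch below implements ordinary least squares for a single predictor and evaluates the published line SNP = 1.114 + 0.583 GC3. The function names are hypothetical, and the use of simple least squares is an assumption; the text does not state how the regression was fitted.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


def predicted_snp_density(gc3, intercept=1.114, slope=0.583):
    """SNPs per 1000 nucleotides predicted by the line quoted in the text."""
    return intercept + slope * gc3


# Predictions at two GC3 values chosen only for illustration.
low_example = predicted_snp_density(0.4)
high_example = predicted_snp_density(0.9)
```

Under the quoted coefficients, a gene with GC3 = 0.4 is predicted to carry roughly 1.35 SNPs per 1000 nucleotides and a gene with GC3 = 0.9 roughly 1.64, so the fitted effect across the observed GC3 range is modest in absolute terms even though the reported R2 is high.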
Importance of GC3
Analysis of gene-specific codon usage bias shows that GC3 is the major characteristic of codon utilization in Poaceae. In order to demonstrate this, we used Principal Component Analysis (PCA) to find a basis for the space of codon vectors. Approximately 50% of the variance in codon usage is explained by the first principal component; this component has an almost perfect negative correlation (-0.98) with GC3. The remaining components contribute at most 4% each to the variance; the second principal component is weakly correlated to GC3 skew.
Deviations from unimodal bell-shaped distributions of GC3 appear in many species, but grasses have very pronounced bimodal distributions (Figure 1, Additional File 1: Supplementary SF8 and SF9). Bimodality in warm-blooded vertebrates can be explained by the presence of isochores. Although there are many similarities between genes in high-GC human isochores and high-GC3 genes in grasses, the isochore hypothesis does not fully explain the existence of high-GC3 genes in grasses: first, there is no correlation between ORFs and the flanking regions; second, most species with isochores do not have a high-GC3 peak. Possible causes of bimodality may be elucidated by comparing genes in the high- and low-GC3 classes. These classes differ in nucleotide composition and composition gradients along coding regions. High-GC3 class genes have a significantly higher frequency of CG dinucleotides (potential targets for methylation); therefore, there is an additional regulatory mechanism for high-GC3 genes. Springer et al.  reported that out of eight classes of methyl-CpG-binding domain proteins present in dicots, only six exist in monocots, suggesting a difference between dicots and monocots in silencing of methylated genes.
Two competing processes may affect the frequency of methylation targets: the GC-biased mismatch repair mechanism and AT-biased mutational pressure. In recombining organisms (e.g., grasses and warm-blooded vertebrates), the GC content of coding and regulatory regions is enhanced because of the action of the GC-biased mismatch repair mechanism; this effect is especially pronounced for GC3. Recombination has been shown to be a driving force for the increase in GC3 in many organisms . Repair (recombination) happens all over the genome with a certain precision, leading to an increase in GC. If repair did not occur in defence-related genes, the organism may fail to survive or to reproduce. If repair did not happen in less important genes (and, consequently, their GC content remained the same), it may not be detrimental to the organism. AT-biased mutational pressure, resulting from cytosine deamination  or oxidative damage to C and G bases , counteracts the influence of recombination; and in most asexually-reproducing species and self-pollinating plants, AT bias is the winning process. Our analysis from aligning indica and japonica, as well as earlier publications , indicates that genomic regions under higher selective pressure are more frequently recombining and therefore increase their GC3 content. This mechanism may explain the pronounced differences in GC3 between A. thaliana and its closest relatives. Comparison of the nucleotide compositions of coding regions in A. thaliana, R. sativus, B. rapa, and B. napus reveals that the GC3 values of R. sativus, B. rapa, and B. napus genes are on average 0.05 higher than those of the corresponding A. thaliana orthologs . An important difference between A. thaliana and Brassica and Raphanus is that the latter two genera are self-incompatible, whereas A. thaliana is self-pollinating. Self-pollination in Arabidopsis keeps its recombination rates low and thus reduces the GC3 content of its genes.
Self-pollination is also reported in some grasses such as wheat, barley and oats. Analysis of recombination in wheat  showed that the genome contains areas of high and low recombination. Grasses have an efficient reproductive mechanism and high genetic variability that enables them to adapt to different climates and soil types [69, 70]. We hypothesize that since self-pollination generally lowers recombination rates, evolutionary pressure will selectively maintain high recombination rates for some genes. Analysis of highly recombinogenic genomic regions of wheat, barley, maize and oat identified several genes of agronomic importance in these regions (including resistance genes against obligate biotrophs and genes encoding seed storage proteins) . In addition to the methylation-driven growth of high-GC3, we hypothesize that developing GC3 richness in some genes may, if it is not balanced by AT-bias, work as a feed-forward mechanism. Once it appears in genes under selective pressure, it provides additional transcriptional advantage. GC pairs differ from AT pairs since guanine binds to cytosine with three hydrogen bonds, while adenine forms only two bonds with thymine. This additional hydrogen bond makes GC pairs more stable and GC-rich genes will have different biochemical properties from AT-rich genes. When an AT pair is replaced by a GC pair in the third position of a codon, the protein sequence remains unchanged but an additional hydrogen bond is introduced. This additional bond can make transcription more efficient and reliable, change the array of RNA binding proteins, or significantly alter the three-dimensional folding of the messenger RNA. In this case, those plant species that thrive and adapt successfully to harsh environments demonstrate a strong preference for G or C in the third position of the codon.
High GC3 content provides more targets for methylation. The correlation between methylation and GC3 is supported by Stayssman et al. , who reported a positive correlation between methylation of internal unmethylated regions and expression of the host gene. In this paper we have demonstrated a positive correlation between GC3 and variability of gene expression; we also found that high-GC3 genes are more enriched in CG than the low-GC3 class. Therefore, high-GC3 genes provide more targets for de novo methylation, which can serve as an additional mechanism of transcriptional regulation and affect the variability of gene expression. Additional transcriptional regulation makes species more adaptable to external stresses.
Grasses have undergone several genome duplications. Genomic regions varied in their recombination rates and GC3 contents. Since high GC3 content in a gene provided an evolutionary advantage, this was frequently the sole copy retained in grasses. This may explain why genes in the high-GC3 class frequently lack paralogs. High-GC3 genes provide an evolutionary advantage owing to their optimized codon usage and to the existence of methylation targets allowing for an additional mechanism of transcriptional regulation. Therefore, the high-GC3 class of genes has been maintained in grasses for generations.
In this paper we combine a variety of prior observations and insights on GC3 biology with new observations using larger genome data sets to establish a unifying framework of hypotheses to explain all the available data fully. This framework consists of evolutionary forces and sexual reproduction patterns to justify a wide variety of observed codon usage patterns in plants and animals. These evolutionary forces are realized through introducing new mutations during meiotic recombination and fixation with the help of DNA methylation and transcriptional mechanisms. The presence of GC3-rich genes is not likely to be a consequence of chromosomal isochores or horizontal gene transfer. Regardless of their initial origin, high-GC3 genes in recombining species possessed a self-maintaining mechanism that over time could only increase their drift towards even higher GC3 values. This uncompensated drift may explain the pronounced bimodality of some rapidly-evolving species. Competing forces acting in grasses make GC3 distribution distinctly bimodal; genes in the high-GC3 class are more transcriptionally regulated, provide more targets for methylation and accumulate more mutations than genes in the low-GC3 class.
In our analysis, we concentrated on those plant species that benefit from complete sets of full-length cDNAs and sequenced (complete or nearly complete) genomic data. We used the following species: O. sativa, S. bicolor, A. thaliana, C. reinhardtii, Z. mays, D. rerio, M. musculus and H. sapiens. O. sativa genes and genomic sequences were downloaded from the Rice Genome Annotation project; after exclusion of all transposon-like genes and genes without full-length cDNA support we obtained a final set of 16,497 genes. Rice promoter sequences were downloaded from the Osiris database ; positions of Transcription Start Sites were refined using the TSSer algorithm . Rice microarray data were obtained from NCBI, Gene Expression Omnibus, platform GPL2025. We used two measures of expression: average intensity and standard deviation across 106 series of gene expression measurements. We used the recently published sequence and annotation data from the Joint Genome Institute for C. reinhardtii and S. bicolor (27,640), released 08/28/2008  and 10/28/2008  respectively. A. thaliana genes (27,741) were downloaded from The Arabidopsis Information Resource. Collections of D. rerio, M. musculus and H. sapiens sequences were taken from NCBI. Z. mays sequences were obtained from J. Craig Venter Institute. The remainder of the plant transcripts for the Poaceae family (aka grasses) were downloaded from TIGR Plant Transcript Assemblies. We used the frequency of single nucleotide polymorphisms per 1-kb gene length, obtained from the Plant Genome Mapping Laboratory, University of Georgia , as a crude proxy for the local recombination rate in rice. Supplementary figures and tables are available at http://model.research.glam.ac.uk/projects/glacombio/GC3/.
Calculation of z-scores
For each gene, GC3 values and the standard deviation of log-transformed gene expression values were computed across all experiments. Genome-wide distributions of both GC3 and gene expression are approximately normal. For each of these measures, the parameters μ (mean) and σ (standard deviation) of the corresponding normal distributions were determined. The standard deviations of gene expression and GC3 values were converted to z-scores, $z = (x - \mu)/\sigma$, and the standardized scores were plotted.
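The standardization described in this section can be sketched in a few lines. The function names are hypothetical, and two details are assumptions rather than statements from the text: a base-2 logarithm is used for the expression values, and the population form of the standard deviation (dividing by n) is used throughout.

```python
from math import log2, sqrt


def population_sd(values):
    """Standard deviation using the population formula (divide by n)."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def expression_variability(intensities):
    """Standard deviation of log-transformed intensities for one gene."""
    return population_sd([log2(x) for x in intensities])


def z_scores(values):
    """Convert genome-wide values to z = (x - mu) / sigma."""
    mu = sum(values) / len(values)
    sigma = population_sd(values)
    return [(v - mu) / sigma for v in values]


# Toy input: three genes with made-up GC3 values.
toy_z = z_scores([0.45, 0.60, 0.90])
```

After this transformation both measures are on the same unitless scale with mean 0 and standard deviation 1, which is what allows GC3 and expression variability to be plotted against each other directly.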
Calculation of relative abundance
Relative abundance was calculated according to , in which it was observed that the profiles of relative dinucleotide abundance values (genome signatures) are equivalent to the "general design" of organisms, and closely-related species have similar genome signatures. The computational formulae for di- and tri-nucleotide relative abundance values are $\rho_{XY} = f_{XY}/(f_X f_Y)$ and $\rho_{XYZ} = f_{XYZ} f_X f_Y f_Z/(f_{XY} f_{YZ} f_{XNZ})$, where f denotes the observed frequency of the indicated nucleotide or oligonucleotide, N stands for any nucleotide and W denotes A or T. As demonstrated by , the ratio of observed to expected CpG frequency underestimates the real CpG deficiency in GC-rich sequences: because the formula is non-linear, an identical fraction of mutated CpG in high- and low-GC classes of genes results in artificially higher values of $\rho_{CG}$ for the former than the latter. The authors suggested the use of a threshold of $\rho_{CG}$ as a function of G+C frequency to assess the presence of unmethylated sites, which can be calculated using the following formula: . In order to take the influence of this mathematical artifact into account in addition to the original relative abundance values, we also considered GC-corrected values defined as .
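For orientation, the sketch below computes a dinucleotide relative abundance as the ratio of observed to expected frequency, that is, the dinucleotide frequency divided by the product of its two mononucleotide frequencies. This observed-over-expected form is assumed here from the description in the text; the function name is hypothetical, only a single strand is counted (no strand symmetrization), and the GC-corrected variant mentioned above is not implemented because its formula is not given.

```python
def relative_abundance(seq: str, dinucleotide: str) -> float:
    """Observed/expected ratio for one dinucleotide on a single strand.

    rho_XY = f_XY / (f_X * f_Y), where f_XY is counted over all
    overlapping positions of the sequence.
    """
    seq = seq.upper()
    dinucleotide = dinucleotide.upper()
    if len(dinucleotide) != 2 or len(seq) < 2:
        raise ValueError("need a dinucleotide and a sequence of length >= 2")
    f_x = seq.count(dinucleotide[0]) / len(seq)
    f_y = seq.count(dinucleotide[1]) / len(seq)
    if f_x == 0 or f_y == 0:
        raise ValueError("a constituent nucleotide is absent from the sequence")
    # Count overlapping occurrences; str.count would miss overlaps such as 'AAA'.
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide)
    f_xy = hits / (len(seq) - 1)
    return f_xy / (f_x * f_y)


# Toy sequence: C and G each make up half of the bases, and CG occurs once
# among the three dinucleotide positions (CC, CG, GG).
toy_rho_cg = relative_abundance("CCGG", "CG")
```

Values below 1 indicate that the dinucleotide is rarer than expected from base composition alone, which for CG is commonly read as a sign of CpG depletion.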
Principal Component Analysis
Principal Component Analysis (PCA) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The data are represented in a new coordinate system such that the greatest variance of the data lies on the first principal component, the second greatest variance on the second, and so on. Our approach was generally similar to that of Chen et al.: for each gene $i$ of O. sativa we calculated the codon frequency $c_{i,m(w)}$, where $m(w)$ stands for the $w$-th codon for amino acid $m$, and applied PCA (using the princomp function in R).
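The input to the PCA step can be sketched as follows. This Python example builds, for one coding sequence, a vector of codon frequencies indexed by amino acid and synonymous codon. It rests on assumptions that the text does not spell out: the standard genetic code is used, each codon count is normalized by the total count of its amino acid, stop codons are excluded, and amino acids absent from a gene receive a frequency of 0. The names `CODON_TABLE`, `SYNONYMS` and `codon_frequencies` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group the 61 sense codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS[aa].append(codon)


def codon_frequencies(cds: str) -> dict:
    """Frequency of each sense codon among the codons for the same amino acid.

    Assumption: amino acids that do not occur in the gene get frequency 0.0.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    freqs = {}
    for aa, codons in SYNONYMS.items():
        total = sum(counts[c] for c in codons)
        for c in codons:
            freqs[c] = counts[c] / total if total else 0.0
    return freqs
```

Stacking one such 61-element vector per gene, with the codons in a fixed order, gives a gene-by-codon matrix of the kind that a PCA routine such as R's princomp, named in the text, takes as input.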
Authors' information
TT received her PhD in Applied Mathematics from the University of Southern California. Currently she is a Senior Lecturer in Statistics, University of Glamorgan, Wales. Prior to accepting this post she worked as a computational scientist for the biotechnology company Ceres, Loyola Marymount University and Georgia Institute of Technology. TT developed algorithms for gene expression analysis, promoter motif discovery and genome annotation.
NA is a Senior Computational Scientist at Ceres. He received his PhD in Molecular Biology from VNII Genetika, Russia. He was a postdoctoral researcher at Kyoto University, Japan and later at NCI/NIH, and then a computational scientist at Amgen. He has done computational work on discovery of promoter motifs, protein structure, fold recognition and lead discovery.
JB earned his PhD in Molecular Biology and Genetics from the University of Pennsylvania and his BS in Computer Science from the University of Wisconsin. He has a long-standing interest in genomics and has led bioinformatics and informatics efforts at the Human Genome Sequencing Center (Baylor College of Medicine), a biopharmaceutical company (UCB Pharma), and most recently at an agricultural biotechnology company (Ceres, Inc.).
KAF received his PhD in Genetics from Ohio State University. Upon graduation, he held positions in two companies, and later moved to the Dept of Plant Sciences at the University of Arizona. After accepting a position to start up a genomics company, Ceres, in 1997, KAF led the company's sequencing strategy, which resulted in the largest number of plant cDNAs that had ever been sequenced. Working with computational biologists at Ceres, including the three co-authors on this paper, he helped use these cDNAs to advance our understanding of plant transcriptomes. Currently KAF is the Director of the School of Plant Sciences at the University of Arizona.
Acknowledgements
We are grateful to Paul Burns, Paul Messenger, and BioMedES editorial services for proofreading the manuscript. We would like to thank our colleagues from Ceres, Inc, Loyola Marymount University, Georgia Institute of Technology, and University of Glamorgan for fruitful discussions and a supportive environment.
References
- Campbell W, Gowri G: Codon Usage in Higher Plants, Green Algae, and Cyanobacteria. Plant Physiology. 1990, 92: 1-11. 10.1104/pp.92.1.1.
- Duret L, Mouchiroud D, Gautier C: Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. Journal of Molecular Evolution. 1995, 40 (3): 308-17. 10.1007/BF00163235.
- Carels N, Bernardi G: Two classes of genes in plants. Genetics. 2000, 154: 1819-1825.
- Lescot M: Insights into the Musa genome: syntenic relationships to rice and between Musa. BMC Genomics. 2008, 9: 58. 10.1186/1471-2164-9-58.
- Paterson A, Bowers J, Feltus F, Tang H, Lin L, Wang X: Comparative Genomics of Grasses Promises a Bountiful Harvest. Plant Physiology. 2009, 149: 125-131.
- Macaya G, Thiery JP, Bernardi G: An approach to the organization of eukaryotic genomes at a macromolecular level. J Mol Biol. 1976, 108 (1): 237-54. 10.1016/S0022-2836(76)80105-2.
- Bernardi G: Isochores and the evolutionary genomics of vertebrates. Gene. 2000, 3-17. 10.1016/S0378-1119(99)00485-0.
- Hughes S, Zelus D, Mouchiroud D: Warm-blooded isochore structure in Nile crocodile and turtle. Mol Biol Evol. 1999, 16: 1521-1527.
- Cammarano R, Constantini M, Bernardi G: The isochore patterns of invertebrate genomes. BMC Genomics. 2009, 10 (538): 10.1186/1471-2164-10-538.
- Matassi G, Montero LM, Salinas J, Bernardi G: The isochore organization and the compositional distribution of homologous coding sequence in the nuclear genome of plants. Nucleic Acids Research. 1989, 17 (13): 5273-5290. 10.1093/nar/17.13.5273.
- Montero LM, Salinas J, Matassi G, Bernardi G: Gene distribution and isochore organization in the nuclear. Nucleic Acids Research. 1990, 18 (7): 1859-1867. 10.1093/nar/18.7.1859.
- Salinas J, Matassi G, Montero L, Bernardi G: Compositional compartmentalization and compositional patterns in the nuclear genomes of plants. Nucleic Acids Research. 1988, 16 (10): 4269-4285. 10.1093/nar/16.10.4269.
- Press WH, Robins H: Isochores Exhibit Evidence of Genes Interacting with the Large-scale Genomic Environment. Genetics. 2006, 174: 1029-1040. 10.1534/genetics.105.054445.
- Clay O, Caccio S, Zoubak Z, Mouchiroud D, Bernardi G: Human Coding and Non-coding DNA: compositional correlations. Mol Phys Evol. 1996, 5: 2-12. 10.1006/mpev.1996.0002.
- Mouchiroud D, D'Onofrio G, Aïssani B, Macaya G, Gautier C, Bernardi G: The distribution of genes in the human genome. Gene. 1991, 100: 181-7. 10.1016/0378-1119(91)90364-H.
- Elhaik E, Landan G, Graur D: Can GC content at third-codon positions be used as a proxy for isochore composition?. Mol Biol Evol. 2009, 26 (8): 1829-33. 10.1093/molbev/msp100.
- Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N: Vanishing GC-rich isochores in mammalian genomes. Genetics. 2002, 162 (4): 1837-47.
- Wang HC, Hickey DA: Rapid divergence of codon usage patterns within the rice genome. BMC Evol Biol. 2007, 7 (Suppl 1): S6. 10.1186/1471-2148-7-S1-S6.
- Duret L, Mouchiroud D: Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci USA. 1999, 96: 4482-4487. 10.1073/pnas.96.8.4482.
- Karlin S, Mrazek J, Campbell AM: Codon Usages in different gene classes of the Escherichia coli genome. Mol Microbiol. 1998, 29 (6): 1341-55. 10.1046/j.1365-2958.1998.01008.x.
- Hooper SD, Berg OG: Gradients in nucleotide and codon usage along E. coli genome. Nucleic Acids Research. 2000, 28 (18): 3517-3523. 10.1093/nar/28.18.3517.
- Roymondal U, Das S, Sahoo S: Predicting gene expression level from relative codon usage bias: an application to Escherichia coli genome. DNA Research. 2009, 8: 1-18. 10.1016/j.dnarep.2008.11.002.
- Alexandrov N, Brover V, Freidin S, Troukhan M, Tatarinova T, Zhang H, Swaller T, Lu Y, Bouck J, Flavell R: Insights into corn genes derived from large-scale cDNA sequencing. Plant Mol Biol. 2009, 69 (1-2): 179-94. 10.1007/s11103-008-9415-4.
- Chiaromonte F, Miller W, Bouhassira EE: Gene Length and Proximity to Neighbors Affect Genome-Wide Expression Levels. Genome Res. 2003, 13: 2602-2608. 10.1101/gr.1169203.
- Ren XY, Vorst O, Fiers M, Stiekema W, Nap JP: In plants, highly expressed genes are the least compact. Trends in Genetics. 2006, 10: 528-532. 10.1016/j.tig.2006.08.008.
- Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA: Selection for short introns in highly expressed genes. Nature Genetics. 2002, 31: 415-418.
- Lawniczak M, Holloway A, Begun DJ: Genomic analysis of the relationship between gene expression variation and DNA polymorphism in Drosophila simulans. Genome Biol. 2008, 9 (8): R125. 10.1186/gb-2008-9-8-r125.
- Jeffares DC, Penkett CJ, Bähler J: Rapidly regulated genes are intron poor. Trends in Genetics. 2008, 24 (8): 375-378. 10.1016/j.tig.2008.05.006.
- Gouy M, Gautier C: Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Research. 1982, 10: 7055-7074. 10.1093/nar/10.22.7055.
- Jabbari K, Cruveiller S, Clay O, Bernardi G: The correlation between GC3 and hydropathy in human genes. Gene. 2003, 317 (1-2): 137-40. 10.1016/S0378-1119(03)00663-2.
- Arhondakis S, Clay O, Bernardi G: GC level and expression of human coding sequences. BBRC. 2008, 367: 542-545.
- Dekker J: GC- and AT-rich chromatin domains differ in conformation and histone modification status and are differentially modulated by Rpd3p. Genome Biology. 2007, 8 (6): R116. 10.1186/gb-2007-8-6-r116.
- D'Onofrio G, Ghosh TC, Saccone S: Different functional classes of genes are characterized by different compositional properties. FEBS. 2007, 581 (30): 5819-24. 10.1016/j.febslet.2007.11.052.
- Chiapello H, Lisacek F, Caboche M, Henaut A: Codon usage and gene function are related in sequences of Arabidopsis thaliana. Gene. 1998, 209 (1-2): GC1-GC38. 10.1016/S0378-1119(97)00671-9.
- Wong GK, Wang J, Tao L, Tan J, Zhang J, Passey DA, Yu J: Compositional gradients in Gramineae genes. Genome Res. 2002, 12: 851-856. 10.1101/gr.189102.
- Tatarinova T, Brover V, Troukhan M, Alexandrov N: Skew in CG content near the transcription start site in Arabidopsis thaliana. Bioinformatics. 2003, 19 (Suppl 1): i313-4. 10.1093/bioinformatics/btg1043.
- Fujimori S, Washio T, Tomita M: GC-compositional strand bias around transcription start sites in plants and fungi. BMC Genomics. 2005, 6 (26):
- Bell SJ, Chow YC, Ho JYK, Forsdyke DR: Correlation of Chi orientation with transcription indicates a fundamental relationship between recombination and transcription. Gene. 1998, 216: 285-292. 10.1016/S0378-1119(98)00333-3.
- Dang KD, Dutt PB, Forsdyke DR: Chargaff difference analysis of the bithorax complex of Drosophila melanogaster. Biochem Cell Biol. 1998, 76: 129-137. 10.1139/bcb-76-1-129.
- Lao PJ, Forsdyke DR: Thermophilic Bacteria Strictly Obey Szybalski's Transcription Direction Rule and Politely Purine-Load RNAs with Both Adenine and Guanine. Genome Research. 2000, 10: 228-236. 10.1101/gr.10.2.228.
- Wang X, Shi X, Hao B, Ge S, Luo J: Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytologist. 2005, 165 (3): 937-946. 10.1111/j.1469-8137.2004.01293.x.
- Holmquist GP: Chromosome bands, their chromatin flavors, and their functional features. American Journal of Human Genetics. 1992, 51 (1): 17-37.
- Birdsell J: Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Molecular Biology and Evolution. 2002, 19: 1181-1197.
- Marais G: Biased gene conversion: implications for genome and sex evolution. Trends in Genetics. 2003, 19 (6): 330-338. 10.1016/S0168-9525(03)00116-1.
- Galtier N, Duret L, Glemin S, Ranwez V: GC-biased gene conversion promotes the fixation of deleterious amino acid changes in primates. Trends in Genetics. 2008, 25 (1):
- Wang HC, Singer GA, Hickey DA: Mutational bias affects protein evolution in flowering plants. Molecular Biology and Evolution. 2004, 21 (1): 90-96. 10.1093/molbev/msh003.
- Bernardi G: Structural and evolutionary genomics: natural selection in genome evolution. 2004, Amsterdam, Elsevier
- Vinogradov AE: Isochores and tissue specificity. Nucleic Acids Research. 2003, 31: 5212-5220. 10.1093/nar/gkg699.
- Urrutia AO, Hurst LD: The signature of selection mediated by expression of human genes. Genome Research. 2003, 13: 2230-2264. 10.1101/gr.641103.
- Molina C, Grotewold E: Genome wide analysis of Arabidopsis core promoters. BMC Genomics. 2005, 6 (1): 10.1186/1471-2164-6-25.
- Yang C, Bolotin E, Jiang T, Sladek FM, Martinez M: Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene. 2007, 389: 52-65. 10.1016/j.gene.2006.09.029.
- Moshonov S, Elfakess R, Golan-Mashiach M, Sinvani H, Dikstein R: Links between core promoter and basic gene features influence gene expression. BMC Genomics. 2008, 9 (92):
- Troukhan M, Tatarinova T, Bouck J, Flavell R, Alexandrov N: Genome-wide discovery of cis-elements in promoter sequences using gene expression data. Omics. 2009, 13 (1):
- Liu Q, Dou S, Ji Z, Xue Q: Synonymous codon usage and gene function are strongly related in Oryza sativa. Biosystems. 2005, 80 (2): 123-131. 10.1016/j.biosystems.2004.10.008.
- Kalisz S, Purugganan MD: Epialleles via DNA methylation: consequences for plant evolution. Trends Ecol Evol. 2004, 19 (6): 309-14. 10.1016/j.tree.2004.03.034.
- Straussman R, Nejman D, Roberts D, Steinfeld I, Blum B, Benvenisty N, Simon I, Yakhini Z, Cedar H: Developmental programming of CpG island methylation profiles in the human genome. Nature Structural & Molecular Biology. 2009, 16 (5): 564-571. 10.1038/nsmb.1594.
- Tran R, Henikoff J, Zilberman D, Ditt R, Jacobsen S, Henikoff S: DNA methylation profiling identifies CG methylation clusters in Arabidopsis genes. Curr Biol. 2005, 15 (2): 154-9. 10.1016/j.cub.2005.01.008.
- Chela-Flores J, Espejo Acuña C: On the possible effects of homeostatic shifts in human embryonic development. Acta Biotheor. 1990, 38 (2): 135-42. 10.1007/BF00047550.
- Pradhan S, Urwin N, Jenkins G, Adams R: Effect of CWG methylation on expression of plant genes. Biochem J. 1999, 341: 473-476. 10.1042/0264-6021:3410473.
- Ikemura T, Wada K: Evident diversity of codon usage patterns of human genes with respect to chromosome banding patterns and chromosome numbers; relation between nucleotide sequence data and cytogenic data. Nucleic Acids Research. 1991, 19 (16): 4333-9. 10.1093/nar/19.16.4333.
- Eyre-Walker A: Recombination and mammalian genome evolution. Proc Biol Sci. 1993, 252 (1335): 237-43. 10.1098/rspb.1993.0071.
- Hurst L, Williams E: Covariation of GC content and the silent site substitution rate in rodents: implications for methodology and for the evolution of isochores. Gene. 2000, 261 (1): 107-14. 10.1016/S0378-1119(00)00489-3.
- Springer N, Kaeppler S: Evolutionary Divergence of Monocot and Dicot Methyl-CpG-Binding Domain Proteins. Plant Physiology. 2005, 138 (1): 92-104. 10.1104/pp.105.060566.
- Galtier N, Piganeau G, Mouchiroud D, Duret L: GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics. 2001, 159: 907-911.
- Coulondre C, Miller JH, Farabaugh PJ, Gilbert W: Molecular Basis of base substitution hotspots in E. coli. Nature. 1978, 274: 775-780. 10.1038/274775a0.
- Newcomb T, Loeb L: Oxidative DNA damage and mutagenesis. DNA damage and repair, DNA repair in prokaryotes and lower eukaryotes. Edited by: Nikoloff J, Hoekstra F. 1998, Humana Press, 1: 65-84.
- Villagomez L, Tatarinova T, Kuleck G: Ecological genomics: Construction of molecular pathways responsible for gene regulation and adaptation to heavy metal stress in Arabidopsis thaliana and Raphanus sativus. 2009, ISMB/ECCB; Stockholm
- Dvorak J, Yang ZL, You F, Luo M: Deletion polymorphisms in wheat chromosome regions with contrasting recombination rates. Genetics. 2004, 168 (3): 1665-1675. 10.1534/genetics.103.024927.
- Keller B, Feuillet C: Colinearity and gene density in grass genomes. Trends in Plant Science. 2000, 5 (6): 10.1016/S1360-1385(00)01629-0.
- Levy A, Feldman M: The impact of polyploidy on grass genome evolution. Plant Physiology. 2002, 130: 1587-1593. 10.1104/pp.015727.
- Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L: The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Research. 2007, 35: D883-D887. 10.1093/nar/gkl976.
- Morris RT, O'Connor TR, Wyrick JJ: Osiris: an integrated promoter database for Oryza sativa L. Bioinformatics. 2008, 24 (24): 2915-2917. 10.1093/bioinformatics/btn537.
- Merchant SS, Prochnik SE, Vallon O, Harris EH, Karpowicz SJ, Witman GB, Terry A, Salamov A, Fritz-Laylin LK, Maréchal-Drouard L: The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science. 2007, 318 (5848): 245-50. 10.1126/science.1143609.
- Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A: The Sorghum bicolor genome and the diversification of grasses. Nature. 2009, 457 (7229): 551-6. 10.1038/nature07723.
- Childs KL, Hamilton J, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz P, Town C, Buell A, Chan C: The TIGR Plant Transcript Assemblies database. Nucleic Acids Res. 2007, D846-51. 10.1093/nar/gkl785.
- Feltus A, Wan J, Schulze S, Estill J, Jiang N, Paterson A: An SNP resource for rice genetics and breeding based on subspecies indica and japonica genome alignment. Genome Res. 2004, 14: 1812-1819. 10.1101/gr.2479404.
- Karlin S, Mrazek J: Compositional differences within and between eukaryotic genomes. PNAS USA. 1997, 10227-10232. 10.1073/pnas.94.19.10227.
- Duret L, Galtier N: The Covariation Between TpA Deficiency, CpG Deficiency, and G+C Content of Human Isochores Is Due to a Mathematical Artifact. Molecular Biology and Evolution. 2000, 17: 1620-1625.
- Jolliffe IT: Principal Component Analysis. New York, Springer Series in Statistics. 2002
- Chen S, Lee W, Hottes A, Shapiro L, McAdams H: Codon usage between genomes is constrained by genome-wide mutational processes. PNAS USA. 2004, 101 (10): 3480-3485. 10.1073/pnas.0307827100.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.