GC3 biology in corn, rice, sorghum and other grasses
© Tatarinova et al. 2010
Received: 16 September 2009
Accepted: 16 May 2010
Published: 16 May 2010
Skip to main content
© Tatarinova et al. 2010
Received: 16 September 2009
Accepted: 16 May 2010
Published: 16 May 2010
The third, or wobble, position in a codon provides a high degree of possible degeneracy and is an elegant fault-tolerance mechanism. Nucleotide biases between organisms at the wobble position have been documented and correlated with the abundances of the complementary tRNAs. We and others have noticed a bias for cytosine and guanine at the third position in a subset of transcripts within a single organism. The bias is present in some plant species and warm-blooded vertebrates but not in all plants, or in invertebrates or cold-blooded vertebrates.
Here we demonstrate that in certain organisms the amount of GC at the wobble position (GC3) can be used to distinguish two classes of genes. We highlight the following features of genes with high GC3 content: they (1) provide more targets for methylation, (2) exhibit more variable expression, (3) more frequently possess upstream TATA boxes, (4) are predominant in certain classes of genes (e.g., stress responsive genes) and (5) have a GC3 content that increases from 5'to 3'. These observations led us to formulate a hypothesis to explain GC3 bimodality in grasses.
Our findings suggest that high levels of GC3 typify a class of genes whose expression is regulated through DNA methylation or are a legacy of accelerated evolution through gene conversion. We discuss the three most probable explanations for GC3 bimodality: biased gene conversion, transcriptional and translational advantage and gene methylation.
Since bimodality has been detected in only some plant families, we suggest that this feature has developed independently in warm-blooded animals and in certain members of the commelinids clade. The GC3 bias could possibly be explained as a consequence of some larger genomic bias. For example, over three decades ago, Macaya et al.  observed that some genomes contain isochores, megabase-long regions with either high or low GC contents. Isochores have been reported in warm-blooded vertebrates and in some reptiles [7–9]. Compositionally homogenous DNA regions of at least 50-100 kb have been found in several dicot and monocot genomes (pea, sunflower, tobacco, barley, rice, maize, oat and wheat), supporting the existence of isochores in plants [10–12]. It is not yet known whether all eukaryotic genomes are characterised by an isochore structure .
Press and Robins  reported that high GC isochores contain a mixture of GC- and AT-rich genes, whereas high AT (low GC) isochores contain mostly AT-rich genes. Genes found within high and low GC isochores are functionally distinguishable by statistical analysis of their gene ontology categories . The authors suggested that some genes require AT-richness, while others, contained within large coherent blocks, have a strong bias towards mutations to GC.
The neutral theory of evolution states that for a change to come about in the population as a whole, the new characteristic must be as good as or better than the old one. Under the assumption of neutrality, genes would acquire characteristics of the surrounding isochores. Therefore, noncritical elements such as synonymous bases in 3rd codon positions and 5' and 3' UTRs should be GC-rich in high GC isochores. In fact, several groups have found a positive correlation between the GC3 levels of a gene and of its surrounding genomic area [2, 13, 14]. Mouchiroud et al.  found an 8-fold enrichment for high-GC3 genes within the top 3% of the GC-richest isochores in humans. These observations support the neutrality assumption. Elhaik et al. , however, found little correlation between GC3 and isochores within a species and none between species. Furthermore, the correlation with generally GC-rich areas is only modest (R2 = 0.43) , suggesting that a more complex explanation must be sought. Moreover, isochores have been reported in both GC3 unimodal and bimodal organisms and therefore cannot provide an exclusive explanation for GC3 bimodality.
Campbell and Gowri  described differences in codon usage in different plant genomes, algae and cyanobacteria, and showed that bimodality existed only in monocots. In a series of publications [10, 11], GC3 levels were analyzed for five Poaceae and three dicot species. It was found that compositional patterns in the dicot species resembled those of cold-blooded vertebrates, while the grasses resembled warm-blooded vertebrates. Bimodality of GC3 distribution in grasses, and specifically in rice, was reported by Carels and Bernardi , Wang and Hickey  and Salinas et al. . These authors explained the differences in codon usage among some rice genes by a rapid evolutionary increase in GC content. They gave two possible explanations for the observed bi-modality: (1) positive Darwinian selection, acting at the level of translational efficiency; and (2) neutral mutational bias.
Several characteristics related to high GC3 genes have been observed to date. Duret et al.  examined vertebrate sequences and described two properties of high-GC3 genes: the proteins are generally shorter, and introns are either absent or short in comparison to low-GC3 genes. Carels and Bernardi  compared genes in plants with generally high GC content to those with generally lower GC content. Although the differences were most prominent in Gramineae, they observed that other families of plants including dicots (e.g. Brassicaceae and Fabaceae) could be segregated by GC distribution. They also observed the tendency towards short or no introns in GC-rich genes and identified a correlation between GC content, intron size and location among homologs across species. Duret et al.  reported a small correlation between GC3 and the general GC richness of the surrounding >10 kb of genomic sequence. The relationship between gene length and GC3 for many organisms has been analyzed in a number of publications during the last decade [18–21]. Gene lengths in C. elegans, D. melanogaster, A. thaliana  and O. sativa  are negatively associated with GC3. Shorter genes in bacteria tend to have more variable expression levels, and selective pressure on codon usage is also higher in shorter genes . It was recently demostrated that corn genes with high GC3 tend to be mono-exonic . It has been reported that shorter and intron-poor genes have either stronger [24–26] or more variable [27, 28] expression levels because introns can delay regulatory responses and are selected against in genes whose transcripts require rapid adjustment for survival of environmental challenges . Ren et al.  showed opposite trends in plant and animal genomes; highly expressed genes tend to be longer in plants and shorter in animals. A recent paper by Jeffares et al.  proposed a reconciliation of these observations: both plants and animals show consistent inverse relationships between intron density (defined as intron number/unspliced transcript length) and rapid regulation (measured as the fastest rate of change of gene expression intensity in a time course experiment).
An influence of translation on codon bias has been proposed on the basis of increased hydrogen bonding and hence strength of G-C pairing in contrast to A-T pairing. This increased pairing may improve transcript stability at the mRNA level or improve the speed or fidelity of translation, thereby improving protein production, as has been shown in a number of species including bacteria and some eukaryotes . This is supported by the analysis of Campbell and Gowri , who studied codon usage in plants and found two groups of genes that had preferences for GC-ending codons in monocots but not dicots. Additionally, Jabbari et al.  found a correlation between high-GC genes and amino acid hydropathy. However, Wang and Hickey  used concordance analysis of synonymous and non-synonymous differences to show that the primary effect is not at the codon or protein level.
Several groups [3, 14, 18, 22, 31] have suggested that the effect of high or low GC3 may be at the level of transcription. The generally shorter introns and coding sequences of high-GC3 genes led Carels and Bernardi  to suggest that selective pressure has driven housekeeping and non-regulated genes to higher GC contents while the longer AT-rich genes have been maintained to provide more opportunity for regulation and alternative splicing. Clay et al.  looked at CpG islands upstream of GC-rich and GC-poor transcripts and found little correlation. Nevertheless, the observation of higher GC within the introns of GC3 transcripts as well as the 5' region, and the weak correlation between general genomic GC content and GC3 level, suggests that the transcriptional machinery may be involved.
Conflicting ideas about codon usage bias and expression levels have been published. Wang and Hickey  reported that codon bias is not correlated with gene expression. Using S. cerevisiae expression and sequence data, Dekker  showed that on average, GC-rich genes are significantly more transcriptionally active than AT-rich genes. A recent paper by Roymondal et al.  presented an expression measure of a gene, devised to predict the level of gene expression from relative codon bias. They suggested that since the bias is caused by the presence of optimal codons that are recognized by the most abundant tRNA species, the high-GC3 peak appears as a manifestation of natural selection acting in grasses and warm-blooded vertebrates. This process shapes the codon usage patterns for selected genes to gain optimal expression levels in response to changing environments. Roymondal et al.  mentioned that within any genome, codon bias tends to be much stronger in highly expressed genes.
Attempts have been made to discover an association between functional classes of genes and GC3. Carels and Bernardi  characterized the high GC-containing transcripts as housekeeping and photosynthetic. D'Onofrio et al.  found GC3 to be higher in genes involved with cellular metabolism and lower in those involved with information storage processing. These observations are consistent with previous studies of general GC contents of genes in arabidopsis .
The existence of a codon usage gradient along the coding regions was previously discussed by Hooper et al. , who outlined the possible advantages of a positive GC3 gradient. Based on an analysis of E. coli genes, the authors suggested that G3-containing codons may be translated more quickly and with lower error rate than other codons, thus avoiding congestion at the ribosomes because of a gradual increase of speed of translation along the gene. Wong et al.  discovered that in the plant kingdom, O. sativa genes are richer in GC at the 5' end than at the 3' end. This gradient and imbalance in nucleotide strand composition extends beyond the coding region; transcription start sites are characterized by a pronounced peak in CG-skew [36, 37], and mRNAs tend to be purine-rich (A for low GC organisms and G for high GC organisms) [38, 39]. Avoidance of unnecessary 'kissing interactions' between and within mRNAs was mentioned by Lao et al.  as a possible explanation for purine loading. Species adapted to hotter environments have stronger selection pressure towards purine loading since nucleic acids are "stickier" at high temperatures . This effect is the most pronounced at the wobble position of codons.
Aside from transcriptional and translational influences, it is possible that the driver for differences in GC3 operates at a recombinational level. Gene duplication in the Poaceae has been mentioned as one possible explanation of GC3 bimodality . The authors suggested that duplicated genes in O. sativa can be partitioned into 10 blocks by chromosomal location; these blocks have significantly different synonymous substitution rates (Ks). Wang et al.  found that Ks was negatively correlated with the GC content at the third position of codons (correlation coefficient -0.455) and that the bimodal distribution of Ks was split into two unimodal distributions corresponding to high- and low-GC3 genes. Related to this idea are advances in understanding of the accelerated evolutionary rates of some genes. Holmquist  described a model in which hybridization of similar genes during recombination resulted in a bias toward higher GC content in the recombined areas. Birdsell  demonstrated that recombination significantly increases GC3 in a selectively neutral manner; the GC-biased mismatch repair system evolved in various organisms as a response to AT mutational bias. Birdsell  suggested that unimodal low-GC3 species may have prevailing AT mutational bias, random fixation of the most common types, or mutation or absence of GC-biased gene conversion . The authors hypothesized that recombination is more likely to occur within conserved and regulatory regions of the genome; therefore, introns, intergenic regions and pseudogenes tend to have lower GC contents than ORFs. Galtier et al.  noticed that GC-biased gene conversion, frequently accompanied by an increase in GC3, influenced the evolutionary trajectory of human proteins by promoting the fixation of deleterious AT→GC mutations. These observations raise the possibility that the high-GC3 class of genes might have appeared as a consequence of accelerated evolution.
With the increasing amount of genomic and transcript information available within the public databases as well as the improved understanding of gene conversion and gene regulatory mechanisms, we returned to the puzzle of GC3 bimodality in grasses in an effort to understand the significance of this phenomenon. We concentrate our discussion around Oryza sativa as it is one of the best-studied grass species at the genomic level.
We revisited the extent of variation of GC3 found in various species. In Figure 1 we have plotted the distributions of GC3 for 12 plant and animal species. Distributions of GC3 in H. sapiens, O. sativa, C. reinhardtii, and Z. mays are clearly bi-modal, A. cepa, A. thaliana, G. max, S. cerevisiae and C. elegans are unimodal, and B. napus, D. rerio and M. musculus have intermediate distributions. Uni- and bi-modality of GC3 distributions in various organisms have been reported previously [3, 18, 46] and our results are consistent with the earlier observations on the species tested.
Previous reports on GC3-rich genes have suggested that these are present in GC-rich regions of the genome, aka isochores [10, 12, 47]. The authors suggested that GC3 bimodality in grasses came about because these genes are located in regions of their respective genomes that differ in G+C content. Two decades of full genome sequencing and annotation of numerous plant genomes make it worthwhile to revisit the issue of codon usage in plants and plant isochore organization. In order to answer the question of isochores in grasses, we analyzed the GC contents of coding and non-coding sequences in O. sativa. Overall, the correlation of GC3 values between adjacent genes is 0.05, indicating that there is no significant clustering of these genes. We separated all mRNA-validated rice genes into two groups on the basis of GC3 content: the "low" group, where GC3<0.8, contains 11,608 genes; and the "high" group, where GC3 ≥ 0.8, contains 4,889 genes. The choice of cut-off point between the two groups was based on the position of the lowest GC3 value between the two peaks. (This approach is different from the one outlined in  and , where the two classes were distinguished by overall GC content. In those two studies, the average GC3 contents were 0.89 in the high group and 0.69 in the low group). We analyzed the spatial distribution of genes with high GC3 values. Of the 4,889 genes in the high group, 3,661 are evenly distributed across the genome; 485 genes (out of for the remaining 1,228) occur in 36 clumps of 10 or more genes (Additional file 1 : Supplementary Figure SF1 and Additional file 2 : Supplementary Table ST1). Five of these clumps are likely to result from relatively recent gene duplication, since they consist of genes with identical PFAM annotations. From the analysis of seven animal species, Elhaik et al.  inferred that GC3 can only explain a very small proportion of the variation in GC content of long genomic sequences flanking the genes, and correlations between GC3 and GC in the flanking region decayed rapidly with distance from the gene. Accordingly, we examined 1,000 nucleotides upstream of the 16,497 rice genes and also found no significant correlation in GC content between the open reading frames and flanking regions. The GC contents of the high and low groups gave nearly identical unimodal bell-shaped frequency distributions centered at GC = 0.4. These results suggest an absence of isochore organization in the rice genome and indicate that the high-GC3 genes are not closely associated with GC-rich regions in rice.
Genes in the middle of the GC3 spectrum (0.4<GC3<0.7) have a negative correlation with ORF length (Pearson's correlation coefficient = -0.3), whereas for genes in the high GC3 class and for those with GC3<0.4, it is approximately 0. As was previously observed , variability of gene expression is negatively correlated with intron density. We computed Pearson's correlation coefficient between GC3 and intron density for O. sativa and S. bicolor : for both grass species it is approximately -0.3 (Additional File 1 : Supplementary Figure SF3). Genes with high GC3 tend to be mono-exonic . This is consistent with our observation of a positive relationship between gene expression variability and GC3. On the basis of this evidence, we suggest that rapidly evolving genes are shorter, have more variable expression and are GC3-rich. More "evolutionarily stable" genes tend to accumulate introns and increase the ORF length.
Codon usage and gene classes
The first two nucleotides in a codon are more reflective of gene function than the third one. Using coding sequences of O. sativa, we computed average GC3 and GC12 for GO and FPAM annotations. The coefficient of variation for GC12 is approximately three times smaller than the coefficient of variation for GC3. However, the third position in the codon also affects gene function. Liu et al.  demonstrated that synonymous codon usage and gene function are strongly correlated in O. sativa; they found that genes involved in metabolic processes have a preference for C or G in the third position of a codon. Different PFAM families show affinity for high- or low-GC3 classes. For example, O. sativa genes annotated as "expressed proteins" are more prevalent in the low class (22% vs. 33%) and alpha-expansins are more prevalent in the high group (relative abundance is 46). Details are given in the Supplementary data (Additional File 2 : Supplementary Table ST2). It appears that GC3 increase tends to co-evolve in some PFAM families of grass genes across multiple organisms. The distribution of GC3 in histone, ribosomal and chrolophyll a-b binding protein coding genes are very similar for rice and corn. In both organisms, 80% of chrolophyll a-b binding proteins have GC3>0.85, ribosomal proteins are approximately normally distributed around GC3 = 0.65, and 60% of all histones have GC3>0.75. Another way to look at the relationship between gene category and GC3 is by considering GO annotation (see Additional File 2 : Supplementary Tables ST3-ST7). D. rerio, M. musculus, H. sapiens, C. reinhardtii, O. sativa and Z. mays have higher GC3 values than A. thaliana and we were curious to see if GC3 is consistent between these organisms and GO categories. The high-GC3 species also have consistently higher GC3 values for genes from the following GO classes: electron transport or energy pathways, response to abiotic or biotic stimuli, response to stress, transcription and signal transduction. Therefore, we conclude that certain classes of genes are characterized by high GC3 values across kingdoms.
Many previous studies have demonstrated a significant association between GC3 and recombination rate across different plant and animal species [41, 42, 44, 45, 60–62]. The conclusion is that high GC3 content in an organism indicates a recombining genome. Similarly, the presence of two distinct GC classes of genes may suggest the existence of recombining and non-recombining regions within that genome. To support this hypothesis, we computed the mutation rates of rice genes (see Methods). For our curated dataset of 16 K rice genes, we found a positive correlation between the density of SNPs per 1000 nucleotides and GC3 (R2 = 0.71, SNP = 1.114+0.583GC3). Association with overall GC content is much weaker, R2 = 0.32. Therefore, we conclude that high-GC3 genes accumulate more mutations and are located in the highly recombining regions of the rice genome.
Analysis of gene-specific codon usage bias shows that GC3 is the major characteristic of codon utilization in Poaceae. In order to demonstrate this, we used Principal Component Analysis (PCA) to find a basis for the space of codon vectors. Approximately 50% of the variance in codon usage is explained by the first principal component; this component has an almost perfect negative correlation (-0.98) with GC3. The remaining components contribute at most 4% each to the variance; the second principal component is weakly correlated to GC3 skew.
Deviations from unimodal bell-shaped distributions of GC3 appear in many species, but grasses have very pronounced bimodal distributions (Figure 1, Additional File 1 : Supplementary SF8 and SF9). Bimodality in warm-blooded vertebrates can be explained by the presence of isochores. Although there are many similarities between genes in high-GC human isochores and high-GC3 genes in grasses, the isochore hypothesis does not fully explain the existence of high-GC3 genes in grasses: first, there is no correlation between ORFs and the flanking regions; second, most species with isochores do not have a high-GC3 peak. Possible causes of bimodality may be elucidated by comparing genes in the high- and low-GC3 classes. These classes differ in nucleotide composition and composition gradients along coding regions. High-GC3 class genes have a significantly higher frequency of CG dinucleotides (potential targets for methylation); therefore, there is an additional regulatory mechanism for high-GC3 genes. Springer et al.  reported that out of eight classes of methyl-CpG-binding domain proteins present in dicots, only six exist in monocots, suggesting a difference between dicots and monocots in silencing of methylated genes.
Two competing processes may affect the frequency of methylation targets: the GC-based mismatch repair mechanism and AT-biased mutational pressure. In recombining organisms (e.g., grasses and warm-blooded vertebrates), the GC content of coding and regulatory regions is enhanced because of the action of the GC-based mismatch repair mechanism; this effect is especially pronounced for GC3 . Recombination has been shown to be a driving force for the increase in GC3 in many organisms . Repair (recombination) happens all over the genome with a certain precision, leading to an increase in GC. If repair did not occur in defence-related genes, the organism may fail to survive or to reproduce. If repair did not happen in less important genes (and, consequently, their GC content remained the same), it may not be detrimental to the organism. AT-biased mutational pressure, resulting from cytosine deamination  or oxidative damage to C and G bases , counteracts the influence of recombination; and in most asexually-reproducing species and self-pollinating plants, AT bias is the winning process. Our analysis from aligning indica and japonica, as well as earlier publications , indicate that genomic regions under higher selective pressure are more frequently recombining and therefore increase their GC3 content. This mechanism may explain the pronounced differences in GC3 between A. thaliana and its closest relatives. Comparison of the nucleotide compositions of coding regions in A. thaliana, R. sativus, B. rapa, and B. napus reveals that the GC3 values of R. sativus, B. rapa, and B. napus genes are on average 0.05 higher than those of the corresponding A. thaliana orthologs . An important difference between A. thaliana and Brassica and Raphanus is that the latter two genera are self-incompatible, whereas A. thaliana is self-pollinating. Self-pollination in arabidopsis keeps its recombination rates low and thus reduces the GC3 content of its genes. Self-pollination is also reported in some grasses such as wheat, barley and oats. Analysis of recombination in wheat  showed that the genome contains areas of high and low recombination. Grasses have an efficient reproductive mechanism and high genetic variability that enables them to adapt to different climates and soil types [69, 70]. We hypothesize that since self-pollination generally lowers recombination rates, evolutionary pressure will selectively maintain high recombination rates for some genes. Analysis of highly recombinogenic genomic regions of wheat, barley, maize and oat identified several genes of agronomic importance in these regions (including resistance genes against obligate biotrophs and genes encoding seed storage proteins) . In addition to the methylation-driven growth of high-GC3, we hypothesize that developing GC3 richness in some genes may, if it is not balanced by AT-bias, work as a feed-forward mechanism. Once it appears in genes under selective pressure, it provides additional transcriptional advantage. GC pairs differ from AT pairs since guanine binds to cytosine with three hydrogen bonds, while adenine forms only two bonds with thymine. This additional hydrogen bond makes GC pairs more stable and GC-rich genes will have different biochemical properties from AT-rich genes. When an AT pair is replaced by a GC pair in the third position of a codon, the protein sequence remains unchanged but an additional hydrogen bond is introduced. This additional bond can make transcription more efficient and reliable, change the array of RNA binding proteins, or significantly alter the three-dimensional folding of the messenger RNA. In this case, those plant species that thrive and adapt successfully to harsh environments demonstrate a strong preference for GC3 in the third position of the codon.
High GC3 content provides more targets for methylation. The correlation between methylation and GC3 is supported by Stayssman et al. , who reported a positive correlation between methylation of internal unmethylated regions and expression of the host gene. In this paper we have demonstrated a positive correlation between GC3 and variability of gene expression; we also found that high-GC3 genes are more enriched in CG than the low-GC3 class. Therefore, GC3 classes provide more targets for de novo methylation, which can serve as an additional mechanism of transcriptional regulation and affect the variability of gene expression. Additional transcriptional regulation makes species more adaptable to external stresses.
Grasses have undergone several genome duplications. Genomic regions varied in their recombination rates and GC3 contents. Since high GC3 content in a gene provided an evolutionary advantage, this was frequently the sole copy retained in grasses. This may explain why genes in the high-GC3 class frequently lack paralogs. High-GC3 genes provide an evolutionary advantage owing to their optimized codon usage and to the existence of methylation targets allowing for an additional mechanism of transcriptional regulation. Therefore, the high-GC3 class of genes has been maintained in grasses for generations.
In this paper we combine a variety of prior observations and insights on GC3 biology with new observations using larger genome data sets to establish a unifying framework of hypotheses to explain all the available data fully. This framework consists of evolutionary forces and sexual reproduction patterns to justify a wide variety of observed codon usage patterns in plants and animals. These evolutionary forces are realized through introducing new mutations during meiotic recombination and fixation with the help of DNA methylation and transcriptional mechanisms. The presence of GC3-rich genes is not likely to be a consequence of chromosomal isochores or horizontal gene transfer. Regardless of their initial origin, high-GC3 genes in recombining species possessed a self-maintaining mechanism that over time could only increase their drift towards even higher GC3 values. This uncompensated drift may explain the pronounced bimodality of some rapidly-evolving species. Competing forces acting in grasses make GC3 distribution distinctly bimodal; genes in the high-GC3 class are more transcriptionally regulated, provide more targets for methylation and accumulate more mutations than genes in the low-GC3 class.
In our analysis, we concentrated on those plant species that benefit from complete sets of full-length cDNAs and sequenced (complete or nearly complete) genomic data. We used the following species: O. sativa, S. bicolor, A. thaliana, C. reinhardtii, Z. mays, D. rerio, M. musculus and H. sapiens. O. sativa genes and genomic sequences were downloaded from the Rice Genome Annotation project ; after exclusion of all transposon-like genes and genes without full-length cDNA support we obtained a final set of 16,497 genes. Rice promoter sequences were downloaded from the Osiris database ; positions of Transcription Start Sites were refined using the TSSer algorithm . Rice microarray data were obtained from NCBI, Gene Expression Omnibus, platform GPL2025. We used two measures of expression: average intensity and standard deviation across 106 series of gene expression measurements. We used the recently published sequence and annotation data from the Joint Genome Institute for C. reinhardtii and S. bicolor (27,640), released 08/28/2008  and 10/28/2008  respectively. A. thaliana genes (27,741) were downloaded from The Arabidopsis Information Resource. Collections of D. rerio, M. musculus and H. sapiens sequences were taken from NCBI. Z. mays sequences were obtained from J. Craig Venter Institute. The remainder of the plant transcripts for the Poaceae family (aka grasses) were downloaded from TIGR Plant Transcript Assemblies . We used the frequency of single nucleotide polymorphisms per 1-kb gene length, obtained from the Plant Genome Mapping Laboratory, University of Georgia , as a crude proxy for the local recombination rate in rice. Supplementary figures and tables are available at http://model.research.glam.ac.uk/projects/glacombio/GC3/.
For each gene, GC3 values and the standard deviation of log-transformed gene expression values were computed across all experiments. Genome-wide distributions of both GC3 and gene expression are approximately normal. For each of these measures, the parameters μ (mean) and σ (standard deviation) of the corresponding normal distributions were determined. The standard deviations of gene expression and GC3 values were converted to z-scores, , and the standardized scores were plotted.
Relative abundance was calculated according to , in which it was observed that the profiles of relative dinucleotide abundance values (genome signatures) are equivalent to the "general design" of organisms, and closely-related species have similar genome signatures. The computational formulae for di- and tri-nucleotide relative abundance values are , where N stands for any nucleotide and W denotes A or T. As demonstrated by , the ratio of observed to expected CpG frequency underestimates the real CpG deficiency in GC-rich sequences: because the formula is non-linear, an identical fraction of mutated CpG in high- and low-GC classes of genes results in artificially higher values of ρ CG for the former than the latter. The authors suggested the use of a threshold of ρ CG as a function of G+C frequency to assess the presence of unmethylated sites, which can be calculated using the following formula: . In order to take the influence of this mathematical artifact into account in addition to the original relative abundance values, we also considered GC-corrected values defined as .
Principal Component Analysis (PCA) involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The data are represented in a new coordinate system such that the greatest variance of the data lies on the first principal component, the second greatest variance on the second coordinate, and so on . Our approach was generally similar to that of Chen et al. : for each gene i of O. sativa we calculated codon frequency c i, m (w), where m(w) stands for w th codon for amino acid m, and applied PCA (using the princomp function in R).
TT received her PhD in Applied Mathematics from the University of Southern California. Currently she is a Senior Lecturer in Statistics, University of Glamorgan, Wales. Prior to accepting this post she worked as a computational scientist for the biotechnology company Ceres, Loyola Marymount University and Georgia Institute of Technology. TT developed algorithms for analyzing gene expression analysis, discovering promoter motifs and genome annotation.
NA is a Senior Computational Scientist at Ceres. He received his PhD in Molecular Biology, VNII Genetika, Russia. He was a postdoctoral researcher in Kyoto University, Japan and later at NCI/NIH, and then a computational scientist at Amgen. He has done computational work on discovery of promoter motifs, protein structure, fold recognition and lead discovery.
JB earned his PhD in Molecular Biology and Genetics from the University of Pennsylvania and his BS in Computer Science from the University of Wisconsin. He has long-standing interest in genomics and has been leading bioinformatics and informatics efforts at the Human Genome Sequencing Center (Baylor College of Medicine), a biopharmaceutical company (UCB Pharma), and most recently at an agricultural biotechnology company (Ceres, Inc.)
KAF received his PhD in Genetics from Ohio State University. Upon graduation, he held positions in two companies, and later moved to the Dept of Plant Sciences at the University of Arizona. After accepting a position to start up a genomics company, Ceres, in 1997, KAF led the company's sequencing strategy, which resulted in the largest number of plant cDNAs that had ever been sequenced. Working with computational biologists at Ceres, including the three co-authors on this paper, he helped use these cDNAs to advance our understanding of plant transcriptomes. Currently KAF is the Director of the School of Plant Sciences at the University of Arizona.
We are grateful to Paul Burns, Paul Messenger, and BioMedES editorial services for proofreading the manuscript. We would like to thank our colleagues from Ceres, Inc, Loyola Marymount University, Georgia Institute of Technology, and University of Glamorgan for fruitful discussions and a supportive environment.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.