- Research article
- Open Access
The rules of gene expression in plants: Organ identity and gene body methylation are key factors for regulation of gene expression in Arabidopsis thaliana
BMC Genomics volume 9, Article number: 438 (2008)
Microarray technology is a widely used approach for monitoring genome-wide gene expression. For Arabidopsis, there are over 1,800 microarray hybridizations representing many different experimental conditions on Affymetrix™ ATH1 gene chips alone. This huge amount of data offers a unique opportunity to infer the principles that govern the regulation of gene expression in plants.
We used bioinformatics methods to analyze publicly available data obtained using the ATH1 chip from Affymetrix. A total of 1887 ATH1 hybridizations were normalized and filtered to eliminate low-quality hybridizations. We classified and compared control and treatment hybridizations and determined differential gene expression. The largest differences in gene expression were observed when comparing samples obtained from different organs. On average, ten-fold more genes were differentially expressed between organs as compared to any other experimental variable. We defined "gene responsiveness" as the number of comparisons in which a gene changed its expression significantly. We defined genes with the highest and lowest responsiveness levels as hypervariable and housekeeping genes, respectively. Remarkably, housekeeping genes were best distinguished from hypervariable genes by differences in methylation status in their transcribed regions. Moreover, methylation in the transcribed region was inversely correlated (R2 = 0.8) with gene responsiveness on a genome-wide scale. We provide an example of this negative relationship using genes encoding TCA cycle enzymes, by contrasting their regulatory responsiveness to nitrate and methylation status in their transcribed regions.
Our results indicate that the Arabidopsis transcriptome is largely established during development and is comparatively stable when faced with external perturbations. We suggest a novel functional role for DNA methylation in the transcribed region as a key determinant capable of restraining the capacity of a gene to respond to internal/external cues. Our findings suggest a prominent role for epigenetic mechanisms in the regulation of gene expression in plants.
Understanding the regulation of gene expression is essential to understand the form and function of living systems. Microarray technology has been widely used in many organisms to understand genome-wide changes in gene expression in response to treatments , in different organs , cell-types  and along developmental time series . Therefore, a large amount of microarray data representing many different biological conditions has accumulated over recent years. This data has been used successfully to hypothesize on gene function on a global scale in different organisms, such as yeast and C. elegans [5–7], and to suggest shared regulatory mechanisms. Promoters of genes with strongly correlated expression patterns in multiple experiments are likely to be bound by a common transcription factor , and conserved regulatory motifs have been identified based solely on expression data . From a systems view, however, we believe that this data has been underutilized as a resource to understand the basic rules of gene expression.
To learn the general rules that govern gene expression in plants, we took advantage of a large microarray database available for Arabidopsis in the NASCarrays database . Using this data, we defined the internal and external cues that regulate the expression of all of the Arabidopsis genes that are represented in the Affymetrix ATH1 gene chips. We quantified the effect of the different experimental conditions on gene expression, which revealed tissue type to be the most influential variable. We also analyzed different structural features and correlated it with the capacity of the genes to respond to the different stimuli. We found evidence for a mechanistic relationship between DNA methylation in the body of the gene (i.e., the transcript region) and the regulation of gene expression, thus assigning a novel and important role for the methylation of the body of the gene in eukaryotic genomes.
Results and discussion
The Arabidopsis transcriptome is robust to most perturbations but strongly influenced by organ type
In an effort to discover new principles that govern gene expression in Arabidopsis thaliana, we integrated and analyzed publicly available whole-genome microarray data for this model plant. From this data, we defined 474 biologically relevant comparisons (i.e. control vs. treatment) as described in Materials and Methods (Additional File 1). These comparisons spanned a wide variety of experimental conditions and plant organs (Figure 1). We wished to evaluate the effect of the different experimental factors that defined each comparison on genome-wide gene expression patterns. To do so, we defined differential gene expression using the RankProducts method . This method outperformed other methods to determine regulation of gene expression in previous studies [11, 12] and in our own evaluation (see Materials and Methods), particularly in datasets with a small number of replicates.
We first examined the number of differentially regulated genes per comparison. We found their distribution to be far from normal. As shown in Figure 2A, some comparisons exhibit more than 4,000 differentially expressed genes. These outliers were exclusively comparisons between different organs. In fact, organ type was the strongest experimental factor contributing to the number of differentially expressed genes. Other experimental factors, regardless of their nature, showed an approximately 10-fold smaller impact on gene expression with an average of 337 genes regulated per comparison (Figure 2B). Moreover, approximately 10% of the Arabidopsis genes did not respond to any of the stimuli in the dataset and were only differentially expressed between organ samples. Thus, organ is by far the most important factor in determining genome-wide expression levels. Furthermore, the upper 5th percentile (ordered by the number of genes regulated) of the 77 mutant vs wt comparisons involved only genes whose mutations have well documented developmental phenotypes. These genes were AP2-6, ARR21, GLABROUS1 and LFY-12 mutations . They regulated 1475, 1420, 1379 and 1362 genes, respectively – a much more than the category average (471 genes). These results indicate that global gene expression patterns are established during plant development. The results also suggest that the Arabidopsis transcriptome is robust to most perturbations, with only an estimated 1.5% of the genome on average responding in a single experiment to experimental factors such as chemical or hormone treatments, pathogen challenges or environmental stress. A detail of the categories in which each of the Arabidopsis genes responds is presented in Additional File 2. Additional Files 3 to 10 contain the genes that respond in exclusively one category, including organ type.
Given its impact on global gene expression levels, we next wished to evaluate the importance of organ type in the context of typical experimental factors that are tested in the laboratory. We compared the number of genes responding in shoots or roots for each of the nine treatments in the AtGenExpress abiotic stress series. On average, only 13% of the total genes that responded to a treatment responded in both organs. By contrast, a much higher proportion of genes (88%) were regulated by the treatment in an organ-specific manner (Additional File 11). This data indicate that plant responses to external stimuli are strongly organ-dependent and underscore the need for a more thorough survey of organ-specific and, by extension, cell-specific responses in Arabidopsis and other plants .
Housekeeping and hypervariable genes possess marked structural differences
To identify properties that explain the capacity of a gene to respond to stimuli, we ranked genes based on the number of comparisons in which they are differentially expressed. As shown in Figure 2C, the Arabidopsis genome contains genes that are regulated in a wide range of comparisons, with an average of 14 comparisons, or 3% of the total comparisons in our dataset. The underlying data is provided in Additional File 12. We expect structural differences to be maximized at the extremes of this distribution. We defined housekeeping genes based on three criteria: (1) genes that were not differentially expressed in any of the 474 comparisons, (2) genes with signal intensities higher than the median intensity across the entire dataset and (3) genes with the lowest signal variability (measured with the interquartile range, see Materials and Methods) across the entire dataset. In contrast, we defined hypervariable genes based on the following three criteria: (1) genes that were within the top 1% of the gene responsiveness distribution, (2) genes with the largest signal variability, and (3) genes that show differential expression by stimuli from six of the eight categories described in Figure 1A. These criteria defined 384 housekeeping genes and 123 hypervariable genes (Additional files 13 and 14).
A previous study positively correlated expression levels with gene size in plants . To understand how gene responses to stimuli relate to gene size and other structural features, we analyzed the structure of housekeeping and hypervariable genes. Housekeeping genes were significantly larger and had more introns than do hypervariable genes and were above genome averages for both criteria (Table 1). By contrast, hypervariable genes were significantly shorter and contained fewer introns than average (Table 1). Interestingly, a functional annotation of the hypervariable gene set indicates that it is enriched for genes involved in responses to internal and external stimuli (Additional File 15). Most hypervariable genes were plant specific as defined in a previous study , and the set was enriched for genes that code for unstable transcripts  (Table 1). These results suggest that plants favored the evolution of small, hypervariable genes to respond quickly and economically to multiple environmental signals.
Eukaryotic genes are transcriptionally regulated by the coordinated interaction of multiple protein factors that interact with discrete binding sites and with each other . These binding sites are usually located upstream of the transcribed region they regulate . The promoters of hypervariable genes often have a TATA-box sequence and contain a larger number of predicted transcription factor binding sites as compared to the housekeeping genes or the genome average (Table 1 and Additional File 16). These data suggest that the presence of a TATA box and the number of transcription factor binding sites in the promoter region of some of the most responsive genes in Arabidopsis may explain their capacity to respond to stimuli, as was previously found in an analysis of a smaller expression dataset . However, it is clear that this simple rule does not always apply and that other factors are necessary to explain gene expression responses.
In addition to gene structure, epigenetic mechanisms such as DNA methylation are known to have an impact on gene expression in eukaryotes, particularly in heterochromatic regions [22, 23]. To evaluate the potential role of DNA methylation in the gene expression responses observed for housekeeping and hypervariable genes, we analyzed the methylation patterns of these two groups of genes. We used two recently published genome-wide methylation data sets [24, 25] to analyze methylation in the promoter and transcribed regions of each gene. Using the methylome data produced by Zhang et al. , we found that a large proportion of housekeeping genes were methylated in their transcribed regions (a significant enrichment compared to the expected genome frequency; p = 1.5E-35, Table 1). By contrast, only 8% of the hypervariable genes were methylated in their transcribed regions (a significant depletion; p = 2E-10, Table 1). Similar results were obtained with an independently generated methylome data set . These results suggest that the capacity of Arabidopsis housekeeping and hypervariable genes to respond to stimuli not only depends on structural features in their promoter or transcribed regions, such as transcription factor binding sites, but may also have an important epigenetic component.
Transcript region methylation is the most important factor to explain genome-wide responses to internal/external stimuli
To evaluate the importance of these features for gene expression responses on a genomic scale, we performed a regression analysis of the gene responsiveness for all Arabidopsis genes as a function of each of the structural features described above. We used a linear model of the form: Y ~ αX + β, where Y was the observed gene responsiveness of all genes and X was the structural feature under evaluation (e.g. presence of TATA-box, cis-acting binding sites in the promoter or gene body methylation). Thus, the effects detected were free from any bias arising from gene selection, as could be the case when analyzing this relatively small group of housekeeping and hypervariable genes.
Notably, using the two independently generated methylome datasets [24, 25], gene responsiveness showed a remarkably high negative correlation with the presence of methylation in the transcribed region of the gene. Both datasets generated models with a coefficient of determination (R2) of 0.8 (share of explained variability, Figure 3A–B). A similar result was obtained using average fold-change ≥ |2| (treatment versus control) as a criterion to determine gene responsiveness (Additional Files 17 and 18). This correlation was independent of the type of experimental factor, as similar trends were observed when analyzing each experimental category individually for both methylome datasets (Figure 3C–F and Additional File 19). Next, to transcript region methylation, the presence of a TATA-box was the second best factor to explain gene responsiveness, and it had a positive effect. R2 for two definitions of TATA-box [26, 27] were 0.49 or 0.68. Two factor models that included transcript region methylation and the presence of a TATA-box slightly improved the R2 over those obtained with methylation alone (Table 2). Two factor ANOVA models (Additional File 20) confirmed the stronger effect of gene body methylation on responsiveness, as determined by the Tukey comparison procedure . However, goodness of fit estimation by the Bayesian Information Criteria  suggests that additive models, including TATA-box and methylation, are better than one-factor ANOVA models. (Additional File 20). Interestingly, this also suggests that the effect of TATA-box and methylation are independent, as interaction terms are not significant in these models (not shown). None of the other structural features (gene size, presence of introns, number of binding sites, etc) yielded models with such high R2 on a genomic scale. Thus, gene body methylation and, to a lesser extent, TATA-box presence explained gene responsiveness on a global scale. It is not possible, however, to infer from this data the mechanistic relationships between TATA-related factors, gene body methylation status and regulation of gene expression.
The effect of DNA methylation on gene responsiveness could be explained by a simple transcriptional gene silencing effect [22, 23]. Silencing a gene would render it unable to be regulated. If so, transcript region methylation should correlate with expression levels. Comparing the frequency of methylation to the median expression level of the whole dataset revealed no such trend (Figure 4). The most and the least highly expressed genes are likely to lack methylation within their body, as previously reported . Similarly, no correlation was found between the presence of a TATA-box and gene expression levels. (Figure 4). Moreover, no relationship was evident between expression level and gene responsiveness in our data set (Additional File 21).
We also evaluated the relationship between the presence of modified histones and gene responsiveness. We used a recently published genomic survey of trimethylation in lysine 27 of histone H3 (H3K27me3) f. We found a weak correlation between the frequency of H3K27me3 gene targets and gene responsiveness, with an R2 of 0.12 (Figure 3F and Additional File 19). This finding is consistent with the hypothesis that H3K27me3 mostly acts in a DNA methylation-independent manner, as previously suggested . Other histone modifications, such as H3K4 or H3K9 methylation  or combinations thereof , may be related to gene body methylation in Arabidopsis, thus "marking" the corresponding chromatin region for or against the regulation of gene expression .
Gene body methylation and regulation of expression by nitrate in TCA cycle genes
As a case-study and to provide a concrete example of the influence of methylation patterns on the regulation of gene expression, we focused on a discrete biological process and experimental factor: nitrate. Nitrate has been shown to be a signal to regulate gene expression in plants . We chose four microarray experiments in which wild-type seedlings were treated with different nitrate concentrations. These nitrate experiments were not included in the microarray database used in the previous sections. We found that nitrate regulates many genes in central metabolic pathways such as the TCA cycle [34–37]. We analyzed responsiveness and nitrate regulation for all genes coding for TCA cycle enzymes. Most of the genes (29 out of 36, data not shown) did not respond to the nitrate treatments, as expected due to the robustness of expression patterns in Arabidopsis (see Figure 2B). Among the genes regulated by nitrate, we found a malate dehydrogenase gene (MDH, At3g47520), two genes coding for NAD+ dependent isocitrate dehydrogenases (At5g03290 and At4g35260) and a putative NADP+ dependent isocitrate dehydrogenase (At1g65930) (Table 3). Remarkably, these four genes were classified as unmethylated in studies by both Zhang et al.  and Zilberman et al. . Moreover, body methylated genes were enriched among the analyzed genes that were not regulated by nitrate (Table 3). For instance, among eight genes coding for malate dehydrogenase that are not regulated by nitrate, five are methylated according to the two methylome datasets. This is a much higher frequency than is expected by chance (p < 0.05), as only 20–34% of the genes were methylated according to the two methylome datasets. The same was true for the isocitrate dehydrogenases, with enrichment of methylated genes for those that did not respond to the nitrate treatment (p < 0.05). These results agree with the proposed relationship between gene body methylation and the regulation of gene expression in response to regulatory signals (in this case, nitrate). Moreover, it suggests gene body methylation plays a role in the regulation of gene expression in physiological processes such as the reprogramming of carbon metabolism in response to nitrogen nutrient availability .
The analysis of the large and heterogeneous whole-genome microarray dataset available in the public domain proved useful to evaluate principles that govern regulation of gene expression in plants. Our global and systematic analysis of the quantitative effect of different experimental factors (e.g., mutations, stress and organ identity) on the plant transcriptome revealed the key role of developmental processes for establishing mRNA levels throughout the plant. This process in turn determines how cells, organs and tissues respond to exogenous cues. Our data indicate that plant responses to external stimuli are strongly organ-dependent and underscore the need for a more thorough survey of organ-specific and, by extension, cell-specific responses in Arabidopsis and other plants .
The second part of our analysis provided a weighted insight into the role of different molecular mechanisms in the global regulation of gene expression in Arabidopsis. The data indicate that DNA methylation within the body of Arabidopsis genes is a key factor that may determine or negatively influence the capacity of genes to respond to internal or external cues. The presence of a TATA-box may favor gene responsiveness but to a lesser extent than the negative effect of DNA methylation. Surprisingly, our data indicate that other gene structural features (e.g., number of cis-acting elements, gene size, presence and number of introns) are less important than DNA methylation and the presence of a TATA-box. These results highlight the importance of epigenetic mechanisms for the global control of gene expression. As a concrete example, we found consistency between regulation by an external stimulus (nitrate) and gene body methylation for a discrete biological process, the TCA cycle, beyond what would be expected by chance. The results presented here suggest a model whereby gene body DNA methylation restrains the ability of a gene to be regulated, regardless of regulatory signals (e.g., binding sites for specific transcription factors in the promoter region). This effect would not be directly dependent on basal gene expression levels. Moreover, our results provide a plausible functional role for the DNA methylation that is found in the body of a large number of Arabidopsis genes. This new role differs from the proposed role for DNA methylation in suppressing spurious transcriptional initiation [25, 39] and reinforces the link between the regulation of gene expression and DNA methylation in eukaryotes.
The CEL data files comprising all ATH1 Affymetrix hybridizations through the end of 2005 were obtained from NASCArrays through the AffyWatch Subscription Service. This data comprised 1887 hybridizations corresponding to 108 different experiments. The entire hybridization set was normalized using the Robust Multiarray Analysis method  available from Bioconductor http://www.bioconductor.org. Once normalized, the hybridizations were quality-controlled using the method devised by Persson et al . Briefly, this method uses a Kolmogorov-Smirnov goodness-of-fit test to evaluate whether the distribution of deleted residuals for an individual hybridization deviates from a "t" distribution. According to Persson et al , this occurs when the value of the D statistic from the goodness-of-fit test is more than 0.15. The CEL files with a D statistic over this cut-off value were excluded from the analysis. This step resulted in the exclusion of 186 CEL files.
For the analysis of differential expression, the remaining 1701 hybridizations were mapped to their corresponding experiments. Controls and biologically meaningful tests were identified and grouped with their replicates. Comparisons in which the control or treatment hybridizations had less than 2 replicates were discarded. This process resulted in a list of 474 biologically meaningful comparisons (control versus test), including 1295 hybridizations. In the case of tissue comparisons, we used rosette leaves as a control, and all other tissues were considered tests. Rosette leaves were chosen as the reference because they are the prototypical organ system . We classified the comparisons according the experimental variable involved using the criteria defined by TAIR , and according to the RNA source organ (Figure 1)
Differential expression analysis
The comparisons were analyzed for differential gene expression using the RankProducts method , implemented as a Bioconductor package . This method outperformed other methods to define differential expression in a study comparing ten different methods , particularly in high-noise, low-replicate datasets. Our comparisons have a low number of replicates (average = 2.7) and a high variability (pooled variance of the whole dataset = 4.04). We also evaluated the performance of RankProducts as compared to other popular alternative methods based on biological criteria. We defined regulation using RankProducts, average fold change and t-test with different FDR corrections for multiple testing [44, 45]. To evaluate the methods, we randomly chose five test comparisons from different experimental categories (e.g. biotic, abiotic, tissue).
We evaluated the functional coherence of the differentially expressed genes by the different methods by evaluating enriched gene ontology (GO) terms in the resulting lists. For most of the comparisons tested, visual inspection revealed enriched GO terms that were obviously related to the experimental factor. This was not the case for the other methods. As an example, 245 genes were found to be differentially expressed in the comparison DO.1.1 (Additional File 1). Out of these 245 genes, 217 were previously identified as regulated in these experiments using a different method in a prior study . In addition, the 140 down-regulated genes determined by RankProducts showed an overrepresentation of "transport" and other functional terms previously known to be related to the experimental factor . Similarly, the abscisic acid response evaluated in comparison AQ.4.4 (Additional File 1) identified 241 differentially expressed genes. Among the up-regulated genes, we found that the 'abscisic acid response' functional term was overrepresented.
With the results of the differential expression analysis, a "regulation matrix" was created. This matrix contained the p-value for the down- and up-regulation of all of the ATH1 Affymetrix chip probes across the 474 comparisons. The cut-off for defining a probe as differentially expressed was 0.05. The complete data file with ratios is available from http://virtualplant.bio.puc.cl/cgi-bin/Lab/download.cgi. Additional data files are available upon request.
Housekeeping and hypervariable gene definition
The least responsive genes (housekeeping genes) were defined as follows: first, we selected genes which did not show differential expression in any comparison (5652 genes). Second, these genes were filtered for expression above the median of the entire NASC dataset (1758 genes). Third, we choose only those having a signal difference between the 1st and 3rd quartile (interquartile range) that was in the bottom 5 percentile of the signal interquartile ranges from the whole dataset. This ensured the selection of 384 expressed Arabidopsis genes that exhibit the lowest expression variability.
For the most responsive genes (hypervariable genes), we first choose genes that were regulated in 86 or more comparisons, corresponding to the top 1% most responsive genes from Figure 2C. Second, we selected genes that were regulated in at least six out of the eight categories defined in Figure 1A to avoid any bias due to large categories (e.g., abiotic stress experiments). We did not use an expression cutoff, since as expected hypervariable genes were sufficiently expressed, with a median signal of 8.4 across the NASC dataset (the global median is 7.4). From the 185 genes selected by these criteria, we choose those with a signal interquartile range in the upper 5% of the entire dataset. Thus, we defined a group of 123 "hypervariable genes".
Structural and phylogenetic analyses and correlation with gene responsiveness
Gene structural features (gene, CDS, exon, intron lengths and numbers) – were obtained from the TAIR 6.0 Arabidopsis genome . Phylogenetic classifications of the genes were obtained from the Plant-Specific Database . Methylation status of the different genes (body methylated, body unmethylated and promoter methylated) was obtained from Zhang et al.  or Zilberman et al. . TATA-box presence or absence in the promoter region of Arabidopsis genes was obtained from Molina and Grotewold. The number of transcription factor binding sites in gene promoters was calculated from the data in the AtCis Database from AGRIS . Unstable transcripts were extracted from the data generated by Gutierrez et al. . All data were processed using custom-made scripts in R http://www.R-project.org and Perl languages. Statistical analyses and graphs were done in R, GraphPad Prisma 4.0 software or Microsoft Excel.
Statistical and regression analysis
Calculation of significant enrichment or depletion was done in R using the hypergeometric distribution. t-tests were carried out with the GraphPad Prisma 4.0 software. Simple and multiple linear regression models used to predict gene responsiveness as a function of various structural parameters were done in R. We used simple models of the form: Y ~ αX + β, where Y, the response variable, is the gene responsiveness and X is the value of the structural feature under evaluation. In the case of categorical features, such as methylation or the presence of TATA-box, X represented the frequency of the feature in a group of genes sharing the same responsiveness. For multiple linear regressions, we used models of the form: Y ~ αX + βZ + γW... where Y was the gene responsiveness and X, Z, W, etc. corresponded to different features to evaluate. Models were fitted using the lm function from the R statistical software. We used the R2 parameter to evaluate the quality of the model, since R2 represents the extent of data variability explained by the model. As a complementary approach for categorical features, we used one factor ANOVA models. They have the form Y ~ αX + β, where X was a factor encoding the presence or absence of those features at two different levels. We used the 'aov' function in R to fit the model. We used the F statistic to estimate the significance of the contribution of the factors to the response. To estimate the differences between the levels of the factors, we followed the Tukey procedure, using the 'glht' function from the 'multcomp' package in R. The Bayesian Information Criteria was calculated in R using the 'BIC' function in the package 'nlme'. Graphs were done in R, GraphPad Prisma 4.0 software or Microsoft Excel.
Gene body methylation and regulation by nitrate for TCA cycle genes
We retrieved the genes corresponding to the TCA cycle from AraCyc . We then determined the gene responsiveness of these genes in four previously published microarray data sets [34–37] that were not included in the NASCarrays database and were therefore not used to derive our genome-wide conclusions. We intersected the methylation status [24, 25] and regulation by nitrate of the genes encoding malate dehydogenases and isocitrate dehydrogenases using the VirtualPlant software platform http://www.virtualplant.org. Statistical analysis of enrichment was performed as described above.
Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D'Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007, 50 (2): 347-363. 10.1111/j.1365-313X.2007.03052.x.
Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D, Lohmann JU: A gene expression map of Arabidopsis thaliana development. Nat Genet. 2005, 37 (5): 501-506. 10.1038/ng1543.
Birnbaum K, Shasha DE, Wang JY, Jung JW, Lambert GM, Galbraith DW, Benfey PN: A Gene Expression Map of the Arabidopsis Root. Science. 2003, 302 (5652): 1956-1960. 10.1126/science.1090022.
Spencer MW, Casson SA, Lindsey K: Transcriptional profiling of the Arabidopsis embryo. Plant Physiol. 2007, 143 (2): 924-940. 10.1104/pp.106.087668.
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell. 2000, 102 (1): 109-126. 10.1016/S0092-8674(00)00015-5.
Kim SK, Lund J, Kiraly M, Duke K, Jiang M, Stuart JM, Eizinger A, Wylie BN, Davidson GS: A Gene Expression Map for Caenorhabditis elegans. Science. 2001, 293 (5537): 2087-10.1126/science.1061603.
Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302 (5643): 249-255. 10.1126/science.1087447.
Allocco D, Kohane I, Butte A: Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics. 2004, 5 (1): 18-10.1186/1471-2105-5-18.
Harmer SL, Hogenesch JB, Straume M, Chang HS, Han B, Zhu T, Wang X, Kreps JA, Kay SA: Orchestrated transcription of a key pathway in Arabidopsis by the circadian clock. Science. 2000, 290: 2110-2113. 10.1126/science.290.5499.2110.
Craigon DJ, James N, Okyere J, Higgins J, Jotham J, May S: NASCArrays: a repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res. 2004, D575-577. 10.1093/nar/gkh133. 32 Database
Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004, 573 (1–3): 83-92. 10.1016/j.febslet.2004.07.055.
Jeffery IB, Higgins DG, Culhane AC: Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics. 2006, 7: 359-10.1186/1471-2105-7-359.
Kunst L, Klenz JE, Martinez-Zapater J, Haughn GW: AP2 Gene Determines the Identity of Perianth Organs in Flowers of Arabidopsis thaliana. Plant Cell. 1989, 1 (12): 1195-1208. 10.1105/tpc.1.12.1195.
Tajima Y, Imamura A, Kiba T, Amano Y, Yamashino T, Mizuno T: Comparative studies on the type-B response regulators revealing their distinctive properties in the His-to-Asp phosphorelay signal transduction of Arabidopsis thaliana. Plant Cell Physiol. 2004, 45 (1): 28-39. 10.1093/pcp/pcg154.
Schiefelbein J: Cell-fate specification in the epidermis: a common patterning mechanism in the root and shoot. Curr Opin Plant Biol. 2003, 6 (1): 74-78. 10.1016/S136952660200002X.
Weigel D, Alvarez J, Smyth DR, Yanofsky MF, Meyerowitz EM: LEAFY controls floral meristem identity in Arabidopsis. Cell. 1992, 69 (5): 843-859. 10.1016/0092-8674(92)90295-N.
Ren X-Y, Vorst O, Fiers MWEJ, Stiekema WJ, Nap J-P: In plants, highly expressed genes are the least compact. Trends in Genetics. 2006, 22 (10): 528-532. 10.1016/j.tig.2006.08.008.
Gutierrez RA, Green PJ, Keegstra K, Ohlrogge JB: Phylogenetic profiling of the Arabidopsis thaliana proteome: what proteins distinguish plants from other organisms?. Genome Biol. 2004, 5 (8): R53-10.1186/gb-2004-5-8-r53.
Gutierrez RA, Ewing RM, Cherry JM, Green PJ: Identification of unstable transcripts in Arabidopsis by cDNA microarray analysis: rapid decay is associated with a group of touch- and specific clock-controlled genes. Proc Natl Acad Sci USA. 2002, 99 (17): 11513-11518. 10.1073/pnas.152204099.
Orphanides G, Reinberg D: A Unified Theory of Gene Expression. Cell. 2002, 108 (4): 439-451. 10.1016/S0092-8674(02)00655-4.
Walther D, Brunnemann R, Selbig J: The regulatory code for transcriptional response diversity and its relation to genome structural properties in Arabidopsis thaliana. PLoS Genetics. 2006, preprint(2006):e11.eor.
Chan SW, Henderson IR, Jacobsen SE: Gardening the genome: DNA methylation in Arabidopsis thaliana. Nat Rev Genet. 2005, 6 (5): 351-360. 10.1038/nrg1601.
Gehring M, Henikoff S: DNA methylation dynamics in plant genomes. Biochim Biophys Acta. 2007
Zhang X, Yazaki J, Sundaresan A, Cokus S, Chan SW, Chen H, Henderson IR, Shinn P, Pellegrini M, Jacobsen SE, Ecker JR: Genome-wide high-resolution mapping and functional analysis of DNA methylation in arabidopsis. Cell. 2006, 126 (6): 1189-1201. 10.1016/j.cell.2006.08.003.
Zilberman D, Gehring M, Tran RK, Ballinger T, Henikoff S: Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat Genet. 2007, 39 (1): 61-69. 10.1038/ng1929.
Molina C, Grotewold E: Genome wide analysis of Arabidopsis core promoters. BMC Genomics. 2005, 6 (1): 25-10.1186/1471-2164-6-25.
Shahmuradov IA, Gammerman AJ, Hancock JM, Bramley PM, Solovyev VV: PlantProm: a database of plant promoter sequences. Nucleic Acids Res. 2003, 31 (1): 114-117. 10.1093/nar/gkg041.
Tukey J: Multiple comparisons. J Am Stat Assoc. 1953, 48: 624-625.
Schwarz G: Estimating the dimension of a model. Annls Statistics. 1978, 6: 461-464. 10.1214/aos/1176344136.
Zhang X, Clarenz O, Cokus S, Bernatavichute YV, Pellegrini M, Goodrich J, Jacobsen SE: Whole-genome analysis of histone H3 lysine 27 trimethylation in Arabidopsis. PLoS Biol. 2007, 5 (5): e129-10.1371/journal.pbio.0050129.
Shi J, Dawe RK: Partitioning of the Maize Epigenome by the Number of Methyl Groups on Histone H3 Lysines 9 and 27. Genetics. 2006, 173 (3): 1571-1583. 10.1534/genetics.106.056853.
Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, Jaenisch R, Wagschal A, Feil R, Schreiber SL, Lander ES: A Bivalent Chromatin Structure Marks Key Developmental Genes in Embryonic Stem Cells. Cell. 2006, 125 (2): 315-326. 10.1016/j.cell.2006.02.041.
Garcia-Bassets I, Kwon Y-S, Telese F, Prefontaine GG, Hutt KR, Cheng CS, Ju B-G, Ohgi KA, Wang J, Escoubet-Lozach L, Rose DW, Glass CK, Fu X-D, Rosenfeld MG: Histone Methylation-Dependent Mechanisms Impose Ligand Dependency for Gene Activation by Nuclear Receptors. Cell. 2007, 128 (3): 505-518. 10.1016/j.cell.2006.12.038.
Wang R, Tischner R, Gutierrez RA, Hoffman M, Xing X, Chen M, Coruzzi G, Crawford NM: Genomic analysis of the nitrate response using a nitrate reductase-null mutant of Arabidopsis. Plant Physiol. 2004, 136 (1): 2512-2522. 10.1104/pp.104.044610.
Scheible WR, Morcuende R, Czechowski T, Fritz C, Osuna D, Palacios-Rojas N, Schindelasch D, Thimm O, Udvardi MK, Stitt M: Genome-wide reprogramming of primary and secondary metabolism, protein synthesis, cellular growth processes, and the regulatory infrastructure of Arabidopsis in response to nitrogen. Plant Physiol. 2004, 136 (1): 2483-2499. 10.1104/pp.104.047019.
Palenchar P, Kouranov A, Lejay L, Coruzzi G: Genome-wide patterns of carbon and nitrogen regulation of gene expression validate the combined carbon and nitrogen (CN)-signaling hypothesis in plants. Genome Biology. 2004, 5 (11): R91-
Wang R, Okamoto M, Xing X, Crawford NM: Microarray analysis of the nitrate response in Arabidopsis roots and shoots reveals over 1,000 rapidly responding genes and new linkages to glucose, trehalose-6-phosphate, iron, and sulfate metabolism. Plant Physiology. 2003, 132: 556-567. 10.1104/pp.103.021253.
Stitt M, Muller C, Matt P, Gibon Y, Carillo P, Morcuende R, Scheible WR, Krapp A: Steps towards an integrated view of nitrogen metabolism. J Exp Bot. 2002, 53 (370): 959-970. 10.1093/jexbot/53.370.959.
Suzuki MM, Kerr ARW, De Sousa D, Bird A: CpG methylation is targeted to transcription units in an invertebrate genome. Genome Res. 2007, 17 (5): 625-631. 10.1101/gr.6163007.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostat. 2003, 4 (2): 249-264. 10.1093/biostatistics/4.2.249.
Persson S, Wei H, Milne J, Page GP, Somerville CR: Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. PNAS. 2005, 102 (24): 8633-8638. 10.1073/pnas.0503392102.
Rhee SY, Beavis W, Berardini TZ, Chen GH, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu YH, Xu I, Yoo D, Yoon J, Zhang PF: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Research. 2003, 31 (1): 224-228. 10.1093/nar/gkg076.
Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006, 22 (22): 2825-2827. 10.1093/bioinformatics/btl476.
Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995, 57: 289-300.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100 (16): 9440-9445. 10.1073/pnas.1530509100.
Deeken R, Engelmann JC, Efetova M, Czirjak T, Muller T, Kaiser WM, Tietz O, Krischke M, Mueller MJ, Palme K, Dandekar T, Hedrich R: An integrated view of gene expression and solute profiles of Arabidopsis tumors: a genome-wide approach. Plant Cell. 2006, 18 (12): 3617-3634. 10.1105/tpc.106.044743.
Davuluri R, Sun H, Palaniswamy S, Matthews N, Molina C, Kurtz M, Grotewold E: AGRIS: Arabidopsis Gene Regulatory Information Server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics. 2003, 4 (1): 25-10.1186/1471-2105-4-25.
Mueller LA, Zhang P, Rhee SY: AraCyc: A Biochemical Pathway Database for Arabidopsis. Plant Physiol. 2003, 132 (2): 453-10.1104/pp.102.017236.
We thank Xiaoyu Zhang, Dr. Steve Jacobsen and Dr. Joseph Ecker for kindly providing genome-wide DNA methylation data in a custom format, and Juanita Larraín-Linton for her proof-reading. This work was funded by grants from: ICGEB (CRPCHI0501), FONDECYT (1060457), MILLENNIUM NUCLEUS FOR PLANT FUNCTIONAL GENOMICS (P006-09-F), FUNDACION ANDES (C14060/62) and NSF (DBI0445666) to R.A.G. F.F.A. was funded by a Ph.D. fellowship from CONICYT.
FFA carried out the bioinformatics and statistical analyses and wrote the manuscript. NM and SYR revised the manuscript critically for important intellectual content. RAG carried out some of the bioinformatics analyses, wrote the manuscript and was responsible for the conception of the study, the design of the data analysis and the interpretation of the results. All authors read and approved the final manuscript.
Electronic supplementary material
Additional File 2: Gene responsiveness by categories. Table detailing the number of experiments, within the eight experimental categories, in which each Arabidopsis gene is regulated. The number in parenthesis in the header of the Table indicates the total number of experiments in each category. (XLS 4 MB)
Additional File 11: Importance of organ type in the response to abiotic stress in Arabidopsis. Percentage of genes responding to various stresses in either roots, shoots or both. Data corresponds to the AtGenExpress Abiotic Stress series present in the NASCarrays database. The black zone indicates the percentage of genes responding only in roots; the white zone indicates those responding only in shoots, and the black squares region indicates the genes responding in both tissues (PDF 40 KB)
Additional file 13: . List of Housekeeping and hypervariable genes, classified according their methylation status as defined in:Zhang X, et al: Genome-wide high-resolution mapping and functional analysis of DNA methylation in arabidopsis. Cell 2006, 126(6): 1189–1201. Gene annotation was provided by the VirtualPlant system http://www.virtualplant.org. (XLS 61 KB)
Additional file 14: . List of Housekeeping and hypervariable genes, classified according their methylation status as defined in: Zilberman et al: Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat Genet 2007, 39(1): 61–69. Gene annotation was provided by the VirtualPlant system http://www.virtualplant.org (XLS 49 KB)
Additional file 15: . Analysis of over-representation of gene ontology functional terms in housekeeping and hypervariable genes (performed in VirtualPlant – http://www.virtualplant.org (XLS 15 KB)
Additional file 16: Enrichment of cis-acting motifs in the promoter of hypervariable genes. Frequency distribution of the number of predicted transcription binding sites in the promoter of housekeeping and hypervariable genes and the whole genome. The genes were ranked according to the number of cis-acting regulatory elements in their promoters according to the AGRIS database (X-axis). The points represent the fraction of genes in a bin of 10 motifs. (PDF 48 KB)
Additional file 19: . Description is as Table 2, see main text. (XLS 22 KB)
Additional file 20: ANOVA models for the effect of methylation and TATA-box presence on gene responsiveness, by category of experimental treatment. The models have the form Y ~ aX + b, where X was a factor encoding the presence or absence of those features as two different levels. We used the 'aov' function in R to fit the model. The F statistic estimates the significance of the contribution of the factors to the response. The differences between the levels of the factors were estimated by the Tukey procedure, using the 'glht' function from the 'multcomp' package in R. This is equivalent to comparing the coefficients of the factors. The Bayesian Information Criteria was calculated in R using the 'BIC' function in the package 'nlme'. This parameter represents the "a posteriori" probability of the model to be true, being maximized when the magnitude of the parameter is minimized. (XLS 28 KB)
Additional file 21: . Box plot of the signal of a gene across the whole NASC arrays dataset (X-axis) versus gene responsiveness (the number of comparisons in which it is significantly regulated, Y-axis). A simple linear regression model cannot explain the variability in the data (R2 = 0.04). (PDF 3 KB)
About this article
Cite this article
Aceituno, F.F., Moseyko, N., Rhee, S.Y. et al. The rules of gene expression in plants: Organ identity and gene body methylation are key factors for regulation of gene expression in Arabidopsis thaliana. BMC Genomics 9, 438 (2008). https://doi.org/10.1186/1471-2164-9-438
- Transcription Factor Binding Site
- Gene Responsiveness
- Arabidopsis Gene
- Gene Expression Response
- Gene Body Methylation