Inference of transcription modification in long-live yeast strains from their expression profiles

Background Three kinases: Sch9, PKA and TOR, are suggested to be involved in both the replicative and chronological ageing in yeast. They function in pathways whose down-regulation leads to life span extension. Several stress response proteins, including two transcription factors Msn2 and Msn4, mediate the longevity extension phenotype associated with decreased activity of either Sch9, PKA, or TOR. However, the mechanisms of longevity, especially the underlying transcription program have not been fully understood. Results We measured the gene expression profiles in wild type yeast and three long-lived mutants: sch9Δ, ras2Δ, and tor1Δ. To elucidate the transcription program that may account for the longevity extension, we identified the transcription factors that are systematically and significantly associated with the expression differentiation in these mutants with respect to wild type by integrating microarray expression data with motif and ChIP-chip data, respectively. Our analysis suggests that three stress response transcription factors, Msn2, Msn4 and Gis1, are activated in all the three mutants. We also identify some other transcription factors such as Fhl1 and Hsf1, which may also be involved in the transcriptional modification in the long-lived mutants. Conclusion Combining microarray expression data with other data sources such as motif and ChIP-chip data provides biological insights into the transcription modification that leads to life span extension. In the chronologically long-lived mutant: sch9Δ, ras2Δ, and tor1Δ, several common stress response transcription factors are activated compared with the wild type according to our systematic transcription inference.


Background
The yeast S.cerevisae has become one of the most valuable model organisms for ageing studies. In this uni-cellular eukaryote, two distinct paradigms are used to measure longevity. The first, replicative life span (RLS) is defined as the mean or maximum number of daughter cells generated by individual mother cells [1]. The second, chronological life span (CLS) is a measure of the mean or maximum survival time of populations of non-dividing yeast [2]. Yeast RLS has been proposed as a model for the ageing of dividing cells of higher eukaryotes, whereas CLS is believed to better model the ageing of post-mitotic cells [3][4][5]. RLS was the first paradigm to be used for ageing studies. Currently about 50 genes have been implicated in determining RLS. In comparison, fewer genes have been shown to regulate the chronological ageing. Recent studies have indicated three nutrient responsive yeast kinases: Sch9, PKA, and TOR, as major regulators of both types of ageing. Sch9 is a yeast kinase homologous to mammalian serine/threonine protein kinase Akt. Inactivation of Sch9 increases RLS by 30-40% [6] and extends CLS by nearly three folds [4]. Down-regulation of PKA activity obtained by introducing mutations in RAS2 and CYR1 (encoding proteins that regulate PKA activity) approximately doubles the CLS of yeast [4,5]. Recently, two high-throughput screenings were performed in yeast to identify genes that determine RLS and CLS, respectively. The first screening identified 10 gene deletions that increase RLS, and 6 of them (including the deletion of TOR1) correspond to genes encoding proteins in the TOR pathways [7]. The other screening identified several TOR-related gene deletions that increase CLS [8]. In yeast, as well as in higher eukaryotes, Sch9, PKA, and TOR coordinate signals from nutrients to regulate ribosome biogenesis, stress response, cell size, autophagy, and other cellular processes [9][10][11][12]. Of more importance, mutations that decrease the activity of the orthologs of these proteins in higher eukaryotes also extend life span, suggesting that the roles of these kinases in the regulation of life span are conserved along evolution [13][14][15][16][17].
Although the roles of Sch9, PKA, and TOR on life span extension are not fully understood, it is known that some stress response genes down-stream of these pathways are required for longevity. In the ras2∆ cells, the CLS extension is mediated by stress resistance transcription factor Msn2 and Msn4, which induce the expression of genes encoding for several heat shock proteins, catalase (Ctt1) and superoxide dismutase (Sod2). Transcription regulation of these genes by Msn2/Msn4 depends on the existence of a stress response element (STRE) in their promoter regions [5]. Sod2 is required for life span extension in ras2∆ and sch9∆ and over-expression of Sod2 extends longevity [18]. Moreover, longevity in the sch9∆ cells depends on the activity of Rim15 kinase [4]. The kinase Rim15 is known to integrate signals from TOR, PKA, and Sch9 [19], and activates Gis1, a transcription factor, which regulates genes containing a PDS (postdiauxic shift) element and is involved in the induction of theromotolerance and starvation resistance by a Msn2/Msn4-independent mechanism [20].
To better understand the function of Sch9, PKA and TOR kinases in yeast life span extension, we measured the gene expression profiles of wild type yeast as well as three long-lived mutants: sch9∆, ras2∆, and tor1∆ using the Affymetrix microarray technology. In this paper, we aim to address the question: what are the transcription factors that are involved in the longevity of these mutants? A number of methods have been proposed to answer this question. A straightforward method is to identity a set of differentially expressed or co-expressed genes, and then search their promoter sequences for known transcription factor binding sites or use de nova motif finding method to identify enriched motifs [21]. However, results obtained by this method are sensitive to the selection of the reference set, the cutoff value and some other factors. To overcome this problem, a systematic and statistical approach called PAP (promoter analysis pipeline) is proposed, which suggests an integrated model considering all of the promoters and characterized transcription factors in a genome [22]. Other two methods, REDUCER [23] and MOTIF REGRESSOR [24], identify regulatory motifs in response to a condition by associating log expression value of a gene with the motif abundance or motif-matching score in its promoter region using a linear model. In this paper, we apply two systematic strategies, as does PAP, to infer the regulatory transcription factor associated with longevity in sch9∆, ras2∆, and tor1∆ cells. The first strategy is based on motif analysis. We perform de novo identification of motifs from all the yeast promoter sequences and then test the enrichment of them in the up/ down-regulated genes of long-lived mutants using gene expression in wild type yeast as control. The second strategy is based on the ChIP-chip data that measures the connectivity of transcription factors with genes. We seek the transcription factors that are significantly associated with up/down-regulated genes in the long-lived mutants. The schematic representation of our transcriptional inference in the long-lived yeast mutants is shown in Figure 1. According to our analysis, several transcription factors including Msn2/Msn4 and Gis1 are likely to function at the down-stream of the Sch9, PKA, and TOR pathways and may account for the longevity of the corresponding long-lived mutants. Furthermore, our analysis suggests that it is useful to combine microarray gene expression profiles with other data sources such as ChIP-chip data or promoter sequences to extract more biological information.

Microarray data
We extract RNA samples from day 2.5 cells of wild-type as well as three long-lived yeast mutants: sch9∆, tor1∆, and ras2∆, and measured expression levels of 5841 genes using the Affymetrix Yeast2.0 array. It should be noted that all these yeast strains are cultured in minimal medium SDC (synthetic dextrose complete)according to the standard methods for chronological life span measurement [2]. In the SDC medium, a substantial proportion of yeast cells are still dividing before day 2. At older ages, such as day 3-5, most of the cells become hypometabolic, which is associated with a dramatic drop in transcription. Therefore, we harvest mRNA at day 2.5 so that we can extract enough mRNA for microarray experiment while avoid the noise introduced by the transcriptional activities of dividing cells. We compute the log expression ratios for all the genes in each mutant with respect to the wild-type. The expression profile for sch9∆, tor1∆, and ras2∆ show strong similarity with one another, suggesting that Sch9, TOR, and PKA may control the expression of a common set of genes that are crucial for the chronological ageing.

Motif enrichment analysis
We identify 539 putative regulatory motifs from sequences that include up to 800 bp upstream of all yeast genes. Among these putative motifs, 49 can be associated with a transcription factor according to literature or database [25]. Generally, if the activity of a transcription factor is changed as a consequence of some biological events, such as the deletion of SCH9 gene, we will expect to see an enrichment of its binding motif in the promoter regions of up-regulated or down-regulated genes. Motivated by this rationale, we performed enrichment analysis for these 539 putative regulatory motifs. First, we obtain the up-regulated and down-regulated gene sets for each mutant (with respect to the wild type) by setting the threshold to 1.8 fold. Then, for each of these gene set, we test the enrichment of a given motif in their promoter regions. These tests are carried out for all the 539 predicted motifs and a p-value of enrichment significance is assigned to each motif. Finally, we perform multiple testing correc-tions by calculating the corresponding q-values for all these enrichment tests.
In the up-regulated gene set for sch9∆, ras2∆, and tor1∆ (fold change with respect to wild type greater than 1.8), we identify 13, 5, and 8 enriched motifs out of the 537 motifs, respectively, at a significance level of 0.001 (qvalue < 0.001). If we set the significance level to 0.01, the numbers of enriched motifs increase to 77, 43, and 43, respectively. Among these significant motifs, 14 can be associated with known transcription factors including Msn2/Msn4 and Gis1. In the down-regulated gene set for sch9∆, ras2∆, and tor1∆ (fold change with respect to wild type greater than -1.8), no motif is found to be enriched after multiple testing correction even at a significance level of 0.01. The predominance of enriched motifs in up-regulated gene sets suggests that the life span extension of these mutants is mediated by activation rather than repression of some transcription factors. Table 1 shows the motifs that are enriched in the up-regulated gene set for at least one out of the three mutants and whose function or associated transcription factor is known. As shown, motifs associated with transcription factors Fhl1, Msn2/Msn4, and Gis1 are significantly enriched in up-regulated gene set for all the three longlived mutants. The first transcription factor, Fhl1, is known to regulate the transcription of ribosomal protein (RP) genes via TOR and PKA in yeast [10]. In sch9∆, ras2∆, and tor1∆, 233, 444, and 234 genes are up-regulated by at least 1.8 fold relative to wild type. Among them, 29, 27, and 27 are cytosolic RP genes, suggesting a significant enrichment of cytosolic RP genes in their up-regulated gene sets (p-values are 4.9E-15, 6.0E-7, and 3.4E-13, respectively). Although supported by the data, the up-regulation of RP genes is unexpected, considering that PKA and TOR are positive regulators of RP genes [10,12]. It is possible that RP gene expressions change along the chronological ageing process, and they are more expressed at day 2.5 in mutants compared the wild type. In addition, other factors than Fhl1 may be involved in the regulation of RP genes as well. The second transcription factor, Msn2 and the partially redundant factor Msn4, regulate the expression of many stress-responsive genes, including genes encoding heat shock proteins, catalase, and Sod2 [26,27]. Double deletion mutant msn2msn4∆ is highly sensitive to different stresses, including heat shock, carbon source starvation, and oxidative stresses. Activity of Msn2/Msn4 is negatively regulated by PKA kinase by nuclear exclusion [28][29][30]. Moreover, Msn2/Msn4 is required for the life span extension in mutations that decrease the activity of Ras2 (ras2∆) or Cyr1 (cyr1::mTn) [4,5]. The third transcription factor, Gis1, is also negatively regulated by the PKA activity, and mediates gene Scheme of transcriptional inference in the long-lived yeast mutants Figure 1 Scheme of transcriptional inference in the long-lived yeast mutants.

Significantly enriched motifs
expression during nutrient limitation [20]. Enrichment of motifs associated with Msn2/Msn4 and Gis1 in the upregulated gene set suggests the important roles played by stress response genes in life span extension of the three long-lived mutants: sch9∆, ras2∆, and tor1∆.
Other than Fhl1, Msn2/Msn4 and Gis1 binding motifs, motifs associated with other transcription factors are also enriched in the up-regulated gene set of sch9∆, ras2∆ or tor1∆. For example, the binding motif of transcription factor Hsf1, which is also involved in stress response, is significantly enriched in ras2∆ (q-value is 0.0090) [31,32]; the binding motif of transcription factor Mig1, which is involved in glucose repression, is significantly enriched in ras2∆ (q-value is 0.0037) [33]; the binding motif of transcription factor Sum1, which is a dominant suppressor of mutant of silent information regulator genes, is significantly enriched in ras2∆ (q-values is 0.0052) and tor1∆ (qvalues is 0.0043) [34,35]. Further studies of these transcription factors under the sch9∆, ras2∆, or tor1∆ background may shed new light on the mechanism of enhanced longevity in these mutants.

Stability of enrichment analysis
To show the effect of threshold setting, we performed motif enrichment analysis using different threshold values for up-regulation and down-regulation. For a wide range of thresholds from 2-fold to 1.4-fold, our analysis achieves similar results. First, the total number of significantly enriched motifs in the up-regulated gene set does not change much with different thresholds. Secondly, we identify almost the same set of significant enriched motifs using different thresholds. Thirdly, we do not identify any significantly enriched motifs in the down-regulated gene sets using all these threshold values.
As shown in Figure 2, Gis1 and Msn2/Msn4 binding motifs exhibit significant enrichment in the up-regulated gene sets corresponding to different thresholds. Generally, a smaller threshold results in larger up-and downregulated gene sets, based on which the enrichment analysis is more reliable and sensitive. As shown in Figure 2, the significance level of motif enrichment increases with the decrease of the threshold, suggesting a higher sensitivity for larger gene set. On the other hand, the threshold should be high enough to ensure that most genes in the up-or down-regulated gene set reflect real biological expression difference rather than background noises.

ChIP-chip based analysis
In the motif enrichment analysis, those 539 motifs are identified from DNA sequence using a de novo method.   The presence of a motif in the up-stream region of a gene is determined computationally and many of them may not be functional binding sites of transcription factors. For example, based on the computation, we find that 1849 out of the 5841 yeast genes have at least one Gis1 binding sites in their promoter (800 bp up-stream of translation initiation site) region. It is possible that the in silico method overestimates the number of genes regulated by a specific TF. If we know the target genes regulated by a transcription factor, we do not need to rely on computation methods for target gene identification and we may improve the accuracy of transcription inference.

Effect of cutoff value on enrichment analysis result
The ChIP-chip experiment provides us with the information about the interaction between transcription factors and genes on a genomic scale. In yeast, the genomic occupancy of 203 transcription factors in rich media conditions was determined in a systematic ChIP-chip experiment [36]. For some of the transcription factors, their target genes in different experimental conditions, such as in heat shock, rapamycin treatment etc, were also determined. Based on the ChIP-chip data, we define 350 gene sets, each corresponding to a transcription factor under a specific condition. Among them, 203 gene sets correspond to target genes of these 203 transcription factors in YPD medium; the rest of them correspond to ChIPchip results under other conditions. We performed gene set enrichment analysis (GSEA) for these 350 transcription factor gene sets (TF gene sets) [37]. GSEA analysis is a permutation based method to test if a set of genes tend to have high rank or low rank in a rank list, e.g. log expression rank list. In this work, we use GSEA analysis to test whether genes in a TF gene set tend to have high or low log expression values. By using this method, we identify 46 TF gene sets that are significantly enriched in at least one of the three long-lived yeast mutants at a FDR of 0.01. The accuracy of p-value assigned by GSEA analysis depends on the number of permutations which can not be too large considering the computation complexity. To obtain a more accurate p-value and to infer whether a gene set is positively or negatively affected, we carry out the Wilcoxon rank test to the 46 TF gene sets.
We compare the log expression values of genes in a TF gene set with the whole genome expression background (all the other genes) using the Wilcoxon rank test. In comparison with the wild type, if the target genes of a transcription factor tend to be up-regulated in a long-lived mutant according to the test, then we may infer that the activity of this transcription factor is positively affected in the mutant. Conversely, if the target genes of a transcription factor tend to be down-regulated, then we infer that the activity of this transcription factor is negatively affected. In total, we identify 29 positively affected TF gene sets involving 22 transcription factors from 7 ChIP-chip experiment conditions (see Table 2), and 6 negatively affected gene sets involving 5 transcription factors from 2 ChIP-chip experiment conditions (see Table 3). The ChIPchip experiment conditions includes YPD (rich nutrient medium), H2O2Hi (highly hyperoxic, 4 mM H2O2), H2O2Lo (moderately hyperoxic, 0.4 mM H2O2), SM (amino acid starvation, 0.2 mg/ml sulfometuron methyl), Acid (acidic medium, 0.05 M succinic acid), RAPA (nutrient deprivation, 100 nM rapamycin), BUT14 (filamentation inducing, 1% butanol).
Among the positively affected transcription factors shown in Table 2 Despite that many transcription factors are positively affected in the long-lived mutants, we identify only a few negatively affected TF gene set using ChIP-chip based analysis (see Table 3), similar to the results by motif enrichment analysis. Hap4 forms a glucose-repressed complex with Hap2, Hap3, and Hap5, which functions as a global positive regulator of respiratory gene expression [43]. Rtg3 involves in the retrograde regulation in response to a mitochondrial defect [44]. Abf1 is a multifunctional global regulator for genes involved in a diverse range of cellular processes including carbon source regulation, nitrogen utilization, sporulation, meiosis, and ribosomal function [45,46]. Gcn4 activates the expression of amino acid biosynthetic genes in response to amino acid starvation [47]. The decrease of activity of these transcription factors might reflect the reduced respiration and low metabolic rate in the long-lived mutants [43][44][45][46][47]. In support of this hypothesis, we find many genes involved the oxidative phosphorylation and the TCA pathway (tricarboxylic acid cycle) are down-regulated. Interestingly, Rtg3 and Gcn4 are shown to be activated by TOR inhibition in previous studies [48,49]. We note that our inference is about association but not about causality by nature. The inconsistency of our results with previous studies may indicates that confounding factors exist and further investigation is necessary to understand the complete story.

Comparison of the two methods
The results from motif based analysis and ChIP-chip based methods are roughly consistent with each other in the following way: (1) the transcription factors identified by motif based analysis tend to have small p-values in results of ChIP-chip based analysis, and vice versa; (2) both methods identify much more positively affected transcription factors, whereas no or only a few negatively affected transcription factors are identified; (3) both methods suggest stress response transcription factors may play important roles in the chronological life span extension of the mutants; (4) the transcription activator Fhl1 is found to be strongly associated with life span extension by both methods. On the other hand, there is also some inconsistency between the two methods, which may arise from the following reasons: (1) only about 50 motifs can be associated with known transcription factors according to literatures and transcription factor databases; (2) the binding information is not available for some transcrip-  tion factors in the ChIP-chip data, such as Gis1; (3) The binding targets for some transcription factors are condition-dependent and none of the condition in the ChIPchip data match our microarray experiment condition perfectly. As more data sets related to transcription factors are available, we would expect to improve substantially the accuracy of the analysis. For example, a perfect match between the conditions for microarray and ChIP-chip experiment would improve the results. We could infer the activity of more transcription factors by using the motif based method with the accumulation of binding information in the transcription databases, such as the TRANSFAC [50]. Fortunately, these kinds of data and information are accumulating rapidly and we can make more reliable inference based on new available data.

Conclusion
We have demonstrated how to infer the activity modification of transcription factors in the long-lived mutants with respect to wild-type yeast by integrating microarray expression data with promoter sequence data and ChIPchip data. Both the motif and ChIP-chip data based analysis suggest that some transcription factors related to stress response or ribosomal genes may play important roles in the yeast chronological ageing. Interestingly, based on our analysis, the activities of Msn2/Msn4 and Gis1 are positively regulated in all the three mutants: sch9∆, ras2∆, and tor1∆, which is consistent with previous studies. Moreover, we find some other interesting transcription factors that may also involve in the transcription regulation at the downstream of Sch9, PKA and TOR pathways. Finally, our analysis provides a framework for transcription inference by integrating microarray data with other data sources.

Microarray experiment and data processing
Gene expression in four yeast strains, wild type, sch9∆, ras2∆, and tor1∆, is measured using DNA microarray analysis. Yeast cells from all strains are cultured in nutrient Affy Package is adopted to process the microarray data [51]. The "Invariant Set" approach is used for normalization at the probe level, and the "Model based" method is used to summarize and obtain expression for each probe set [52]. High consistency is achieved between the replicates from the same strain, with the Pearson correlation coefficients greater than 0.96 at the gene level.
The Yeast 2.0 Array contains probe sets for both two yeast species: S.cerevisiae and S.pombe. Probe sets for S.pombe are excluded and only probe sets for S.cerevisiae are considered in later analysis. To calculate the gene expression change between two strains (each has 3 replicates), we compute the fold change for each pair of comparison. 3 × 3 comparisons result in 9 ratios, which are average to get the mean fold change (FC) of each probe set. For all the S.cerevisiae probe sets, the mean FC is calculated in three comparisons: sch9∆/wt, ras2∆/wt, and tor1∆/wt. Most genes are represented each by a single probe set in Yeast 2.0 Array. For genes represented by more than one probe sets, we average the mean FCs of probe sets associated with them to obtain the gene level expression change.

Motif identification and prediction
To obtain the potential regulatory motifs, we refer to the methods used by Beer et al. and downloaded the motif data from [53]. Beer et al. identified the significantly enriched motifs in the promoter regions of all yeast genes using AlignACE software [54,55]. The promoter region was defined as the DNA sequence from translation initiation site up to 800 bp upstream. Based on their motif data, we identify 539 significant motifs from the promoter regions after removing the redundancy. Among these motifs, 48 are associated with known transcriptional factors according to literatures.

Motif enrichment analysis
We use the so-called motif enrichment analysis to identify the motifs associated with gene expression change between two yeast strains, i.e. sch9∆/wt. If a motif is indeed related to expression change in sch9∆/wt, for example, as a consequence of activation/repression of its associated transcription factor in sch9∆, we would expect to see the enrichment of genes with the motif in the up/ down-regulated gene set. Specifically, we use the Fisher exact test to identify significant enriched motifs in a up/ down-regulated gene sets.
Suppose among all the N yeast genes, M genes contain a pre-defined motif, the remaining N -M genes do not contain this motif. We denote X as the number of genes that contain a given motif in a gene set of size K, and X follows a Hypergeometric distribution. That is, For each motif we calculate the p-value defined as Pr(X ≥ x|M, N -M), which is the probability of observing x or more genes with the motif in their promoter regions. We test the significance of enrichment for all the 539 motifs in the up-and down-regulated gene sets of sch9∆/wt, ras2∆/wt, and tor1∆/wt. To correct for the multiple testing, we compute the q-value using the "qvalue" package for R software [56].

GSEA analysis
Gene set enrichment analysis is an approach to testing whether a set of genes, as a whole, are up-regulated or down-regulated compared to other gene sets [37]. Give an expression profile, we rank the log expression value of all the genes e i , i = 1, ... N in the decreasing order as denoted by L = {g 1 , g 2 , ... , g N }, where N is the total number of genes in the list. Then we evaluate the fraction of genes in a gene set S of size N H (hit) weighted by absolute expression values, and the fraction of genes not in S (missing) present up to a given position i in the list L as following: The maximum deviation from zero of P hit -P miss is defined as enrichment score (ES) for the gene set S. Finally, the expression profile is permutated for many times to compute permutated ES values and the significance is determined by the percentage of permutated ES values that are larger or equal to the real ES value. In this paper, we do 10,000 permutations to estimate the significance of enrichment for a gene set. Again q-values are calculated for multiple testing correction.