Combined analysis of expression data and transcription factor binding sites in the yeast genome

  • Vijayalakshmi H Nagaraj1Email author,

    Affiliated with

    • Ruadhan A O'Flanagan2Email author,

      Affiliated with

      • Adrian R Bruning3,

        Affiliated with

        • Jonathan R Mathias3, 4,

          Affiliated with

          • Andrew K Vershon3 and

            Affiliated with

            • Anirvan M Sengupta1, 2Email author

              Affiliated with

              BMC Genomics20045:59

              DOI: 10.1186/1471-2164-5-59

              Received: 30 April 2004

              Accepted: 26 August 2004

              Published: 26 August 2004



              The analysis of gene expression using DNA microarrays provides genome wide profiles of the genes controlled by the presence or absence of a specific transcription factor. However, the question arises of whether a change in the level of transcription of a specific gene is caused by the transcription factor acting directly at the promoter of the gene or through regulation of other transcription factors working at the promoter.


              To address this problem we have devised a computational method that combines microarray expression and site preference data. We have tested this approach by identifying functional targets of thea1-α2 complex, which represses haploid-specific genes in the yeastSaccharomyces cerevisiae. Our analysis identified many known or suspected haploid-specific genes that are direct targets of thea1-α2 complex, as well as a number of previously uncharacterized targets. We were also able to identify a number of haploid-specific genes which do not appear to be direct targets of thea1-α2 complex, as well asa1-α2 target sites that do not repress transcription of nearby genes. Our method has a much lower false positive rate when compared to some of the conventional bioinformatic approaches.


              These findings show advantages of combining these two forms of data to investigate the mechanism of co-regulation of specific sets of genes.


              A bioinformatic approach to identifying cis-regulatory elements controlling transcription has become feasible with the availability of complete genome sequences and large scale expression data using high-throughput methods such as microarrays [1,2] and SAGE [3]. The expression data provides a list of genes whose expression is significantly modified under a particular condition. However, this data does not indicate whether these genes are direct targets of a particular transcription factor or if the changes in expression are the result of an indirect effect caused by altering the expression of other transcription factors that work directly at the promoter. Using information about sequence preference for binding of particular transcription factors, one can identify possible regulatory binding sites within a sequenced genome. However, this approach does not indicate if the sites are functional. We have therefore developed an algorithm that combines both of these approaches to distinguish between the direct and indirect targets that are regulated by a particular transcription factor.

              We have applied this methodology to study the transcriptional regulatory system that specifies cell mating-type in the yeastSaccharomyces cerevisiae[4]. Yeast have three cell types, haploidaandαcells, and thea/αdiploid, that differ in their ability to mate and in the proteins they express. Cell mating-type is determined in part byα2 anda1, which are cell-type-specific proteins that are members of the homeodomain (HD) DNA-binding family. In ana/αdiploid cell,α2 binds witha1 to form a heterodimer complex that represses transcription of haploid-specific genes [5]. The crystal structures of theα2 HD binding DNA alone and in complex witha1 have been solved, providing models for how these complexes bind DNA [6,7]. Biochemical and mutational analysis of each protein and their DNA-binding sites have defined the requirements for DNA recognition by this complex [810]. Genome-wide expression analysis has also been performed on each of the different cell types [11]. The combination of these resources has allowed us to develop and test algorithms to identify target sites for thea1-α2 complex. Previous work, using a relatively simple binding site search program identified targets for theα2-Mcm1 complex, which repressesa-cell-type specific genes inαanda/αcells [12]. The more advanced methods described in this paper have helped identify several novel targets of thea1-α2 complex that may be involved in cell-type specific processes. Interestingly, we identified several genes that are repressed in diploid cells but do not appear to be direct targets of thea1-α2 complex, suggesting that these genes are controlled by another transcriptional regulatory factor that is directly or indirectly regulated by thea1-α2 complex. We have also identified a number ofa1-α2 target sites that do not repress adjacent genes. The combination of site preference and expression data is therefore a valuable tool to identify direct functional targets of a transcription factor or complex.


              Development of a search algorithm for targets of the a1-α2 complex

              To generate an algorithm that combines microarray expression data and mutational analysis of binding sites, we first defined a scoring method that ranks gene expression data. We utilized the microarray expression data from Galitski and coworkers for gene expression in theaandαhaploid anda/ αdiploid cells, as well as various polyploids [11]. Since thea1-α2 complex should be absent in any of the homozygousaorαtype polyploids (a, aa, ...,α,αα,..., etc.) we expect the expression of haploid-specific genes in these cells to be much higher than in cells that are heterozygous for theMATlocus (a α,aa α,a αα,aa αα, etc.). Thus, one term in the scoring function rewards lower expression in heterozygous cell types compared to the homozygous cell types (see the Methods section for details). We expect that most of the haploid-specific genes will be expressed equally in both ofaandαcell types. Consequently, we have introduced a second term in the scoring function that penalizes such differences in expression in the two haploid cell types. This scoring function would identify haploid-specific genes that are repressed in diploid cells, but would not indicate if these genes are direct targets of thea1-α2 repressor complex.

              To identify genes from this ranking that are directly repressed by thea1-α2 complex we used the available mutational data on thea1-α2-binding site [9]. In these experiments, the effects of single base pair mutations of thea1-α2 consensus binding site were measured by assaying their ability to repress transcription of a heterologous promoter and by electrophoretic mobility shift DNA-binding assays (EMSA). Under the assumption that the level of expression is proportional to how often that site is unoccupied, we used the effects of the single base mutations to estimate the parameters for the binding energy of sites with different bases at each position. We then used this information to search for potentially strong binding sites in the promoter regions (in practice, 800 bp upstream of the translation start site) of every gene in the genome. This search provides us with a list of genes with putativea1-α2 binding sites in the promoter, irrespective of functionality.

              Even under ideal conditions, either of the above lists of genes would not directly indicate a haploid-specific gene directly repressed by thea1-α2 complex. In addition, our accuracy is limited by the noise in the microarray expression data, as well as by simplifying assumptions made to utilize single-base mutational data to score arbitrary sequences. We therefore decided to rank the genes using a scoring system that takes into account expression patterns across different mating types, as well as the likelihood of finding a gooda1-α2-binding site in the promoter region of the gene. To test the significance of these composite scores, we generated permuted data, which combined random promoters with the expression data. This analysis suggested that the top 10–15 predictions were significant. The results of experiments on the 24 genes with the highest combined scores from our analysis are shown in Table1.
              Table 1

              Potentiala1-α2 Binding Sites in Haploid-specific Genes



              Expression p-vala

              Binding p-valb

              Combined p-val

              a1-α2 ChIPc

















































































































































              YBR158W d






              YCL066W e

              MAT α 1





              aThe ranking of haploid-specific gene expression determined by analysis of microarray data [11].

              bThe ranking of potentiala1-α2 target of haploid-specific genes.

              cA + indicates that thea1-α2 binds to the promoter by ChIP assay.

              dIdentified in a search for sites with 1500 bp from the start of the ORF

              eIdentified if remove the penalty for expression in one haploid cell type but not the other.

              fIdentified as direct targets of thea1-α2 repressor complex in previous studies.

              a1-α2 binding to the promoter regions of genes identified in the search

              To evaluate the success of our computational algorithm for identifying direct targets of thea1-α2 repressor complex, we assayed binding by the complex to the identified promoters using chromatin-immunoprecipitation (ChIP) assays with polyclonal antibody directed against theα2 protein. Inαhaploid anda/ α-diploid cells theα2 protein combines with the MADS-box transcription factor Mcm1 to bind to elements in the promoters ofa-specific genes to repress their transcription [13,14]. We therefore included in each PCR a primer set for the promoter region of thea-specific geneSTE6to serve as a positive control for the ability to ChIPα2 in both haploidαand diploida/ αcells. This primer set also allowed us to rule out the possibility that the predicteda1-α2 target genes were immunoprecipitating because of binding by theα2-Mcm1 complex. The primer set for theYDL223Cpromoter, a gene not bound or repressed by thea1-α2 orα2-Mcm1 complexes, was included in the reaction as a negative control for non-specific immunoprecipitation of the DNA. A gel displaying the results for a few of the promoters that were assayed is shown in Figure1and the data summarized for all the promoters that were predicted to be directly repressed by thea1-α2 complex is listed in Table1. In general, there is a very good correlation between the experimental data with the predictions based on our computational algorithm. Almost all of the high scoring genes were bound by thea1-α2 complex in vivo. Genes that had a combined p-value higher than the threshold of 1.5 × 10-4(which corresponds to choosing the top 15 of the list in Table1) do not appear to be strongly bound by the complex. This p-value corresponds to roughly a probability of one in six thousand, indicating it is possible to get such combination by chance.
              Figure 1

              ChIP assays of promoter fragments that are predicted to be targets of thea1-α2 complex. ChIP assays with antibody to theα2 protein were performed on lysates fromMAT a/MATα(a/α) andmat aΔ/MATα(Δ/α) cells. Total chromatin (TC) and immunoprecipitated (IP) samples were subjected to multiplex PCR with primers flanking potentiala1-α2 sites (labeledhsg) in the indicated promoters. Primers for the promoter region ofSTE6, ana-specific gene that is repressed by theα2-Mcm1 complex in both cell types were used as a positive control for the ChIPs. Primers that hybridize inYDL223C, a gene that is not regulated byα2, was used as the negative control. The presence of a band over background levels from the test promoter froma/αlysates but notmatΔ/αindicates that thea1-α2 complex is specifically binding to the promoter. Assays for theGPA1, FAR1, NEJ1, RDH54, MET31, PDE1, AMN1/CST13andKAR1are shown. A summary of the all of the ChIPs performed is in Table 1.

              In comparison to higher eukaryotes, most yeast promoters are relatively small and contain activator or repressor binding sites within several hundred base pairs of the start site of the open reading frame (ORF) of the gene. However, there are a few genes, likeHO, whose regulation is controlled by a region several Kb long. Consequently, we did a separate search looking for additionala1-α2 binding sites that are within a region 1.5 Kb upstream of the ORF. Most of the sites identified in this search were well above the threshold value and were not bound by thea1-α2 complex in ChIP assays (data not shown). However, the search identified one site upstream of theAMN1/CST13gene that was a potential target site. The ChIP analysis verified this as a functional target site for thea1-α2 complex in vivo (Figure1, Table1).

              Analysis of haploid-specific genes predicted not to be bound bya1-α2

              Among the top 35 genes in the list ranked by haploid-specific expression score, more than half do not appear to contain an identifiablea1-α2 site by our analysis (lack of a significant binding site being defined as binding p-value greater than 2 × 10-4) (Table2). Although there were no apparent strong affinitya1-α2-binding sites in the promoters of these genes, it is possible that there are several weak affinity sites in the promoter that were not identified in the search. If present, these sites may work cooperatively to increase binding by thea1-α2 complex to the promoter and therefore repress transcription. In support of this model we have found that under some conditions weak affinitya1-α2 sites in theHOpromoter have a role in repression of the promoter (Mathias and Vershon, unpublished). Promoters with a number of weak affinity sites may therefore be directly regulated by thea1-α2 complex. However, it is also possible that these genes are indirectly repressed bya1-α2, through its ability to repress expression of another transcription factor that is required for expression of the genes identified in the microarray. To distinguish between these possibilities we assayed fora1-α2 binding to these promoters by ChIP (Figure2and Table2). As predicted from the site identification analysis, thea1-α2 complex did not appear to bind to most of these promoters. This result suggests that these genes are not directly repressed by the complex. The one exception to our predictions was that thea1-α2 complex appeared to weakly bind to theNEM1promoter in the ChIP assays (Figure2). Interestingly,NEM1is downstream ofGPA1, a gene that is strongly repressed by thea1-α2 complex (Fig1). Binding and repression bya1-α2 atGPA1may help binding to weak sites in theNEM1promoter. Alternatively, although we sheared the DNA used in the ChIP to an average of less than 500 bp, it is possible that there may have been some fragments that spanned the ~2 kb between the genes, thereby giving a positive result in the ChIP assay.
              Table 2

              Haploid-specific Genes that Do Not Containa1-α2 Target Sites



              Expression p-vala

              Binding p-valb

              Combined p-val

              a1-α2 ChIPc

























































































































              aThe ranking of haploid-specific gene expression determined by analysis of microarray data [11].

              bThe ranking of potentiala1-α2 target of haploid-specific genes.

              cA + indicates that thea1-α2 binds to the promoter by ChIP assay.

              Figure 2

              ChIP assays ofa1-α2 binding to promoter fragments that are not predicted to be targets of the complex. ChIP assays were performed as described in Figure 1. ChIP assays fora1-α2 binding to theFUS1,BUD3,NEM1,CAT2,YFL034WandGUP2promoters are shown.

              Comparison with binding site identification by the weight matrix method

              We compare the performance of our algorithm with that of the weight matrix method [1518]. In our study, we derived our parameters from a set of artificial sequences. Usually, the weight matrix has to be constructed from a set of known sites. We calculate the weight matrix fora1-α2 from regulatory elements upstream several known target genes:HO, GPA1, FUS3, AXL1, STE5, RME1andMAT α 1. As usual, one is faced with a choice of threshold weight matrix score for selecting putative sites in the yeast genome. For a stringent threshold that corresponds to the top 16 targets, we recovered all the genes, other thanRME1, used in construction of the weight matrix. However, we did not recover most of the other genuine targets identified, and verified, in this study. If we set the threshold to be lax enough to includeRME1, we obtained 55 candidate genes, includingSTE18andRDH54,but still miss targets likeSTE4. It is likely that most of the 55 putative targets are false positives, as evidenced by lack of haploid-specific regulation in the corresponding gene expression data.

              Overall, we find our method to be more successful than the weight matrix method. The use of mutational data as opposed to literature based data for sequence preference possibly accounts for part of the success (an advantage we may not have for some other transcription factors). However, much of our success has to do with cutting down of false positive rates by using microarray data judiciously.

              Analysis of all potentiala1-α2 target sites in the genome

              Among the genes identified in the computational analysis, there is a good correlation between the presence of stronga1-α2-binding sites in their promoter region and repression in diploid cells. This raises the question of whether alla1-α2-binding sites function as repressor sites. To address this question we searched for all potential binding sites in the yeast genome. As expected, many of the best sites are in the promoters of known or previously identified haploid-specific genes (Table1). However, we also identified a number of putativea1-α2-binding sites within ORFs (Table3). To test if thea1-α2 complex is able to bind to these sites we performed electrophoretic mobility shift assays (EMSAs) with purifiedα2 anda1 proteins and radiolabeled oligonucleotides containing these sites (Fig3A). Thea1-α2 complex bound to sites from theYKL162C,CDC25,PRM8,PRM9, andURB1ORFs with weaker affinity than to a strong binding site from theHOpromoter,HO(10). However, these sites did have slightly better binding affinity than to theHO(8)site, which we have shown is unable to repress transcription on its own (Mathias and Vershon, unpublished).
              Table 3

              Potentiala1-α2 target sites in ORFs



              Expression p-vala

              Binding p-valb

              Combined p-val

              a1-α2 ChIPc


































































































              aThe ranking of haploid-specific gene expression determined by analysis of microarray data [11].

              bThe ranking of potentiala1-α2 target of haploid-specific genes.

              cA + indicates that thea1-α2 binds to the promoter by ChIP assay.

              dRelative binding affinity in fold decrease in affinity compared to theHO(10)binding site.

              Figure 3

              a1-α2 binding in vitro and in vivo to sites in the ORF regions of the genome. (A) An EMSA of purifieda1 andα2 proteins binding to a stronga1-α2 binding site,HO(10)(lanes 1–8) and sites from theYKL162C(lanes 9–13),CDC25(lanes 14–18), a weaka1-α2 binding site from theHOpromoter,HO(8)(lanes 19–23),PRM8(lanes 24–28),PRM9(lanes 29–33) andURB1(lanes 34–38). The concentration of thea1 protein was held constant at 1.4 × 10-6M (lanes 2 and 4–38) and mixed with 5-fold dilutions of theα2 protein starting at 8.2 × 10-8M (lanes 3, 4, 9, 14, 19, 24, 29 and 34)). The EMSAs shown are phosphorimages of the gels. Lane 1 containsHO(10)probe alone, and lanes 2 and 3 contain 1.4 × 10-6Ma1 and 8.2 × 10-8Mα2 respectively. The fold repression by each site in the context of a heterologous promoter is shown below each gene/promoter. (B) ChIP assays for the genesYKL162C,CDC25,PRM8,PRM9andURB1are shown. ChIP assays were performed as described in Figure 1

              Since thea1-α2 complex was able to bind to these sites with weak to moderate affinity in vitro, it is possible these sites may partially repress transcription on their own. To test this model, we cloned these sites into the context of theCYC1promoter driving expression of alacZgene and measured the ability of the sites to repress transcription of the reporter in diploid cells [9]. The sites from theCDC25andURB1ORFs did not repress transcription of the reporter promoter in diploid cells (Fig3A). However, the site fromPRM8ORF, which showed the highest binding affinity among the sites found in ORF regions, weakly (2.8-fold) repressed the reporter promoter. This result indicates that this site can function as a repressor site in vivo if placed in the proper context. We next tested whethera1-α2 bound to these sites in the normal genomic context in vivo by ChIP assays. None of the sites in the ORF regions were bound by thea1-α2 complex (Fig3Band Table3). This result indicates that while they are competent for weak binding and repression in a heterologous promoter, they are unable to repress transcription in their normal genomic context.

              Our search also identified several potentiala1-α2 binding sites in the promoter regions of genes that do not appear to be repressed in diploid cells (Table4). Only theCOX13site had moderate binding affinity for thea1-α2 complex in the EMSAs (Fig4A). However, despite the relatively weak binding affinity of these sites, they were able to partially repress transcription of the reporter in diploid cells (Fig4A). In particular, the sites from theCOX13andREX2promoters showed significant levels of repression. Interestingly, although these sites functioned as repressor sites in the context of the heterologous reporter, except for theCOX13promoter, most of these sites were not bound by thea1-α2 complex at their genomic locations by ChIP assays (Fig4B). These results suggest that the genomic context of most of thesea1-α2 sites prevents binding by the complex.
              Table 4

              Potentiala1-α2 Target Sites in the Promoters of Non-Haploid-specific Genes




              Binding p-val

              Combined p-val

              a1-α2 ChIPb















































































              aThe ranking of haploid-specific gene expression determined by analysis of microarray data [11].

              bA + indicates that thea1-α2 binds to the promoter by ChIP assay. A +/- indicates weak (2-fold) enhancement of the band ina/αcells by ChIP assay.

              Figure 4

              a1-α2 binding in vitro and in vivo to putative binding sites in promoters associated with genes which are not expressed in a haploid-specific manner. (A) An EMSA of purifieda1 andα2 proteins binding to a stronga1-α2 binding site,HO(6)(lanes 1–7),COX13(lanes 8–11),REX2(lanes 12–15),LSM1(lanes 16–19) andFMP14(lanes 20–23). The concentration of thea1 protein was held constant at 1.4 × 10-6M (lanes 2 and 4–23) and mixed with 5-fold dilutions of theα2 protein starting at 8.2 × 10-8M (lanes 3, 4, 8, 12, 16 and 20). The EMSAs shown are phosphorimages of the gels. Lane 1 containsHO(6)probe alone, and lanes 2 and 3 contain 1.4 × 10-6Ma1 and 8.2 × 10-8Mα2 respectively. The fold repression by each site in the context of a heterologous promoter is shown. (B) ChIP assays forCOX13,REX2,LSM1andFMP14were performed as described in Figure 1.


              Genome-wide gene expression data using SAGE or DNA microarrays has provided a wealth of information on the regulation of genes under certain conditions or by specific transcription factors. The combination of this information with sequence analysis programs has enabled researchers to identify potential regulatory sites. For example, in a pioneering paper, Tavazoieet al.clustered expression data and used multiple local sequence alignment algorithms on the promoter regions of the co-clustered genes to discover regulatory motifs [19]. This approach has been further refined by using Bayesian networks to incorporate additional constraints regarding relative positions and the orientations of the motifs [20]. Another approach has been to break the genes into modules and perform module assignments and motif searches at the same time via an expectation maximization algorithm (as opposed to clustering first and finding motifs later) [21,22]. Although these approaches have worked well at identifying potential targets sites one drawback is that the expression patterns have to cluster well for these methods to work. For a small number of microarray experiments, this may always not be the case. A method that does not utilize clustering is a regression model based analysis to locate "words" in the promoter that correlates with modulation of expression [23]. However, this approach is restricted to retrieving functional consensus binding sites in the promoter regions and for transcription factors with low sequence specificity, this approach needs to be modified. Most of these approaches attack the difficult problem of what to do when relatively little is known about the regulatory system and sequence recognition by the protein. Consequently they develop pattern recognition algorithms that are essentially unsupervised. Our focus has been to take advantage, as much as possible, of knowledge about the biological system and use that information combined with expression analysis to identify potential target sites. The minor loss of generality of the tools resulting from such an approach is more than offset by its predictive power.

              To determine if the changes in expression of a specific gene are the result of a transcription factor working at the promoter we developed an algorithm that combines expression data with information on the binding site preference for a transcription factor. As a test for this algorithm we identified genes in yeast that are direct targets for regulation by thea1-α2 repressor complex. We also used this method to identify genes that are repressed in diploid cells but that are not direct targets of the complex, as well as functionala1-α2 binding sites that do not appear to repress transcription in their genomic context. The combination of these sets of findings has provided insight into the regulatory network and mechanism of repression by thea1-α2 complex.

              The primary goal of this study was to identify genes that are direct targets for repression by thea1-α2 complex. There are two major functional subsets among thea1-α2 target genes identified in this analysis (Table1). One, not surprisingly, involves genes that are required for various processes in mating of the two haploid cell-types. These include components of the mating pheromone signal transduction pathway, such asGPA1,STE18,STE4, andSTE5,which are activated in response to the binding of pheromone from the other cell type [24]. This group also includes genes further down that pathway, such asFAR1andFUS3, which are required for cell-cycle arrest before mating. A number of these genes have previously been shown or suspected to be under the control ofa1-α2 repressor complex [25,26]. Repression of these genes in diploid cells is biologically important because it prevents further mating by diploid cells. If diploid cells mate they would form triploids or higher ordered genomic polyploids, which are genetically unstable during meiosis and therefore detrimental to cell survival.

              The second subset of genes identified in the analysis is associated with mating type switching and recombination. TheHOgene is a known target of thea1-α2 complex and its promoter contains 10 binding sites of varying affinity [25]. Repression ofHOis essential in diploid cells because it prevents switching of one of theMATloci to form homozygousa/aorα/αdiploid cells. Although diploid in genomic content, cells homozygous for theMATloci are competent to mate and therefore would form higher order genomic polyploids that are genetically unstable. We have also shown thatNEJ1, which is involved in non-homologous end-joining (NHEJ), is a direct target for thea1-α2 complex [27,28]. It has been proposed that that repression of the NHEJ pathway may promote homologous recombination and crossing over in diploid cells. In addition, we found thatRDH54, a gene involved in double-stranded DNA break repair, is a direct target for thea1-α2 complex [29]. This result is somewhat unexpected becauseRDH54is required for meiosis and null mutants show significantly reduced spore viability. It is likely that thea1-α2 complex only partially reduces the level of expression of the gene and that diploid cells require a lower level of activity of the protein.

              We also identified several genes that fell outside of these two subsets. One isRME1, which encodes a transcriptional repressor ofIME1, the master regulator of meiosis [3032].a1-α2-mediated repression ofRME1is required to allow cells to enter the meiotic pathway in diploid cells. Interestingly, we also found thatPDE1andMET31are weakly, but reproducibly, direct targets for repression by thea1-α2 complex. The Pde1 protein is a low affinity cAMP phosphodiesterase that appears to have a role in response to stress and cell aging [33]. Repression ofPDE1in diploids may partially account for the difference of starvation response between haploids and diploids. Met31 is a zinc finger DNA-binding protein that activates genes involved in sulfur metabolism [34]. It is unclear why this gene would be a target for thea1-α2 complex.

              It is possible that the presence of ana1-α2 target site upstream of a gene that has lower expression in diploid cells was fortuitous and that these sites were not functional targets. However, if this was the case then there would be little pressure to conserve these binding sites through evolution. Several closely related species of yeast have been sequenced and comparison of the corresponding promoter regions has led to the discovery of conserved regulatory motifs [35,36]. Although lack of conservation does not imply non-functionality, significant conservation strongly argues for functionality of a putative regulatory element. To investigate this possibility, we performed a phylogenetic comparison to infer whether these sites are preserved among six sequencedSaccharomycesspecies using the PhyloGibbs program [37]. The program identified thea1-α2 binding site among a promoter set including many known haploid-specific genes (HO, NEJ1, GPA1, STE4,andSTE18). This analysis also showed that thea1-α2 binding sites in theRDH54, PDE1andMET31promoters are strongly conserved among multiple species, suggesting that these sites play an important functional role.

              Our analysis identified a number of haploid-specific genes that do not appear to be direct targets of thea1-α2 repressor complex (Table2). Genes in this list do not contain a recognizablea1-α2-binding site and, with the exception ofNEM1, are not detectably bound by thea1-α2 complex in the ChIP assays. It is possible thata1-α2 indirectly turns off these genes by repressing an activator protein that is required for their expression. However, besidesMET31, there were no obvious genes coding for activator proteins that were direct targets of thea1-α2 complex. It is possible that the haploid-specific genes withouta1-α2 sites are indirectly repressed through more complex mechanisms that involve repression ofRME1.

              We also identified potentiala1-α2-binding sites in the genome that do not appear to repress expression of nearby genes. Although sites from thePRM8,PRM9,CDC25, andLSM1promoters appear to be moderate binding sites for thea1-α2 complex in vitro, ChIP and heterologous reporter assays showed these sites are neither bound by the proteins nor are functional repressor sites in vivo. Many of these sites lie in open reading frames of actively transcribed genes and so it is possible that transcription through the binding site or the chromatin structure of the region prevents high affinity binding by the complex. The model that the genomic context of these sites is important for their regulatory activity is further supported by our results that show that some of these sites, such asCOX13andREX2, function as stronga1-α2 dependent repressor sites in the context of the heterologous promoter. Althougha1-α2 complex is bound to theCOX13site in vivo it does not appear to repress transcription of this gene in diploid cell. Interestingly, this binding site is very close to the end of the coding region ofIME4, an inducer of meiosis that is expressed in diploid cells [38]. TheIME4gene is only expressed in diploid cells and it was thought that thea1-α2 complex may be indirectly activating its expression by repressing a repressor protein, such asRME1. However, the fact thata1-α2 binds to the downstream region of this gene suggests that it may play a direct role in its expression.

              Our data shows that the algorithm we have developed is useful in sorting between direct and indirect targets of a transcription factor. Although we have used mutational data to define the binding site for thea1-α2 complex, in principal binding site sequences derived from site selection experiments may also be used. This analysis may also complement genome-wide ChIP studies to identify the target sites of the transcription factor.


              In summary, we show that combining microarray data with motif analysis, lets us distinguish between the genes that are direct targets of a transcription factor and those that are modulated because of secondary effects. We get excellent agreement of the computational predictions with location analysis by ChIP experiments. We find most of the direct targets ofa1/α2 complex to be involved in the mating pathway, mating type switching, recombination and meiosis. We also found a few weak targets that are possibly involved in sensing and control of the metabolic state. We also see that the sites we predict solely based on single species data are often evolutionarily conserved in other species ofSaccharomyces.


              Combined scoring of genes from microarray data and mutational analysis

              We define a scoring algorithm that ranks gene expression patterns. For geneg, the score is given in terms of the expression in different types of cells (a,αanda/α)

              Score(g) = sgn((X a (g) +X α (g))/2 -X a/α (g)) [(X a (g)) +X α (g))/2 -X a/α (g)]2-A(X a (g) -X α (g))2.

              We initially used the logarithm of expression level of the genegfor the three cell types for the variablesX t(g) with t indicating the type. We have since found that using the complete expression data for the polyploids is a better strategy. In the final results, shown in the paper,X a (g) is the average of the log expression fora,aaandaaa. Likewise,X α (g) is the average fromα, ααandααα.X a/α (g) comes from averaging log expression overaa α,a ααandaa αα. The polyploid averaged quantities tend to be less noisy (demonstrated, for example, by the quantitiesX a andX α being close to each other for the generic gene, which is not regulated by cell type. This, in turn, allows easier detection of genuine haploid-specific targets.

              An explanation of the purpose served by different terms in the overall score is described below. The first term scores well when expression in diploids is lower than the average expression in haploids. The second term penalizes the gene if the expressions in different types of haploids are very different.Ais chosen to be large enough so that knowna-specific genes andα-specific genes score worse than known haploid-specific genes, but not large enough to overwhelm the first term. The optimalAis about 10. Comparison of the performance of our algorithm forA= 1 andA= 10, shows that the biologically known sites almost always stay near the top but further down in the list the second choice is better. The exception is a special gene:MATα1. SinceMATα1 is not present in theMAT atype cell, there is a penalty for expression patterns. Cumulative probability for any gene to have higher score than a gene g isP exprn (g), namely, fraction of genes with score higher than g.

              This scoring function would rank haploid-specific genes high but may not select out genes that are directly regulated by thea1-α2 repressor complex. In order to select for genes with an upstream region with a stronga1-α2 repressor, we used the binding site mutational data available [9]. In these experiments, repression of a heterologous promoter, incorporating single site mutations of a consensus binding site of thea1-α2 repressor, was measured. Under the assumption that the degree of repression is inversely proportional to how often that site is occupied, we derived the expression: 1/Repression ∝ [1 - 1/(1 + e βE(S)/z)] ≈ e βE(S)/zassuming near saturation of binding. The symbolzrepresents the fugacity andβis inverse ofk B T,k B being the boltzman constant. The binding (free) energy is given byE(S) = Σ ib ε ib S ib , within the single base model [15,16]. The indexiruns over the positions in the motif andbruns over the bases A, C, G, T.S ib is 1 or 0 depending upon whether thei-th base isbor not. The parametersε ib represent effects of single base changes on the binding (free) energy. They are related to the weight matrix parameters [17,18] widely used to characterized variable motifs. Note that e βE(S)/zwould more commonly be represented as (K/ [Protein])•exp(ΔG(S)/RT) in the biochemistry literature [18,39]. The independent base model is only an approximation and mutations in nearest neighbor sites could produce effects that we cannot estimate from the existing data. There is a better separation between well-known sites and generic sequences if an extra penalty is added to the score for neighboring base pairs which both differ from the consensus. In this way every base different from consensus and neighboring another base different from consensus draws an additional penalty to the binding energy score. This parameter was set to be ln(2), by experience. Although this method prevented many false positives, it also penalized a few genuine candidates, such as the binding sites in the promoters ofFAR1andMATα1. Thus, from the effect of the single base mutations, we estimated the parametersε ib . Armed with these parameters, we found the probabilityP(E|ε,L) that a random sequence of a certain length,L, would have a subsequence of binding energy greater thanE. For gene g, the strongest site in the upstream region of lengthLwould have binding energyE g . Low values ofP binding (g) =P(E g |ε,L), indicated the presence of a good binding site.

              The genes were ordered according to the lowest value of a combined p-value,P exprn (g)P binding (g), and then ranked as candidates for haploid-specific genes that are directly repressed by thea1-α2 complex. One of the issues in such studies is how to decide on how many of the top candidates are significant. This problem occurs even in solely, sequence based analysis as well [40]. In our study, we generated random combinations of expression p-values with scrambled binding p-values, so that we could choose a cutoff threshold by comparing the ordered p-values with the ordered "scrambled" p-values. Figure 5 plots these sorted p-values against average (log) sorted p-values for the random combinations. The comparison suggests that only about 10–15 top candidates in the list are significant.
              Figure 5

              Significance of combined p-values. Natural logarithm of combined p-values for twenty permutations/scrambles generated, sorted and plotted (in blue) against average of log p-value. The genuine combined value is plotted in red.

              Weight matrix based search

              A weight matrix [1518] search for binding sites was performed using a set of known sites [9] to construct the matrix. Each matrix entry,w ia , was set to log((f ia +δ)/(P a +δ)), wheref ia is the frequency with which base a appears in theith position in the known sites,P a is the frequency with which base a appears in the promoters of genes andδis a small number added to ensure that the weight matrix score is finite even whenf ia = 0. Each subsequence,S ia , of length 20 in each promoter was assigned a weight matrix score Σ ia S ia w ia . After a threshold score is chosen, sites scoring above that threshold are declared to be binding sites.

              Automated primer generation

              An automated procedure for generating primers flanking a specified site in the genome sequence,σ, was implemented. To each pair of numbers,d u , andd d , representing primer distances upstream and downstream of the candidate binding site respectively, and primer lengthsl u , andl d , a score is assigned via

              S(d u ,d d ,l u ,l d ,σ) = - Σ a k a (P a (d u ,d d ,l u ,l d ,σ) -
              where theP a are functions including the melting temperatures, distances of upstream and downstream primers from the candidate site, total length of the bound region, and the difference between the primer melting temperatures. The
              s are the preferred values of those functions. Thek a are adjusted to reflect the relative importance of the parameters; for example it is more important that the difference in melting temperatures be close to zero than it is that the distance to the upstream primer match the distance to the downstream primer.

              Values ofd u ,d d ,l u andl d are restricted to those whose corresponding primers have GG, GC, CG, or CC at the end nearest the candidate site. Primers are identified by selecting the values ofd u ,d d ,l u andl d which maximizeS.

              [A web-based interface to this algorithm is available athttp://​hill-226-174.​rutgers.​edu/​]

              Chromatin immunoprecipitation

              Chromatin immunoprecipitation (ChIP) was carried out as described previously [41] with the following modifications. One liter of JRY103 (MAT α /MAT a ade2-1/ADE2 HIS3/his3-11,15 leu2-3,112/leu2-3,112 trp1-1/trp1-1 ura3-1/ura3-1 ash1Δ::LEU2/ash1Δ::LEU2) and JRY118 (MAT α /mat a Δ::TRP1 ade2-1/ADE2 HIS3/his3-11,15 leu2-3,112/leu2-3,112 trp1-1/trp1-1 ura3-1/ura3-1 ash1Δ::LEU2/ash1Δ::LEU2) cultures were grown to an A600 of 0.5 and treated with 1% formaldehyde for 20 min at RT on a rotating shaker at low speed. Cells were collected, washed 2X with cold 1XTBS. Equal volumes of cells were aliquoted into ten 1.5 ml microfuge tubes, washed once with 1.5 ml of cold 1X TBS. The pellets in each tube were resuspended with 400μl of lysis buffer (50 mM HEPES, pH 7.4, 150 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% Na-Deoxycholate) plus 1 mM PMSF, 1 mM benzamidine, and 1X Protease inhibitor cocktail from Roche (Cat No. 1873580) and also manufacturer recommended concentration of protease inhibitor cocktail from SIGMA (Cat No., P 8215). To this 200μl of glass beads were added to each tube and lysed using a multitube vortexer at full speed for 30 min at 4°C. The lysate was transferred in a new tube and 400μl of lysis buffer was added and vortexed briefly. The lysates were centrifuged at 12,000 g for 10 min at 4°C and the supernatants were sonicated at 30% output for four 10 sec cycles with intermittent cooling on ice.

              The lysates were cleared by centrifugation at 12,000 g for 10 min and 1 mM PMSF was added to the samples. A 1/10thvolume aliquot was removed and frozen to be used as total chromatin control. The remaining sample was precleared by the addition of 25μl recombinant protein G-agarose beads, incubated while nutating for 30 min and the supernatant was collected after centrifugation at 12,000 for 5 min. 1μl of rabbit anti-α2 antiserum (a gift from A. Johnson, UCSF) was added to each supernatant of the samples and incubated 12 h on a nutator at 4°C. To immunopreciptiateα2 50μl of recombinant protein G agarose beads (Roche) was added to the samples and nutated for 90 minutes at 4°C. The protein G beads were pelleted, washed once in low salt buffer (0.1%SDS, 1% Triton X-100, 20 mM Tris pH8.0, 2 mM EDTA and 150 mM NaCl), once in high salt (composition same as lowsalt + 500 mM NaCl), once in LiCl buffer (0.25 M LiCl, 1% IGEPAL, 1XTE and 1% Na-Deoxycholate) and twice with 1XTE (pH8.0). The immunoprecipitated DNA was eluted twice with 250μl of elution buffer (1%SDS and 0.1 M NaHCO3) and the eluates were pooled (500μl final volume). To this 20μl of 5 M NaCl was added and incubated 12 h at 65°C. To remove the crosslinks, 10μl of 0.5 M EDTA, 20μl of 1 M Tris-HCl, pH 7.5 and 2μl of proteinase K (10 mg/ml) was added and incubated for 45 minutes at 45°C. The DNA samples were extracted once with Phenol:chloroform:Isoamylalcohol and the DNA was ethanol precipitated, washed once with 70% ethanol and resuspended in 50μl (IP) or 500μl (TC) TE.

              Purified DNA from the immunoprecipitated samples was subjected to multiplex PCR amplification with primers specific for theSTE6promoter as a positive control for the immunoprecipitation ofα2 and theYDL223CORF as a negative control for nonspecific immunoprecipitation, along with the specific primers for candidateα2-a1 target sites. PCRs were carried out in 50μl containing 10 pmols of each primer, 0.2 mM dNTPs, 2 mM MgCl2, 1X Eppendorf Taq buffer, 0.5X Taq Master buffer and 2.5 U of Eppendorf Taq polymerase. The amplifications were carried out at 94°C for 1 min and 30 secs, followed by 25 cycles of 94°C for 30 secs, 52°C for 1 min, and 72°C for 30 secs and a final extension step of 7 min at 72°C. The PCR products were separated on 2.5% agarose gels.

              Electrophoretic mobility shift assays

              Oligonucleotides containing the predicteda1-α2 binding sites from within the ORFs ofURB1,PRM8,PRM9,YKL162CandCDC25and the promoters ofCOX13, REX2, LSM1, andFMP14were synthesized, one strand was end-labeled with [γ-32P]-ATP, and then annealed with excess cold complementary oligonucleotide. TheHO(10) andHO(8)a1-α2 sites within Upstream Regulatory Sequence 1 (URS1) of theHOpromoter were used as strong and weak binding sites respectively. The EMSA was performed as described previously [42], using a constant 1.4μMa1 and five-fold titrations ofα2 starting at 82 nM in protein dilution buffer (50 mM Tris pH 7.6. 1 mM EDTA, 500 mM NaCl, 10 mM 2-mercaptoethanol, 10 mg/ml bovine serum albumin).

              β-galactosidase assays

              Oligonucleotides containinga1-α2 binding sites were synthesized with 5' overhangs to allow cloning into theXhoI site of pTBA23 (2μ URA3Ampr), a reporter plasmid containing aCYC1-lacZfusion [43]. Reporter constructs were transformed into JRY103 and JRY118 and theβ-galactosidase activity was measured on three independent transformants, as described previously [14].

              List of Abbreviations


              Serial Analysis of Gene Expression




              Open Reading Frame


              Chromatin immunoprecipitation


              Non-Homologous End Joining


              Total Chromatin


              Immunoprecipitated sample


              Electrophoretic Mobility Shift Assay


              Polymerase Chain Reaction


              Phenyl Methyl Sulfonyl Fluoride


              Ethylenediaminetetraacetic Acid



              We thank Alexander Johnson for the rabbitα2 antibody and Yvette Green for help with some of the initial ChIP assays. We also thank Rahul Siddharthan for running PhyloGibbs on the candidate direct target promoter set. JM was supported by a Charles and Johanna Busch predoctoral fellowship. This work was partially supported by a grant from the National Institutes of Health to AKV (GM49265).

              Authors’ Affiliations

              BioMaPS Institute, Rutgers University
              Department of Physics and Astronomy, Rutgers University
              Waksman Institute and Department of Molecular Biology and Biochemistry, Rutgers University
              Medical Sciences Center, University of Wisconsin


              1. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL:Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol1996,14:1675–80.View ArticlePubMed
              2. DeRisi JL, Iyer VR, Brown PO:Exploring the metabolic and genetic control of gene expression on a genomic scale. Science1997,278:680–6.View ArticlePubMed
              3. Velculescu VE, Vogelstein B, Kinzler KW:Analysing uncharted transcriptomes with SAGE. Trends Genet2000,16:423–5.View ArticlePubMed
              4. Johnson A:A combinatorial regulatory circuit in budding yeast. In Transcriptional Regulation (Edited by: McKnight SL, Yamamoto KR).Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press 1992, 975–1006.
              5. Goutte C, Johnson AD:Recognition of a DNA operator by a dimer composed of two different homeodomain proteins. EMBO J1994,13:1434–42.PubMed
              6. Li T, Stark MR, Johnson AD, Wolberger C:Crystal structure of the MATa1/MATα2 homeodomain heterodimer bound to DNA. Science1995,270:262–269.View ArticlePubMed
              7. Wolberger C, Vershon AK, Liu B, Johnson AD, Pabo CO:Crystal structure of a MATα2 homeodomain-operator complex suggests a general model for homeodomain-DNA interactions. Cell1991,67:517–528.View ArticlePubMed
              8. Zhong H, Vershon AK:The yeast homeodomain protein MATalpha2 shows extended DNA binding specificity in complex with Mcm1. J Biol Chem1997,272:8402–9.View ArticlePubMed
              9. Jin Y, Zhong H, Vershon AK:The yeast a1 and alpha2 homeodomain proteins do not contribute equally to heterodimeric DNA binding. Mol Cell Biol1999,19:585–93.PubMed
              10. Mathias JR, Zhong H, Jin Y, Vershon AK:Altering the DNA-binding specificity of the yeast matalpha 2 homeodomain protein. J Biol Chem2001,276:32696–703.View ArticlePubMed
              11. Galitski T, Saldanha AJ, Styles CA, Lander ES, Fink GR:Ploidy regulation of gene expression. Science1999,285:251–4.View ArticlePubMed
              12. Zhong H, McCord R, Vershon AK:Identification of target sites of the alpha2-Mcm1 repressor complex in the yeast genome. Genome Res1999,9:1040–7.View ArticlePubMed
              13. Passmore S, Maine GT, Elble R, Christ C, Tye BK:Saccharomyces cerevisiae protein involved in plasmid maintenance is necessary for mating of MAT alpha cells. J Mol Biol1988,204:593–606.View ArticlePubMed
              14. Keleher CA, Passmore S, Johnson AD:Yeast repressor alpha 2 binds to its operator cooperatively with yeast protein Mcm1. Mol Cell Biol1989,9:5228–30.PubMed
              15. Berg OG, von Hippel PH:Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol1987,193:723–50.View ArticlePubMed
              16. Berg OG, von Hippel PH:Selection of DNA binding sites by regulatory proteins. II. The binding specificity of cyclic AMP receptor protein to recognition sites. J Mol Biol1988,200:709–23.View ArticlePubMed
              17. Staden R:Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res1984,12:505–19.View ArticlePubMed
              18. Stormo GD, Fields DS:Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci1998,23:109–13.View ArticlePubMed
              19. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM:Systematic determination of genetic network architecture. Nat Genet1999,22:281–5.View ArticlePubMed
              20. Beer MA, Tavazoie S:Predicting gene expression from sequence. Cell2004,117:185–98.View ArticlePubMed
              21. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N:Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet2003,34:166–76.View ArticlePubMed
              22. Segal E, Yelensky R, Koller D:Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics2003,19(Suppl 1):i273–82.View ArticlePubMed
              23. Bussemaker HJ, Li H, Siggia ED:Regulatory element detection using correlation with expression. Nat Genet2001,27:167–71.View ArticlePubMed
              24. Sprague GF, Thorner JW:Pheromone response and signal transduction during mating process of Saccharomyces cerevisiae. In The Molecular and Cellular Biology of the Yeast Saccharomyces - Gene Expression (Edited by: Broach JR).Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press 1992, 657–744.
              25. Miller AM, MacKay VL, Nasmyth KA:Identification and comparison of two sequence elements that confer cell-type specific transcription in yeast. Nature1985,314:598–603.View ArticlePubMed
              26. Liu H, Styles CA, Fink GR:Elements of the yeast pheromone response pathway required for filamentous growth of diploids. Science1993,262:1741–4.View ArticlePubMed
              27. Valencia M, Bentele M, Vaze MB, Herrmann G, Kraus E, Lee SE, Schar P, Haber JE:NEJ1 controls non-homologous end joining in Saccharomyces cerevisiae. Nature2001,414:666–9.View ArticlePubMed
              28. Kegel A, Sjostrand JO, Astrom SU:Nej1p, a cell type-specific regulator of nonhomologous end joining in yeast. Curr Biol2001,11:1611–7.View ArticlePubMed
              29. Klein HL:Mutations in recombinational repair and in checkpoint control genes suppress the lethal combination of srs2Delta with other DNA repair genes in Saccharomyces cerevisiae. Genetics2001,157:557–65.PubMed
              30. Mitchell AP, Herskowitz I:Activation of meiosis and sporulation by repression of the RME1 product in yeast. Nature1986,319:738–42.View ArticlePubMed
              31. Kassir Y, Granot D, Simchen G:IME1, a positive regulator gene of meiosis in S. cerevisiae. Cell1988,52:853–62.View ArticlePubMed
              32. Covitz PA, Herskowitz I, Mitchell AP:The yeast RME1 gene encodes a putative zinc finger protein that is directly repressed by a1-alpha 2. Genes Dev1991,5:1982–9.View ArticlePubMed
              33. Ma P, Wera S, Van Dijck P, Thevelein JM:The PDE1-encoded low-affinity phosphodiesterase in the yeast Saccharomyces cerevisiae has a specific function in controlling agonist-induced cAMP signaling. Mol Biol Cell1999,10:91–104.PubMed
              34. Blaiseau PL, Isnard AD, Surdin-Kerjan Y, Thomas D:Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol1997,17:3640–8.PubMed
              35. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M:Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science2003,301:71–6.View ArticlePubMed
              36. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES:Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature2003,423:241–54.View ArticlePubMed
              37. Siddharthan R, van Nimwegen E, Siggia ED:PhyloGibbs: A Gibbs sampler incorporating phylogenetic information. In The First Annual RECOMB Satellite Workshop on Regulatory GenomicsUniversity of California, San Diego 2004.
              38. Shah JC, Clancy MJ:IME4, a gene that mediates MAT and nutritional control of meiosis in Saccharomyces cerevisiae. Mol Cell Biol1992,12:1078–86.PubMed
              39. Liu X, Clarke ND:Rationalization of gene regulation by a eukaryotic transcription factor: calculation of regulatory region occupancy from predicted binding affinities. J Mol Biol2002,323:1–8.View ArticlePubMed
              40. Djordjevic M, Sengupta AM, Shraiman BI:A biophysical approach to transcription factor binding site discovery. Genome Res2003,13:2381–90.View ArticlePubMed
              41. Kuo MH, Allis CD:In vivo cross-linking and immunoprecipitation for studying dynamic Protein:DNA associations in a chromatin environment. Methods1999,19:425–33.View ArticlePubMed
              42. Hart B, Mathias JR, Ott D, McNaughton L, Anderson JS, Vershon AK, Baxter SM:Engineered improvements in DNA-binding function of the MATa1 homeodomain reveal structural changes involved in combinatorial control. J Mol Biol2002,316:247–56.View ArticlePubMed
              43. Acton TB, Zhong H, Vershon AK:DNA-binding specificity of Mcm1: operator mutations that alter DNA-bending and transcriptional activities by a MADS box protein. Mol Cell Biol1997,17:1881–9.PubMed


              © Nagaraj et al. 2004

              This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://​creativecommons.​org/​licenses/​by/​2.​0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.