Combined scoring of genes from microarray data and mutational analysis
We define a scoring algorithm that ranks gene expression patterns. For gene g, the score is given in terms of the expression in different types of cells (a, α and a/α)
Score(g) = sgn((X
a
(g) + X
α
(g))/2 - X
a/α
(g)) [(X
a
(g)) + X
α
(g))/2 - X
a/α
(g)]2 - A(X
a
(g) - X
α
(g))2.
We initially used the logarithm of expression level of the gene g for the three cell types for the variables X
t(g) with t indicating the type. We have since found that using the complete expression data for the polyploids is a better strategy. In the final results, shown in the paper, X
a
(g) is the average of the log expression for a, aa and aaa. Likewise, X
α
(g) is the average from α, αα and ααα. X
a/α
(g) comes from averaging log expression over aa α, a αα and aa αα. The polyploid averaged quantities tend to be less noisy (demonstrated, for example, by the quantities X
a
and X
α
being close to each other for the generic gene, which is not regulated by cell type. This, in turn, allows easier detection of genuine haploid-specific targets.
An explanation of the purpose served by different terms in the overall score is described below. The first term scores well when expression in diploids is lower than the average expression in haploids. The second term penalizes the gene if the expressions in different types of haploids are very different. A is chosen to be large enough so that known a-specific genes and α-specific genes score worse than known haploid-specific genes, but not large enough to overwhelm the first term. The optimal A is about 10. Comparison of the performance of our algorithm for A = 1 and A = 10, shows that the biologically known sites almost always stay near the top but further down in the list the second choice is better. The exception is a special gene: MATα 1. Since MATα 1 is not present in the MAT a type cell, there is a penalty for expression patterns. Cumulative probability for any gene to have higher score than a gene g is P
exprn
(g), namely, fraction of genes with score higher than g.
This scoring function would rank haploid-specific genes high but may not select out genes that are directly regulated by the a 1-α 2 repressor complex. In order to select for genes with an upstream region with a strong a 1-α 2 repressor, we used the binding site mutational data available [9]. In these experiments, repression of a heterologous promoter, incorporating single site mutations of a consensus binding site of the a 1-α 2 repressor, was measured. Under the assumption that the degree of repression is inversely proportional to how often that site is occupied, we derived the expression: 1/Repression ∝ [1 - 1/(1 + eβE(S)/z)] ≈ eβE(S)/z assuming near saturation of binding. The symbol z represents the fugacity and β is inverse of k
B
T, k
B
being the boltzman constant. The binding (free) energy is given by E(S) = Σ
ib
ε
ib
S
ib
, within the single base model [15, 16]. The index i runs over the positions in the motif and b runs over the bases A, C, G, T.S
ib
is 1 or 0 depending upon whether the i- th base is b or not. The parameters ε
ib
represent effects of single base changes on the binding (free) energy. They are related to the weight matrix parameters [17, 18] widely used to characterized variable motifs. Note that eβE(S)/z would more commonly be represented as (K/ [Protein])•exp(ΔG(S)/RT) in the biochemistry literature [18, 39]. The independent base model is only an approximation and mutations in nearest neighbor sites could produce effects that we cannot estimate from the existing data. There is a better separation between well-known sites and generic sequences if an extra penalty is added to the score for neighboring base pairs which both differ from the consensus. In this way every base different from consensus and neighboring another base different from consensus draws an additional penalty to the binding energy score. This parameter was set to be ln(2), by experience. Although this method prevented many false positives, it also penalized a few genuine candidates, such as the binding sites in the promoters of FAR1 and MATα 1. Thus, from the effect of the single base mutations, we estimated the parameters ε
ib
. Armed with these parameters, we found the probability P(E|ε, L) that a random sequence of a certain length, L, would have a subsequence of binding energy greater than E. For gene g, the strongest site in the upstream region of length L would have binding energy E
g
. Low values of P
binding
(g) = P(E
g
|ε,L), indicated the presence of a good binding site.
The genes were ordered according to the lowest value of a combined p-value, P
exprn
(g) P
binding
(g), and then ranked as candidates for haploid-specific genes that are directly repressed by the a 1-α 2 complex. One of the issues in such studies is how to decide on how many of the top candidates are significant. This problem occurs even in solely, sequence based analysis as well [40]. In our study, we generated random combinations of expression p-values with scrambled binding p-values, so that we could choose a cutoff threshold by comparing the ordered p-values with the ordered "scrambled" p-values. Figure 5 plots these sorted p-values against average (log) sorted p-values for the random combinations. The comparison suggests that only about 10–15 top candidates in the list are significant.
Weight matrix based search
A weight matrix [15–18] search for binding sites was performed using a set of known sites [9] to construct the matrix. Each matrix entry, w
ia
, was set to log((f
ia
+ δ)/(P
a
+ δ)), where f
ia
is the frequency with which base a appears in the i th position in the known sites, P
a
is the frequency with which base a appears in the promoters of genes and δ is a small number added to ensure that the weight matrix score is finite even when f
ia
= 0. Each subsequence, S
ia
, of length 20 in each promoter was assigned a weight matrix score Σ
ia
S
ia
w
ia
. After a threshold score is chosen, sites scoring above that threshold are declared to be binding sites.
Automated primer generation
An automated procedure for generating primers flanking a specified site in the genome sequence, σ, was implemented. To each pair of numbers, d
u
, and d
d
, representing primer distances upstream and downstream of the candidate binding site respectively, and primer lengths l
u
, and l
d
, a score is assigned via
S(d
u
, d
d
, l
u
, l
d
,σ) = - Σ
a
k
a
(P
a
(d
u
, d
d
, l
u
, l
d
,σ) -
)2
where the P
a
are functions including the melting temperatures, distances of upstream and downstream primers from the candidate site, total length of the bound region, and the difference between the primer melting temperatures. The
s are the preferred values of those functions. The k
a
are adjusted to reflect the relative importance of the parameters; for example it is more important that the difference in melting temperatures be close to zero than it is that the distance to the upstream primer match the distance to the downstream primer.
Values of d
u
, d
d
, l
u
and l
d
are restricted to those whose corresponding primers have GG, GC, CG, or CC at the end nearest the candidate site. Primers are identified by selecting the values of d
u
, d
d
, l
u
and l
d
which maximize S.
[A web-based interface to this algorithm is available at http://hill-226-174.rutgers.edu/]
Chromatin immunoprecipitation
Chromatin immunoprecipitation (ChIP) was carried out as described previously [41] with the following modifications. One liter of JRY103 (MAT α /MAT a ade2-1/ADE2 HIS3/his3-11,15 leu2-3,112/leu2-3,112 trp1-1/trp1-1 ura3-1/ura3-1 ash1Δ::LEU2/ash1Δ::LEU2) and JRY118 (MAT α /mat a Δ::TRP1 ade2-1/ADE2 HIS3/his3-11,15 leu2-3,112/leu2-3,112 trp1-1/trp1-1 ura3-1/ura3-1 ash1Δ::LEU2/ash1Δ::LEU2) cultures were grown to an A600 of 0.5 and treated with 1% formaldehyde for 20 min at RT on a rotating shaker at low speed. Cells were collected, washed 2X with cold 1XTBS. Equal volumes of cells were aliquoted into ten 1.5 ml microfuge tubes, washed once with 1.5 ml of cold 1X TBS. The pellets in each tube were resuspended with 400 μ l of lysis buffer (50 mM HEPES, pH 7.4, 150 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% Na-Deoxycholate) plus 1 mM PMSF, 1 mM benzamidine, and 1X Protease inhibitor cocktail from Roche (Cat No. 1873580) and also manufacturer recommended concentration of protease inhibitor cocktail from SIGMA (Cat No., P 8215). To this 200 μ l of glass beads were added to each tube and lysed using a multitube vortexer at full speed for 30 min at 4°C. The lysate was transferred in a new tube and 400 μ l of lysis buffer was added and vortexed briefly. The lysates were centrifuged at 12,000 g for 10 min at 4°C and the supernatants were sonicated at 30% output for four 10 sec cycles with intermittent cooling on ice.
The lysates were cleared by centrifugation at 12,000 g for 10 min and 1 mM PMSF was added to the samples. A 1/10th volume aliquot was removed and frozen to be used as total chromatin control. The remaining sample was precleared by the addition of 25 μ l recombinant protein G-agarose beads, incubated while nutating for 30 min and the supernatant was collected after centrifugation at 12,000 for 5 min. 1 μ l of rabbit anti-α 2 antiserum (a gift from A. Johnson, UCSF) was added to each supernatant of the samples and incubated 12 h on a nutator at 4°C. To immunopreciptiate α 2 50 μ l of recombinant protein G agarose beads (Roche) was added to the samples and nutated for 90 minutes at 4°C. The protein G beads were pelleted, washed once in low salt buffer (0.1%SDS, 1% Triton X-100, 20 mM Tris pH8.0, 2 mM EDTA and 150 mM NaCl), once in high salt (composition same as lowsalt + 500 mM NaCl), once in LiCl buffer (0.25 M LiCl, 1% IGEPAL, 1XTE and 1% Na-Deoxycholate) and twice with 1XTE (pH8.0). The immunoprecipitated DNA was eluted twice with 250 μ l of elution buffer (1%SDS and 0.1 M NaHCO3) and the eluates were pooled (500 μ l final volume). To this 20 μ l of 5 M NaCl was added and incubated 12 h at 65°C. To remove the crosslinks, 10 μ l of 0.5 M EDTA, 20 μ l of 1 M Tris-HCl, pH 7.5 and 2 μ l of proteinase K (10 mg/ml) was added and incubated for 45 minutes at 45°C. The DNA samples were extracted once with Phenol:chloroform:Isoamylalcohol and the DNA was ethanol precipitated, washed once with 70% ethanol and resuspended in 50 μ l (IP) or 500 μ l (TC) TE.
Purified DNA from the immunoprecipitated samples was subjected to multiplex PCR amplification with primers specific for the STE6 promoter as a positive control for the immunoprecipitation of α 2 and the YDL223C ORF as a negative control for nonspecific immunoprecipitation, along with the specific primers for candidate α 2-a 1 target sites. PCRs were carried out in 50 μ l containing 10 pmols of each primer, 0.2 mM dNTPs, 2 mM MgCl2, 1X Eppendorf Taq buffer, 0.5X Taq Master buffer and 2.5 U of Eppendorf Taq polymerase. The amplifications were carried out at 94°C for 1 min and 30 secs, followed by 25 cycles of 94°C for 30 secs, 52°C for 1 min, and 72°C for 30 secs and a final extension step of 7 min at 72°C. The PCR products were separated on 2.5% agarose gels.
Electrophoretic mobility shift assays
Oligonucleotides containing the predicted a 1-α 2 binding sites from within the ORFs of URB1, PRM8, PRM9, YKL162C and CDC25 and the promoters of COX13, REX2, LSM1, and FMP14 were synthesized, one strand was end-labeled with [γ-32P]-ATP, and then annealed with excess cold complementary oligonucleotide. The HO(10) and HO(8) a 1-α 2 sites within Upstream Regulatory Sequence 1 (URS1) of the HO promoter were used as strong and weak binding sites respectively. The EMSA was performed as described previously [42], using a constant 1.4 μ M a 1 and five-fold titrations of α 2 starting at 82 nM in protein dilution buffer (50 mM Tris pH 7.6. 1 mM EDTA, 500 mM NaCl, 10 mM 2-mercaptoethanol, 10 mg/ml bovine serum albumin).
β-galactosidase assays
Oligonucleotides containing a 1-α 2 binding sites were synthesized with 5' overhangs to allow cloning into the Xho I site of pTBA23 (2μ URA3 Ampr), a reporter plasmid containing a CYC1-lacZ fusion [43]. Reporter constructs were transformed into JRY103 and JRY118 and the β-galactosidase activity was measured on three independent transformants, as described previously [14].