- Research article
Genome-wide analysis of alternative promoters of human genes using a custom promoter tiling array
BMC Genomics volume 9, Article number: 349 (2008)
Independent lines of evidence suggested that a large fraction of human genes possess multiple promoters driving gene expression from distinct transcription start sites. Understanding which promoter is employed in which cellular context is required to unravel gene regulatory networks within the cell.
We have developed a custom microarray platform that tiles roughly 35,000 alternative putative promoters from nearly 7,000 genes in the human genome. To demonstrate the utility of this array platform, we have analyzed the patterns of promoter usage in 17β-estradiol (E2)-treated and untreated MCF7 cells and show widespread usage of alternative promoters. Most intriguingly, we show that the downstream promoter in E2-sensitive multiple promoter genes tends to be very close to the 3'-terminus of the gene, suggesting exotic mechanisms of expression regulation in these genes.
The usage of alternative promoters greatly multiplies the transcriptional complexity available within the human genome. The fact that many of these promoters are incapable of driving the synthesis of a meaningful protein-encoding transcript further complicates the story.
The regulation of human gene expression is known to be an extraordinarily complex process, including transcription, mRNA processing, mRNA transport, mRNA stability, mRNA translation, protein modification and protein stability. Nevertheless, the picture that has emerged over the past two to three decades is one in which transcription itself is highly regulated, and one could easily believe that the combinatorial interaction of multiple transcription factors within the gene promoter is sufficient to explain this complexity. However, genes with more than one promoter have been known for some time, and recent studies using independent lines of evidence have suggested that a large proportion of the human genome is transcribed from both strands and that numerous human genes have more than one promoter, allowing gene transcription in different cellular conditions [4–7]. As summarized in Figure 1A, alternative promoters can take many different forms, producing a wide variety of transcripts and proteins from a single gene locus. Moreover, the use of alternative transcription initiation sites also affects the splicing pattern of downstream exons, creating a variety of different transcripts and protein products. Needless to say, these various promoters greatly increase the regulatory control that the cell has over the expression of the gene.
Alternative promoters are of particular interest because their aberrant expression has been linked to a number of diseases, particularly cancer. There are a number of experimentally well-characterized multiple promoters for known genes, for example TP53, MYC, CYP19A1, BRCA1, P73, MID1, CTSB, SRC, KLK6 and TGFB3, to name a few. CYP19A1 is a well-characterized example that has five known alternative promoters, many of which are separated by more than 10 kb and are therefore regulated by completely non-overlapping promoter regions. Alternative first exons Ex-1.1, Ex-1.3/Ex-1.4, and Ex-1f splice with Ex-2 to encode the 5' untranslated regions (UTRs) of CYP19A1 mRNA in the placenta, adipose tissue, and brain, respectively. Additionally, in gonads, transcription starts just 39 bp upstream of the translation initiation codon in exon-2. The use of alternative non-coding first exons in the CYP19A1 transcripts does not alter the protein sequence, as the different 5' UTRs splice into a common second exon (exon-2) that contains the translation initiation codon. It is known that these various promoters are used in a tissue-dependent manner, but the promoter upstream of exon Ex-1.4 is aberrantly expressed in breast cancer tissue, aggravating the disease.
Many putative gene promoters have been identified either through mapping of expressed sequence tags (ESTs) to the genome (Acembly, ECGene), through sequence conservation studies with other organisms, or through de novo computational prediction (e.g., FirstEF, DragonGSF). Databases such as MPromDb and H-DBAS provide information about well-curated promoters and alternatively spliced transcripts identified by aligning completely sequenced and precisely annotated full-length cDNAs. Recently, intensive efforts have been invested in establishing genome-wide profiling methods to identify regulatory regions, including alternative transcription start sites and the upstream promoter regions, in the human and mouse genomes. So far, three approaches have been applied for this purpose. The first is based on the decreased nucleosome occupancy and increased DNase sensitivity of active promoter regions. Two methods, called DNase-chip and DNase-array, have been created to detect transcribed promoters and transcripts in this way [28, 29]. The second is cap analysis gene expression (CAGE), which combines a full-length cDNA library with SAGE technology to screen the 5' ends of transcripts. The third uses ChIP-chip (Chromatin ImmunoPrecipitation followed by microarray analysis) to profile the binding positions of the RNA polymerase II preinitiation complex. The data from these studies provide evidence of large-scale alternative splicing and widespread use of alternative promoters throughout mammalian genomes. Most of these methods cannot predict the mRNA sequence produced from a given promoter, and therefore constructing a traditional cDNA microarray to detect its expression is impossible. Moreover, two promoters may produce mRNA isoforms that are nearly indistinguishable, again making expression microarrays difficult to design. One alternative is to use ChIP-chip to detect the binding of RNA polymerase II to the genome.
Although there is evidence that the presence of RNA polymerase II in the promoter does not perfectly correlate with active transcription , there does exist a correlation between the two events and therefore RNA polymerase II binding is a good approximation of transcriptional activity [33, 34]. Here, we have taken an intermediate approach, where we first annotated all possible putative promoters in the human genome by integrative bioinformatics analyses. Using these annotations, we designed 60-mer probes complementary to sequences and tiling the core promoter regions (both known and putative) of a subset of genes that have at least two annotated promoters. We tested this array by conducting ChIP-chip using antibody against RNA polymerase II (RNA Pol II) in MCF7 cells without and with E2 treatment. It is well known that estrogen receptor can act both as an activator and repressor of specific target genes, and that these events can then affect cell division and breast cancer progression [35, 36]. Knowledge about which of the alternative promoters of the ER regulated genes are active and inactive in E2 treated and untreated conditions in MCF7 cells would lead to better understanding of their effects in breast cancer development. Several novel putative promoters were found to be active before and after E2 treatment. Interestingly, we found that in genes with more than one putative promoter, downstream promoters are much more likely to be affected by E2 treatment than upstream promoters, suggesting interesting mechanisms of gene regulation in multiple promoter genes.
Alternative promoter array
In order to design an alternative promoter array, we first used a bioinformatics approach to annotate all known and putative promoters in the human genome. Using evidence from three sources: UCSC Known Genes , FirstEF , and Riken CAGE tags , we found evidence for more than 185,000 transcription start sites separated by 500 bases or more in the human genome. We took a gene-centric approach to our microarray design, choosing genes that had two or more known or putative promoters. In the end, about 34,000 known or putative promoters were selected for our array, covering about 7,000 genes. The median number of promoters per gene is three (Figure 1B). 60 mer oligonucleotide probes were designed to tile a region -200 to +200 surrounding each known and putative transcription start site. Because of limitations on probe design, not all regions could be effectively covered but on average the spacing is approximately 80 bases from the end of one probe to the beginning of the next.
Genome-wide profile of potential promoter usage
In order to identify potential active promoters, we conducted ChIP-chip with antibody against RNA Pol II in MCF-7 cell lines with and without E2 treatment for 3 hours, as described in the Methods. The amplified immunoprecipitated DNA and input control, labeled with Cy5 and Cy3 fluorescent dyes respectively, were used to probe the alternative promoter microarray (Figure 2A). Each experiment was repeated once to determine the reproducibility of the probe hybridization intensities. After filtering out low-quality spots, we performed intensity-dependent Lowess normalization. The MA plot for normalized data is shown in Figure 2B for one control (before E2 treatment) experiment. We then plotted the distribution of the normalized log ratios of red and green intensities. The histogram in Figure 2C presents the log ratios for one control experiment, which shows a clear bimodal distribution. The distribution with mode close to zero represents the probes that are non-responsive and the distribution with mode close to 2.5 represents the probes of responsive promoters. The Expectation Maximization (EM) algorithm of Khalili et al. was modified from the original Gamma-Normal-Gamma fit to a simple Gamma-Normal fit that appeared to be more appropriate for our data. The algorithm clearly defines two distinct distributions in Figure 2C, representing the unbound probes (in red) and the bound probes (in green). See Additional File 1 for the MA plots and log ratio distributions of data from other experiments. A nice feature of the algorithm is that probes can be assigned to each distribution with a certain probability, allowing us to increase or reduce the stringency of our assignments easily. We defined strong candidates for RNA Polymerase II activity as those probes that fell within the green distribution with a p-value of at most 0.05. 
However, we also defined a second, weaker condition: those probes that are not significantly part of the larger unbound (red) distribution at a p-value of 0.05. This latter group would encompass the "grey area" that lies between the two distributions. The "best" probe from each promoter was used to evaluate the activity of the promoter as a whole. Figure 2D shows the proportion of active promoters in MCF7 cells at different quality thresholds. At least 65% of the promoters (both putative and known) are inactive in this cell line, whereas ~17% of the promoters have strong evidence for being active. This is roughly in accordance with previous genome-wide studies of promoter activity. For example, Kim et al. found ~9,300 active promoters in IMR90 cells, which corresponds to ~23% of the unique annotated transcription start sites in the UCSC known genes. When we map these promoters back to genes, we find that 3,210 genes had at least one promoter active in at least one experimental condition, out of a total of 6,500 genes for which we were able to recover data – roughly 50%.
We validated a total of 18 promoters: 10 promoters that we predicted to be active with high confidence and 8 promoters that were predicted to be inactive in MCF7 cells. ChIP-PCR experiments showed that these predictions were for the most part accurate (Figure 3) – seven out of the ten positive targets from the microarray analysis were confirmed to be bound by RNA polymerase II. Similarly, all but one of the negative samples showed no evidence of RNA polymerase II binding. Although the binding of RNA polymerase II to the promoter region need not correlate with gene expression because of posttranscriptional events, we find that a rough correspondence does exist. For example, two promoters in the gene NCOA7 were shown to bind to RNA polymerase II with a "low" level of confidence, although in the absence of E2 the upstream promoter was predicted to be "strongly off" (Figure 4A). These qualitative results were verified by quantitative reverse transcriptase-polymerase chain reaction (qRT-PCR) (Figures 4B and 4C). By comparing these results to the gene EIF3S9, whose most upstream promoter was "highly on" in both treatments (Figure 5A), we found that the qRT-PCR experiments showed a correspondingly high level of expression of the corresponding gene isoform (Figure 5B).
Alternative promoters and CpG islands
Wang et al.  recently noted that the 5'-most promoter of a gene tends to be CpG related, while more downstream promoters are less likely to be. We identified promoters that were active in one or both of our treatments, and classified them as either being associated with the 5'-end of the gene (if the promoter was located < 500 bases of the 5' end of the gene's annotation) or downstream promoters (> 500 bases away from the 5' end of the gene). Similar to the findings of Wang et al., we found that 92% of all 5'-end promoters are associated with a CpG island, whereas only 23% of downstream promoters are.
Identification of novel promoters
As shown in Table 1, each promoter on the array is supported by different lines of evidence. The most common promoters are those that are supported by multiple CAGE tags. However, only 14% of the 18,902 promoters on the array that are supported only by CAGE tags were found to be active at "high" or "medium" confidence levels. Of course, it is important to note that a negative result does not necessarily indicate an inaccurate promoter prediction; these promoters may be active in different cell types, or under different environmental conditions. Therefore, these numbers should be seen as a lower limit. By far the greatest concordance was found for CpG-related promoters that are supported by all lines of evidence (UCSC Known Genes, CAGE tags, and FirstEF predictions); of these, 68% were found to be active. The data also indicate that the CpG-related promoters that are supported by both CAGE tags and FirstEF predictions enjoy a higher rate of success than those promoters that are exclusively supported by either CAGE tags or FirstEF predictions. 16% of non-CpG-related promoters in this category were found to be active, while an impressive 39% of CpG-related promoters supported by CAGE and FirstEF results were found to be active. In all, if we consider all promoters not supported by KnownGenes to be "novel", then out of 20,879 promoters, 3,172 (15%) were active in at least one treatment. If we eliminate promoters supported only by CAGE tags, then 601 out of 1,977 promoters (30%) are found to be active. Of the ten genes selected for validation in Figure 4, eight fall into the novel category (i.e., no mRNA evidence) and six of these were confirmed (see Table 2). These surprising results indicate that large numbers of undiscovered, unannotated promoters exist within human genes. 
Notably, we have discovered 303 new and active promoters that are situated more than 500 bases upstream of the currently defined 5' end of the gene, suggesting that a significant fraction of the current gene annotations may not be 5'-complete. One of these promoters was upstream of SOX12, and was verified to bind to RNA polymerase II (Figure 4). These results also strongly support the recent reports of a high frequency of alternative promoters in mammalian genomes [41, 42]. In addition, the complicated distribution patterns of these alternative promoters might easily have been overlooked by previous expression array analyses.
Differential use of multiple promoters with estrogen stimulation
Our hypothesis was that treatment with E2 affects the promoter activity of a subset of genes in the genome. For this analysis, we defined "active" as promoters with "high", "medium" or "low" confidence. For the subset of genes that have a single active promoter, we found that 2,697 promoters were active in both E2- and E2+ treated conditions (see Additional File 2), whereas only 178 promoters were inactivated and 77 promoters were activated by E2. This bias towards inactivation is highly significant (p = 2.5e-10 in a chi-squared test), indicating that more promoters are inactivated by E2 than are activated, which supports the previous report about estrogen-mediated early down-regulated genes. Some of the genes associated with these promoters have previously been identified as being estrogen sensitive, such as GREB1, HSPB8, and WFS1 (see Additional Files 2 and 3 for a complete list). We next considered those genes that have two active alternative promoters and checked for the differential activation or inactivation of the promoters. We found 993 genes with both promoters active and not affected by E2 treatment (see Additional File 3). More interesting are the cases where one promoter is affected by E2 treatment. The upstream promoters of 25 such genes are activated by E2 (Figure 6A; also see the gene NCOA7 in Figure 4), whereas in 61 genes the upstream promoter is inactivated by E2 (Figure 6B) – a more than 2:1 bias in favor of inactivation, which is quite similar to what we found in the single active promoter gene case, and also significant (p = 0.000175 in a chi-squared test). Curiously, this same bias is not present when we examined the downstream promoters, where we found 62 were activated by E2 (Figure 6C) and 64 were inactivated by E2 (Figure 6D).
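The text does not say which form of chi-squared test was used for the activation versus inactivation counts. As a minimal sketch, assuming a one-degree-of-freedom goodness-of-fit test against an equal split (our assumption, not a detail given in the paper), the calculation for the single-promoter counts (178 inactivated, 77 activated) could look like this. The function name `equal_split_chi2` is hypothetical.

```python
import math


def equal_split_chi2(count_a: int, count_b: int) -> tuple[float, float]:
    """Chi-squared goodness-of-fit of two counts against a 50/50 split.

    Assumption: the null hypothesis is that both outcomes are equally
    likely, so each count is expected to be half of the total.
    Returns (chi-squared statistic, p-value) with 1 degree of freedom.
    """
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts quoted in the text for genes with a single active promoter.
chi2, p = equal_split_chi2(178, 77)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

Under this assumption the p-value for 178 versus 77 comes out on the order of 1e-10, in line with the value reported in the text; a different test design (for example, a contingency table with continuity correction) would give somewhat different numbers.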
In terms of the overall differential usage (either activation or inactivation) of alternative promoters due to E2 treatment, we found that the downstream promoters are more often affected by E2 treatment than the upstream promoter. We found that there were a total of 127 downstream promoters affected by E2 treatment, while only 87 upstream promoters were affected – a significant bias (p = 0.00625 in a chi-squared test). These intriguing patterns provide some insight into the regulatory control of genes and their isoforms by E2. To investigate this phenomenon further, we examined the locations of active promoters within each gene. As shown in Figure 7A, for genes with a single active promoter that is insensitive to E2 treatment there is a strong tendency for that promoter to be located at the 5' end of the annotated gene. Similar trends are observed in genes with two active promoters that are insensitive to E2 treatment, where the upstream promoter is again located near the 5' end of the gene, while the location of the downstream promoter is uniformly distributed throughout the length of the gene (Figure 7B). However, a surprising change is observed if one of the promoters is E2-sensitive, where we found that there was a very strong tendency for the downstream promoter to be close to the 3' end of the gene (Figure 7C). In keeping with our finding that downstream promoters tend to not be associated with CpG islands (in contrast with promoters at the 5'-end of the gene), E2-sensitive promoters are overall less likely to be associated with CpG islands than active promoters taken as a whole: 50% of all active promoters are CpG-related, while only 37% of E2-sensitive promoters are (p = 1.2e-11 in a Fisher's Exact test).
Although genome tiling arrays are increasingly becoming a viable alternative to focused microarrays, they remain significantly more expensive than focused microarrays, and their signal-to-noise ratio is very low due to the large numbers of inactive probes and lack of probe design considerations. Another mechanism for studying alternative promoters is the use of traditional expression arrays that have been designed to specifically interrogate particular gene isoforms. Unfortunately, in a large number of cases, mRNA isoforms are not known for putative promoters, and many isoforms that originate at different promoters differ only in the first exon – a small percentage of the entire molecule, making it difficult to distinguish between the various isoforms. High-throughput sequencing techniques are a recent advance that provides an attractive alternative to microarray-based techniques; however, there is evidence that ChIP-chip is more sensitive than ChIP-sequencing techniques.
Traditionally, expression analysis was used to define promoter activity. However, one recent report has found that a number of genes experience transcription initiation but show no detectable full-length transcripts. Nevertheless, additional reports indicate that a correlation does exist between Pol II occupancy and gene activation [33, 34]. The findings of Guenther et al. may be explained by post-transcriptional regulation, but in any case we believe that the presence of Pol II in the promoter is a good approximation of promoter activity, although further experiments are still necessary to define and characterize this relationship.
Here, we have presented a novel 244 k microarray that is capable of measuring alternative promoter usage in over 34,000 putative promoters from nearly 7,000 genes. This platform is suitable for indirect expression analyses using RNA polymerase II ChIP-chip as we have shown in this paper, but it is also suitable for methylation based studies using DMH or meDIP experiments (since more than 5000 of the putative promoters fall within CpG islands), or for ChIP-chip experiments using other proteins of interest, such as transcription factors or histone modification signatures. We have demonstrated clear evidence for alternative promoter activities within genes, including the verification of a number of putative promoters. These results suggest that a large fraction of genes in the human genome possess undiscovered promoters and transcription start sites, which agrees with findings based on the mapping of ESTs to the genome [20, 21], and the mapping of 5' oligo cap cDNA libraries to the genome .
Most intriguingly, we discovered that there is a distinct bias for the downstream promoter in E2-sensitive two-promoter genes to be very close to the 3' end of the gene, whereas no such bias exists in E2-insensitive genes. These promoters are very unlikely to produce a functional transcript of any sort, and we therefore speculate that their purpose is merely to regulate the expression of the transcript initiated at the upstream promoter by "blocking" the progression of the RNA polymerase II complex. This "stalling" mechanism has been observed in other contexts. For example, inhibiting DNA replication was recently found to cause RNA polymerase II to stall during the transcription of p21. Similarly, the cofactor of BRCA1 (COBRA1) is known to cause stalling of the RNA polymerase II complex proximal to the promoter [48, 49]. However, we can think of no reason for "blocking" promoters to have a bias towards the 3' end of the gene, since this blocking action could be realized at any point relative to the primary promoter. An alternative possibility is that promoters near the 3' end of the gene are driving expression of an interfering RNA that is either antisense to the primary transcript or capable of inhibiting the formation and progression of the RNA polymerase II complex at the primary promoter. Such noncoding, interfering RNAs are known to regulate expression of the DHFR gene in humans, for example, although in this case the interfering RNA is transcribed from a promoter that lies upstream of the primary promoter [51, 52]. Much more work will need to be performed in the future to identify the regulatory action that these 3'-UTR promoters have on their primary transcripts, if any.
We have demonstrated clear evidence of alternative promoter activity for approximately 7,000 human genes, using a 244 K custom microarray that spans 34,000 putative promoters. Our results suggest that a large fraction of genes in the human genome possess undiscovered alternative promoters, which agrees with findings based on the mapping of ESTs and CAGE tags to the human genome. We found that significantly more downstream promoters than upstream promoters were affected by E2 treatment. In addition, there is a distinct bias for the downstream promoter in E2-sensitive two-promoter genes to be very close to the 3' end of the gene, whereas no such bias exists in E2-insensitive genes. The custom microarray can also be used for epigenome analyses, such as methylation-based studies using DMH or meDIP experiments. The present data will aid the discovery of novel promoters and the ongoing annotation of alternative promoters of human genes in different experimental conditions.
We considered three sources of evidence for identifying promoter targets for our microarray. The first was the 5'-end of genes as identified in the UCSC Known Gene track, which is largely based on the alignment of RefSeq mRNAs to the human genome . A second line of evidence was the database of CAGE tags sequenced by the Riken group . These tags capture ~20 bases at the 5' end of messenger RNAs, and have been mapped back to the human genome. We used the UCSC LiftOver tool to convert Riken's hg17 human genome coordinates to the more recent hg18 genome. Our final line of evidence was ab initio promoter predictions generated by the FirstEF program .
Each line of evidence identifies a transcription start site (TSS). We considered TSSs separated by > 500 bp to be distinct promoters – a commonly used criterion. Although there are undoubtedly transcription factor binding sites that extend beyond this region, this distance is great enough for the core promoters of each TSS to be distinct , and we can therefore consider these TSSs to be independently regulated to a large extent. TSSs were clustered using a neighbor-joining algorithm  until all clusters were separated by at least 500 bases. The coordinates of these clusters were then extended 200 bases up- and downstream.
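The clustering step can be illustrated with a short sketch. It does not reproduce the neighbor-joining procedure cited above; instead it uses a simpler sorted sweep that merges any TSSs closer than 500 bases, which yields clusters satisfying the same separation rule on a single chromosome and strand. The function name `cluster_tss` and its parameters are hypothetical, and coordinates are plain integers for illustration only.

```python
from typing import List, Tuple


def cluster_tss(
    positions: List[int], min_gap: int = 500, flank: int = 200
) -> List[Tuple[int, int]]:
    """Group TSS coordinates into promoter regions.

    Assumptions (not from the paper): all positions lie on one chromosome
    and strand, and clusters are formed by a simple sorted sweep rather
    than neighbor-joining. TSSs closer than `min_gap` are merged; each
    cluster is then extended by `flank` bases on both sides.
    """
    clusters: List[Tuple[int, int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][1] < min_gap:
            # Too close to the previous cluster: absorb this TSS into it.
            clusters[-1] = (clusters[-1][0], pos)
        else:
            # Far enough away: start a new cluster.
            clusters.append((pos, pos))
    # Extend each cluster up- and downstream, never below coordinate 0.
    return [(max(0, start - flank), end + flank) for start, end in clusters]


print(cluster_tss([1000, 1200, 1900, 5000]))
# [(800, 1400), (1700, 2100), (4800, 5200)]
```

Here the TSSs at 1000 and 1200 merge into one promoter region, while 1900 and 5000 remain separate because each lies at least 500 bases from its neighbor.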
Each promoter region was aligned to the genome using BLAT  in order to discover regions that are not unique. Alignments that were longer than 55 bases (90% of the probe length) were masked, as were 60 mers within the sequence that had > 85% or < 50% G+C. From the remaining unmasked regions of each promoter, probes were selected such that the average spacing would be roughly 100 bases, but that the spacing between two successive probes would be no more than 300 bases. In the end, the true average spacing is 80 bases.
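The G+C masking rule lends itself to a small example. The sketch below, with hypothetical function names, marks which 60-mer windows of a promoter sequence fall inside the allowed G+C range (50% to 85% inclusive). It covers only the composition filter; the BLAT-based uniqueness masking and the spacing constraints described above are not modeled.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(probe: str, low: float = 0.50, high: float = 0.85) -> bool:
    """True if the probe is NOT masked, i.e. its G+C is within [low, high]."""
    return low <= gc_fraction(probe) <= high


def candidate_probe_starts(region: str, probe_len: int = 60) -> List[int]:
    """Start offsets of all probe-length windows that pass the G+C filter."""
    return [
        start
        for start in range(len(region) - probe_len + 1)
        if passes_gc_filter(region[start:start + probe_len])
    ]


# A toy 64-base region with exactly 50% G+C in every 60-mer window.
print(candidate_probe_starts("GCAT" * 16))
# [0, 1, 2, 3, 4]
```

A real design would then pick a subset of these offsets to meet the target spacing; that selection step is left out here because the paper does not spell out the exact procedure.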
Gene and promoter selection
Not all genes could be put on the array, so to prioritize we assigned each gene a score. Three points were awarded for each promoter supported by "known gene" evidence, two points for those supported by CAGE tag evidence, and one point for FirstEF evidence. Genes were then ranked by their total score, and only the best-scoring genes were included on the array. In the end, the roughly 244,000 probes cover 34,486 promoter regions from 6,949 genes, with a median tiling coverage of 5 probes per promoter. The median number of promoters per gene on the array is 3, although the range is from 1 to over 30 (Figure 1B).
MCF-7 human breast cancer cells (American Type Culture Collection, Manassas, VA) were maintained in growth medium (MEM with 2 mM L-glutamine, 0.1 mM non-essential amino acids, 50 units/ml penicillin, 50 μg/ml streptomycin, 6 ng/ml insulin, and 10% FBS) as described by Fan et al. Prior to all experiments, cells were cultured in hormone-free basal medium (phenol-red free MEM with 2 mM L-glutamine, 0.1 mM non-essential amino acids, 50 units/ml penicillin, 50 μg/ml streptomycin, and 3% charcoal-dextran stripped FBS) for three days.
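The scoring scheme can be sketched as follows. The point values come from the text, but the handling of a promoter supported by more than one evidence type is not specified; this sketch assumes the points are summed, which is our assumption. The evidence labels and function names are hypothetical.

```python
from typing import Dict, List, Set

# Point values stated in the text; the dictionary keys are illustrative labels.
EVIDENCE_POINTS = {"known_gene": 3, "cage": 2, "firstef": 1}


def gene_score(promoters: List[Set[str]]) -> int:
    """Total score for a gene, given the evidence set of each promoter.

    Assumption: a promoter supported by several evidence types earns the
    sum of their points.
    """
    return sum(EVIDENCE_POINTS[ev] for evidence in promoters for ev in evidence)


def rank_genes(genes: Dict[str, List[Set[str]]]) -> List[str]:
    """Gene names ordered from highest to lowest total score."""
    return sorted(genes, key=lambda name: gene_score(genes[name]), reverse=True)


genes = {
    "GENE_A": [{"known_gene", "cage"}, {"firstef"}],  # 3 + 2 + 1 = 6
    "GENE_B": [{"cage"}, {"cage"}],                   # 2 + 2 = 4
}
print(rank_genes(genes))
# ['GENE_A', 'GENE_B']
```

Selecting the genes for the array would then amount to taking names from the top of this ranking until the probe budget is used up.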
Chromatin immunoprecipitation on microarray (ChIP-chip) assay
Five million MCF-7 cells with and without E2 treatment (10 nM, 3 h) were crosslinked with 1% formaldehyde for 10 min, at which point 0.125 M glycine was used to stop the cross-linking. Chromatin immunoprecipitation was performed using a ChIP assay kit (Upstate Biotechnology, Charlottesville, VA) as described. The antibodies, which specifically target the initiation form of Pol II, were purchased from Santa Cruz Biotechnology (sc-899X, Santa Cruz, CA). Ligation-mediated PCR was applied to 20 ng of ChIP DNA and input control as described by Ren et al. Briefly, after cross-linking, cells were lysed and then sonication was used to shear the chromatin to fragments of around 500 bp. The cell lysate was then subjected to immunoprecipitation. After immunoprecipitation, part of the supernatant was removed from the lysate as input control. The primers used in ligation-mediated PCR were: oligo JW102, 5'-GCGGTGACCCGGGAGATCTGAATTC-3' and JW103, 5'-GAATTCAGATC-3'. Two μg of amplified ChIP DNA and input control were then labeled with Cy5 and Cy3 fluorescent dyes (Amersham, Buckinghamshire, UK) and cohybridized to the custom alternative promoter array. Technical duplication was performed for each sample of ChIP DNA. The slides were washed with three wash buffers (Buffer 1, 6× SSPE + 0.005% sarcosine; Buffer 2, 0.06× SSPE; Buffer 3, anti-oxidant mixture in acetonitrile purchased from Agilent) in series at room temperature.
Chromatin immunoprecipitation-quantitative polymerase chain reaction
ChIP was conducted in the same manner as in the ChIP-chip experiments, described above. The pooled DNA from ChIP and input control were first measured by spectrophotometer (NanoDrop, Wilmington, DE). Quantitative PCR with SYBR green-based detection (Applied Biosystems, Foster City, CA) was performed as described previously. In brief, primers were designed using Primer Express software (Applied Biosystems, Foster City, CA). Quantitative ChIP-PCR values were normalized against values from a standard curve (50 to 0.08 ng, R-squared > 0.99) constructed by input control with the same primer sets.
Quantitative reverse transcription-polymerase chain reaction
The Qiagen RNeasy kit (Valencia, CA) was used to extract total RNA from MCF-7 cells with or without E2 treatment according to the manufacturer's manual. Two μg of RNA was first treated with DNase I (Invitrogen, Carlsbad, CA) to remove potential DNA contamination and was then reverse transcribed with SuperScript II reverse transcriptase (Invitrogen, Carlsbad, CA). Quantitative RT-PCR was performed using SYBR green (Applied Biosystems, Foster City, CA) as a marker for DNA amplification on a 7500 Real-Time PCR System apparatus (Applied Biosystems, Foster City, CA). The relative mRNA level of a given locus was calculated by relative quantification of gene expression (Applied Biosystems, Foster City, CA) with glucose phosphate isomerase mRNA as an internal control.
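The text names only "relative quantification" against an internal control and does not give the formula. One widely used form is the comparative Ct calculation, shown below as a sketch; treating it as the method used here is an assumption, as is the implied amplification efficiency of 100% (a doubling per cycle). The function name and the Ct values in the example are hypothetical.

```python
def relative_expression(
    ct_target_treated: float,
    ct_reference_treated: float,
    ct_target_control: float,
    ct_reference_control: float,
) -> float:
    """Fold change of a target transcript by the comparative Ct approach.

    Assumptions: perfect doubling per PCR cycle, and a reference gene
    (the internal control, e.g. glucose phosphate isomerase) whose
    expression is unchanged between conditions.
    """
    delta_treated = ct_target_treated - ct_reference_treated
    delta_control = ct_target_control - ct_reference_control
    delta_delta_ct = delta_treated - delta_control
    return 2.0 ** (-delta_delta_ct)


# Illustrative Ct values only: the target crosses threshold two cycles
# earlier after treatment, while the reference is unchanged.
print(relative_expression(24.0, 18.0, 26.0, 18.0))
# 4.0
```

A two-cycle shift therefore corresponds to a four-fold increase under these assumptions; with measured primer efficiencies the base of the exponent would be adjusted accordingly.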
The washed slides were scanned by a GenePix 4000A scanner (Axon, Union City, CA) and the acquired microarray images were analyzed using GenePix 6.0 software. Briefly, the user-selectable laser power settings for Cy5 (635 nm, red) and Cy3 (532 nm, green) were adjusted so that the overall Cy5 to Cy3 ratios were close to 1 and that the signal intensities spanned the entire spectrum with minimal signal saturation at the high intensity range. When these conditions were satisfied, the microarray was scanned and a grid file was loaded to mark the general location of the scanned image. The GenePix 6.0 software performed a spot finding function and captured intensity-related information in a GPR file.
The complete array dataset can be viewed in the ArrayExpress microarray database (accession number E-MEXP-1644). GPR files were passed through a custom-built quality control filter which flagged all probes that didn't meet all of the following criteria in both the green and red channels: (1) % > B + 2SD greater than 30; (2) median – background > 0; (3) signal-to-noise ratio greater than 1.5. These filtered results were then normalized using the default parameters (plus Lowess normalization) in Agilent's Chip Analytics software version 1.3. A post-normalization MA plot is shown in Figure 2B. We then used a modified version of the mixture model (reducing their gamma+normal+gamma model to a simple gamma+normal model) described by Khalili et al  to classify probes into one of two groups: bound or not bound. Figure 2C shows the fit of the gamma+normal mixture model to our data. One benefit of this type of analysis is that we are able to directly estimate our false positive rates based on a probe's probability of assignment to the "unbound" distribution.
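The three quality-control criteria can be expressed compactly. The sketch below is not the custom filter used in the study; it simply restates the listed thresholds, applied to both channels, using hypothetical field names for the per-channel values that would be read from a GPR file.

```python
from dataclasses import dataclass


@dataclass
class ChannelStats:
    """Per-channel spot measurements (field names are illustrative)."""
    pct_above_bg_2sd: float  # "% > B + 2SD": % of pixels above background + 2 SD
    median: float            # median foreground intensity
    background: float        # local background intensity
    snr: float               # signal-to-noise ratio


def channel_passes(ch: ChannelStats) -> bool:
    """Apply the three criteria from the text to one channel."""
    return (
        ch.pct_above_bg_2sd > 30
        and ch.median - ch.background > 0
        and ch.snr > 1.5
    )


def spot_passes(red: ChannelStats, green: ChannelStats) -> bool:
    """A spot is kept only if both channels meet every criterion."""
    return channel_passes(red) and channel_passes(green)


good = ChannelStats(pct_above_bg_2sd=85.0, median=1200.0, background=150.0, snr=6.0)
dim = ChannelStats(pct_above_bg_2sd=20.0, median=160.0, background=150.0, snr=1.1)
print(spot_passes(good, good), spot_passes(good, dim))
# True False
```

Spots that fail would be flagged and excluded before normalization, so that weak or noisy features do not distort the Lowess fit.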
Since each promoter contains many probes, for each promoter region we chose the probe with the best p-value for inclusion in the "bound" distribution and compared these probes across the various experimental treatments and replicates. We found that between replicates these "best probes" were within 80 bases of each other (the resolution of the array) 90% of the time. We used the following criteria to classify promoters within individual experiments: strongly bound promoters had probes that were classified in the "bound" distribution with a p-value less than 0.05. Weakly bound promoters were those whose probes did not fall significantly within the "unbound" distribution at a p-value threshold of 0.05. Unbound promoters were those whose probes fell within the "unbound" distribution with a p-value less than 0.05. As Figure 2D illustrates, by combining replicate experiments, we were able to classify each promoter as "highly on" (both replicates were strongly bound), "medium on" (one replicate was strongly bound, the other weakly bound), "low on" (both replicates were weakly bound), "weakly off" (the replicates did not agree, so we fell back on the null hypothesis of no binding), or "strongly off" (both replicates showed an unbound state).
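The replicate-combination rule can be sketched as a small lookup. This is an illustrative reading of the scheme rather than the code used in this study: the function name and the labels "strong", "weak" and "unbound" are hypothetical, and we assume that "replicates do not agree" means one replicate was unbound while the other was strongly or weakly bound.

```python
VALID_CALLS = {"strong", "weak", "unbound"}


def combine_replicates(rep1: str, rep2: str) -> str:
    """Combine two per-replicate promoter calls into a five-level class."""
    if rep1 not in VALID_CALLS or rep2 not in VALID_CALLS:
        raise ValueError(f"unknown call: {rep1!r}, {rep2!r}")
    pair = {rep1, rep2}  # order of the replicates does not matter
    if pair == {"strong"}:
        return "highly on"
    if pair == {"strong", "weak"}:
        return "medium on"
    if pair == {"weak"}:
        return "low on"
    if pair == {"unbound"}:
        return "strongly off"
    # Assumed meaning of disagreement: one bound call, one unbound call.
    return "weakly off"


# Hypothetical promoter calls from two replicate experiments.
examples = {
    "promoter_A": combine_replicates("strong", "strong"),
    "promoter_B": combine_replicates("weak", "strong"),
    "promoter_C": combine_replicates("unbound", "weak"),
}
```

Treating the two calls as an unordered pair keeps the rule symmetric, so swapping the replicates never changes the class assigned to a promoter.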
- TSS:
transcription start site
- ChIP-chip:
chromatin immunoprecipitation (ChIP) followed by microarray analysis
- CAGE:
cap analysis gene expression
- RNA Pol II:
RNA polymerase II.
Hochheimer A, Tjian R: Diversified transcription initiation complexes expand promoter selectivity and tissue-specific gene expression. Genes Dev. 2003, 17 (11): 1309-1320. 10.1101/gad.1099903.
Ayoubi TA, Van De Ven WJ: Regulation of gene expression by alternative promoters. Faseb J. 1996, 10 (4): 453-460.
Sandelin A, Carninci P, Lenhard B, Ponjavic J, Hayashizaki Y, Hume DA: Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007, 8 (6): 424-436. 10.1038/nrg2026.
Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, Yamamoto J, Sekine M, Tsuritani K, Wakaguri H, Ishii S, Sugiyama T, Saito K, Isono Y, Irie R, Kushida N, Yoneyama T, Otsuka R, Kanda K, Yokoi T, Kondo H, Wagatsuma M, Murakawa K, Ishida S, Ishibashi T, Takahashi-Fujii A, Tanase T, Nagai K, Kikuchi H, Nakai K, Isogai T, Sugano S: Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res. 2006, 16 (1): 55-65. 10.1101/gr.4039406.
Bajic VB, Tan SL, Christoffels A, Schonbach C, Lipovich L, Yang L, Hofmann O, Kruger A, Hide W, Kai C, Kawai J, Hume DA, Carninci P, Hayashizaki Y: Mice and men: their promoter properties. PLoS Genet. 2006, 2 (4): e54. 10.1371/journal.pgen.0020054.
Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM: Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006, 16 (1): 1-10. 10.1101/gr.4222606.
Davuluri RV, Suzuki Y, Sugano S, Plass C, Huang TH: The functional consequences of alternative promoter use in mammalian genomes. Trends Genet. 2008.
Zavolan M, Kondo S, Schonbach C, Adachi J, Hume DA, Hayashizaki Y, Gaasterland T: Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 2003, 13 (6B): 1290-1300. 10.1101/gr.1017303.
Bourdon JC, Fernandes K, Murray-Zmijewski F, Liu G, Diot A, Xirodimas DP, Saville MK, Lane DP: p53 isoforms can regulate p53 transcriptional activity. Genes Dev. 2005, 19 (18): 2122-2137. 10.1101/gad.1339905.
Marcu KB, Bossone SA, Patel AJ: myc function and regulation. Annu Rev Biochem. 1992, 61: 809-860. 10.1146/annurev.bi.61.070192.004113.
Agarwal VR, Bulun SE, Leitch M, Rohrich R, Simpson ER: Use of alternative promoters to express the aromatase cytochrome P450 (CYP19) gene in breast adipose tissues of cancer-free and breast cancer patients. J Clin Endocrinol Metab. 1996, 81 (11): 3843-3849. 10.1210/jc.81.11.3843.
Orban TI, Olah E: Emerging roles of BRCA1 alternative splicing. Mol Pathol. 2003, 56 (4): 191-197. 10.1136/mp.56.4.191.
Li CY, Zhu J, Wang JY: Ectopic expression of p73alpha, but not p73beta, suppresses myogenic differentiation. J Biol Chem. 2005, 280 (3): 2159-2164. 10.1074/jbc.M411194200.
Landry JR, Rouhi A, Medstrand P, Mager DL: The Opitz syndrome gene Mid1 is transcribed from a human endogenous retroviral promoter. Mol Biol Evol. 2002, 19 (11): 1934-1942.
Podgorski I, Sloane BF: Cathepsin B and its role(s) in cancer progression. Biochem Soc Symp. 2003, 263-276.
Dehm SM, Bonham K: Regulation of alternative SRC promoter usage in HepG2 hepatocellular carcinoma cells. Gene. 2004, 337: 141-150. 10.1016/j.gene.2004.04.021.
Pampalakis G, Kurlender L, Diamandis EP, Sotiropoulou G: Cloning and characterization of novel isoforms of the human kallikrein 6 gene. Biochem Biophys Res Commun. 2004, 320 (1): 54-61. 10.1016/j.bbrc.2004.04.205.
Archey WB, Sweet MP, Alig GC, Arrick BA: Methylation of CpGs as a determinant of transcriptional activation at alternative promoters for transforming growth factor-beta3. Cancer Res. 1999, 59 (10): 2292-2296.
Kamat A, Hinshelwood MM, Murry BA, Mendelson CR: Mechanisms in tissue-specific regulation of estrogen biosynthesis in humans. Trends Endocrinol Metab. 2002, 13 (3): 122-128. 10.1016/S1043-2760(02)00567-2.
Thierry-Mieg D, Thierry-Mieg J, Potdevin M, Sienkiewicz M: AceView: Identification and functional annotation of cDNA-supported genes in higher organisms. Unpublished. 2007
Kim N, Shin S, Lee S: ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res. 2005, 15 (4): 566-576. 10.1101/gr.3030405.
Siepel A, Haussler D: Computational identification of evolutionarily conserved exons. RECOMB '04. 2004
Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29 (4): 412-417. 10.1038/ng780.
Bajic VB, Seah SH: Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Res. 2003, 13 (8): 1923-1929.
Sun H, Palaniswamy SK, Pohar TT, Jin VX, Huang TH, Davuluri RV: MPromDb: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-chip experimental data. Nucleic Acids Res. 2006, 34 (Database issue): D98-103. 10.1093/nar/gkj096.
Takeda J, Suzuki Y, Nakao M, Kuroda T, Sugano S, Gojobori T, Imanishi T: H-DBAS: alternative splicing database of completely sequenced and manually annotated full-length cDNAs based on H-Invitational. Nucleic Acids Res. 2007, 35 (Database issue): D104-9. 10.1093/nar/gkl854.
Kapranov P, Willingham AT, Gingeras TR: Genome-wide transcription and the implications for genomic organization. Nat Rev Genet. 2007, 8 (6): 413-423. 10.1038/nrg2083.
Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS: DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nat Methods. 2006, 3 (7): 503-509. 10.1038/nmeth888.
Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, Rosenzweig E, Goldy J, Haydock A, Weaver M, Shafer A, Lee K, Neri F, Humbert R, Singer MA, Richmond TA, Dorschner MO, McArthur M, Hawrylycz M, Green RD, Navas PA, Noble WS, Stamatoyannopoulos JA: Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods. 2006, 3 (7): 511-518. 10.1038/nmeth890.
Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M, Sasaki D, Imamura K, Kai C, Harbers M, Hayashizaki Y, Carninci P: CAGE: cap analysis of gene expression. Nat Methods. 2006, 3 (3): 211-222. 10.1038/nmeth0306-211.
Kim TH, Barrera LO, Zheng M, Qu C, Singer MA, Richmond TA, Wu Y, Green RD, Ren B: A high-resolution map of active promoters in the human genome. Nature. 2005, 436 (7052): 876-880. 10.1038/nature03877.
Guenther MG, Levine SS, Boyer LA, Jaenisch R, Young RA: A chromatin landmark and transcription initiation at most promoters in human cells. Cell. 2007, 130 (1): 77-88. 10.1016/j.cell.2007.05.042.
Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129 (4): 823-837. 10.1016/j.cell.2007.05.009.
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007, 448 (7153): 553-560. 10.1038/nature06008.
Frasor J, Danes JM, Komm B, Chang KC, Lyttle CR, Katzenellenbogen BS: Profiling of estrogen up- and down-regulated gene expression in human breast cancer cells: insights into gene networks and pathways underlying estrogenic control of proliferation and cell phenotype. Endocrinology. 2003, 144 (10): 4562-4574. 10.1210/en.2003-0567.
Pearce ST, Jordan VC: The biological role of estrogen receptors alpha and beta in cancer. Crit Rev Oncol Hematol. 2004, 50 (1): 3-22. 10.1016/j.critrevonc.2003.09.003.
Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D: The UCSC Known Genes. Bioinformatics. 2006, 22 (9): 1036-1046. 10.1093/bioinformatics/btl048.
Kawaji H, Kasukawa T, Fukuda S, Katayama S, Kai C, Kawai J, Carninci P, Hayashizaki Y: CAGE Basic/Analysis Databases: the CAGE resource for comprehensive promoter analysis. Nucleic Acids Res. 2006, 34 (Database issue): D632-6. 10.1093/nar/gkj034.
Khalili A, Potter D, Yan P, Li L, Gray J, Huang T, Lin S: Gamma-Normal-Gamma Mixture Model for Detecting Differentially Methylated Loci in Three Breast Cancer Cell Lines. Cancer Informatics. 2007, 2: 43-54.
Wang J, Ungar LH, Tseng H, Hannenhalli S: MetaProm: a neural network based meta-predictor for alternative human promoter prediction. BMC Genomics. 2007, 8: 374. 10.1186/1471-2164-8-374.
Tsuritani K, Irie T, Yamashita R, Sakakibara Y, Wakaguri H, Kanai A, Mizushima-Sugano J, Sugano S, Nakai K, Suzuki Y: Distinct class of putative "non-conserved" promoters in humans: comparative studies of alternative promoters of human and mouse genes. Genome Res. 2007, 17 (7): 1005-1014. 10.1101/gr.6030107.
Baek D, Davis C, Ewing B, Gordon D, Green P: Characterization and predictive discovery of evolutionarily conserved mammalian alternative promoters. Genome Res. 2007, 17 (2): 145-155. 10.1101/gr.5872707.
Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S, Sementchenko V, Fox EA, Silver PA, Gingeras TR, Liu XS, Brown M: Genome-wide analysis of estrogen receptor binding sites. Nat Genet. 2006, 38 (11): 1289-1297. 10.1038/ng1901.
Cheng AS, Jin VX, Fan M, Smith LT, Liyanarachchi S, Yan PS, Leu YW, Chan MW, Plass C, Nephew KP, Davuluri RV, Huang TH: Combinatorial analysis of transcription factor partners reveals recruitment of c-MYC to estrogen receptor-alpha responsive promoters. Mol Cell. 2006, 21 (3): 393-404. 10.1016/j.molcel.2005.12.016.
Peng S, Alekseyenko AA, Larschan E, Kuroda MI, Park PJ: Normalization and experimental design for ChIP-chip data. BMC Bioinformatics. 2007, 8: 219. 10.1186/1471-2105-8-219.
Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZD, Hartman S, Emanuelsson O, Stolc V, Weissman S, Gerstein MB, Ruan Y, Snyder M: Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. Genome Res. 2007, 17 (6): 898-909. 10.1101/gr.5583007.
Mattia M, Gottifredi V, McKinney K, Prives C: p53-Dependent p21 mRNA elongation is impaired when DNA replication is stalled. Mol Cell Biol. 2007, 27 (4): 1309-1320. 10.1128/MCB.01520-06.
Aiyar SE, Sun JL, Blair AL, Moskaluk CA, Lu YZ, Ye QN, Yamaguchi Y, Mukherjee A, Ren DM, Handa H, Li R: Attenuation of estrogen receptor alpha-mediated transcription through estrogen-stimulated recruitment of a negative elongation factor. Genes Dev. 2004, 18 (17): 2134-2146. 10.1101/gad.1214104.
Aiyar SE, Blair AL, Hopkinson DA, Bekiranov S, Li R: Regulation of clustered gene expression by cofactor of BRCA1 (COBRA1) in breast cancer cells. Oncogene. 2007, 26 (18): 2543-2553. 10.1038/sj.onc.1210047.
Martens JA, Laprade L, Winston F: Intergenic transcription is required to repress the Saccharomyces cerevisiae SER3 gene. Nature. 2004, 429 (6991): 571-574. 10.1038/nature02538.
Martianov I, Ramadass A, Serra Barros A, Chow N, Akoulitchev A: Repression of the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature. 2007, 445 (7128): 666-670. 10.1038/nature05519.
Blume SW, Meng Z, Shrestha K, Snyder RC, Emanuel PD: The 5'-untranslated RNA of the human dhfr minor transcript alters transcription pre-initiation complex assembly at the major (core) promoter. J Cell Biochem. 2003, 88 (1): 165-180. 10.1002/jcb.10326.
Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664.
Fan M, Yan PS, Hartman-Frey C, Chen L, Paik H, Oyer SL, Salisbury JD, Cheng AS, Li L, Abbosh PH, Huang TH, Nephew KP: Diverse gene expression and DNA methylation profiles correlate with differential adaptation of breast cancer cells to the antiestrogens tamoxifen and fulvestrant. Cancer Res. 2006, 66 (24): 11954-11966. 10.1158/0008-5472.CAN-06-1666.
Leu YW, Yan PS, Fan M, Jin VX, Liu JC, Curran EM, Welshons WV, Wei SH, Davuluri RV, Plass C, Nephew KP, Huang TH: Loss of estrogen receptor signaling triggers epigenetic silencing of downstream targets in breast cancer. Cancer Res. 2004, 64 (22): 8184-8192. 10.1158/0008-5472.CAN-04-2045.
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA binding proteins. Science. 2000, 290 (5500): 2306-2309. 10.1126/science.290.5500.2306.
The authors would like to thank Dustin Potter for providing us with source code, and Sandya Liyanarachchi for her statistical advice. This work was supported by National Human Genome Research Institute grant R01HG003362 to RVD.
GACS designed the computational methods and performed the statistical analyses. JW designed the experimental methods and performed the ChIP-chip experiments. PY coordinated the microarray experiments. CP participated in the design of the study. RVD and THMH formulated and directed the design of the study. All authors read and approved the final manuscript.
Gregory AC Singer and Jiejun Wu contributed equally to this work.