Genome-wide analysis of alternative promoters of human genes using a custom promoter tiling array
© Singer et al. 2008
Received: 10 December 2007
Accepted: 25 July 2008
Published: 25 July 2008
Independent lines of evidence suggested that a large fraction of human genes possess multiple promoters driving gene expression from distinct transcription start sites. Understanding which promoter is employed in which cellular context is required to unravel gene regulatory networks within the cell.
We have developed a custom microarray platform that tiles roughly 35,000 alternative putative promoters from nearly 7,000 genes in the human genome. To demonstrate the utility of this array platform, we have analyzed the patterns of promoter usage in 17β-estradiol (E2)-treated and untreated MCF7 cells and show widespread usage of alternative promoters. Most intriguingly, we show that the downstream promoter in E2-sensitive multiple promoter genes tends to be very close to the 3'-terminus of the gene, suggesting exotic mechanisms of expression regulation in these genes.
The usage of alternative promoters greatly multiplies the transcriptional complexity available within the human genome. The fact that many of these promoters are incapable of driving the synthesis of a meaningful protein-encoding transcript further complicates the story.
Alternative promoters are of particular interest because their aberrant expression has been linked to a number of diseases, particularly cancer. There are a number of experimentally well-characterized multiple promoters for known genes, for example TP53, MYC, CYP19A1, BRCA1, P73, MID1, CTSB, SRC, KLK6 and TGFB3. CYP19A1 is a well-characterized example that has five known alternative promoters, many of which are separated by more than 10 kb and are therefore regulated by completely non-overlapping promoters. Alternative first exons Ex-1.1, Ex-1.3/Ex-1.4, and Ex-1f splice with Ex-2 to encode the 5' untranslated regions (UTRs) of CYP19A1 mRNA in the placenta, adipose tissue, and brain, respectively. Additionally, in gonads, transcription starts just 39 bp upstream of the translation initiation codon in exon-2. The use of alternative non-coding first exons in the CYP19A1 transcripts does not alter the protein sequence, as the different 5'UTRs splice into a common second exon (exon-2) that contains the translation initiation codon. It is known that these various promoters are used in a tissue-dependent manner, but the promoter upstream of exon Ex-1.4 is aberrantly expressed in breast cancer tissue, aggravating the disease.
Many putative gene promoters have been identified either through mapping of expressed sequence tags (ESTs) to the genome (Acembly, ECGene), through sequence conservation studies with other organisms, or through de novo computational prediction (e.g., FirstEF, DragonGSF). Databases such as MPromDb and H-DBAS provide information about well-curated promoters and alternatively spliced transcripts identified by aligning completely sequenced and precisely annotated full-length cDNAs. Recently, intensive efforts have been invested in establishing genome-wide profiling methods to identify regulatory regions, including alternative transcription start sites and the upstream promoter regions in the human and mouse genomes. Currently, three approaches have been applied for this purpose. The first is based on the decreased nucleosome occupancy and increased sensitivity to DNase of active promoter regions. Two such methods, called DNase-chip and DNase-array, have been created to detect transcribed promoters and transcripts [28, 29]. The second is cap analysis gene expression (CAGE), which combines full-length cDNA libraries with SAGE technology to capture the 5' ends of transcripts. The third uses ChIP-chip (Chromatin ImmunoPrecipitation followed by microarray analysis) to profile the binding positions of the RNA polymerase II preinitiation complex. The data from these studies provide evidence of large-scale alternative splicing and widespread use of alternative promoters throughout mammalian genomes. Most of these methods cannot predict the mRNA sequence produced from a given promoter, and therefore constructing a traditional cDNA microarray to detect its expression is impossible. Moreover, two promoters may produce mRNA isoforms that are nearly indistinguishable, again making expression microarrays difficult to design. One alternative is to use ChIP-chip to detect the binding of RNA polymerase II to the genome.
Although there is evidence that the presence of RNA polymerase II in the promoter does not perfectly correlate with active transcription, there does exist a correlation between the two events and therefore RNA polymerase II binding is a good approximation of transcriptional activity [33, 34]. Here, we have taken an intermediate approach, where we first annotated all possible putative promoters in the human genome by integrative bioinformatics analyses. Using these annotations, we designed 60-mer probes complementary to, and tiling, the core promoter regions (both known and putative) of a subset of genes that have at least two annotated promoters. We tested this array by conducting ChIP-chip using an antibody against RNA polymerase II (RNA Pol II) in MCF7 cells with and without E2 treatment. It is well known that the estrogen receptor (ER) can act both as an activator and as a repressor of specific target genes, and that these events can then affect cell division and breast cancer progression [35, 36]. Knowledge about which of the alternative promoters of ER-regulated genes are active and inactive in E2-treated and untreated MCF7 cells would lead to a better understanding of their effects in breast cancer development. Several novel putative promoters were found to be active before and after E2 treatment. Interestingly, we found that in genes with more than one putative promoter, downstream promoters are much more likely to be affected by E2 treatment than upstream promoters, suggesting interesting mechanisms of gene regulation in multiple promoter genes.
In order to design an alternative promoter array, we first used a bioinformatics approach to annotate all known and putative promoters in the human genome. Using evidence from three sources (UCSC Known Genes, FirstEF, and Riken CAGE tags), we found evidence for more than 185,000 transcription start sites separated by 500 bases or more in the human genome. We took a gene-centric approach to our microarray design, choosing genes that had two or more known or putative promoters. In the end, about 34,000 known or putative promoters were selected for our array, covering about 7,000 genes. The median number of promoters per gene is three (Figure 1B). 60-mer oligonucleotide probes were designed to tile the region from -200 to +200 bases surrounding each known and putative transcription start site. Because of limitations on probe design, not all regions could be effectively covered, but on average the spacing is approximately 80 bases from the end of one probe to the beginning of the next.
Wang et al. recently noted that the 5'-most promoter of a gene tends to be CpG related, while more downstream promoters are less likely to be. We identified promoters that were active in one or both of our treatments, and classified them as either being associated with the 5'-end of the gene (if the promoter was located within 500 bases of the 5' end of the gene's annotation) or as downstream promoters (more than 500 bases away from the 5' end of the gene). Similar to the findings of Wang et al., we found that 92% of all 5'-end promoters are associated with a CpG island, whereas only 23% of downstream promoters are.
Table: Activity of promoters with various combinations of supporting evidence. Columns: supporting evidence (KnownGene, CAGE, FirstEF), number of promoters, number of active promoters.
Table: Promoters validated by ChIP-PCR and the lines of evidence used to identify them. Columns: genomic location (hg18), supporting evidence. The five recoverable evidence entries all read FirstEF + CAGE tags.
Although genome tiling arrays are increasingly becoming a viable alternative to focused microarrays, they remain significantly more expensive than focused microarrays, and their signal-to-noise ratio is very low due to the large numbers of inactive probes and the lack of probe design considerations. Another mechanism for studying alternative promoters is the use of traditional expression arrays that have been designed to specifically interrogate particular gene isoforms. Unfortunately, in a large number of cases, mRNA isoforms are not known for putative promoters, and many isoforms that originate at different promoters differ only in the first exon - a small percentage of the entire molecule, making it difficult to distinguish between the various isoforms. High-throughput sequencing techniques are a recent advance that provides an attractive alternative to microarray-based techniques; however, there is evidence that ChIP-chip is more sensitive than ChIP-sequencing techniques.
Traditionally, expression analysis was used to define promoter activity. However, one recent report has found that a number of genes experience transcription initiation but show no detectable full-length transcripts. Nevertheless, additional reports indicate that a correlation does exist between Pol II occupancy and gene activation [33, 34]. The findings of Guenther et al. may be explained by post-transcriptional regulation, but in any case we believe that the presence of Pol II in the promoter is a good approximation of promoter activity, although further experiments are still necessary to define and characterize this relationship.
Here, we have presented a novel 244 K microarray that is capable of measuring alternative promoter usage in over 34,000 putative promoters from nearly 7,000 genes. This platform is suitable for indirect expression analyses using RNA polymerase II ChIP-chip, as we have shown in this paper, but it is also suitable for methylation-based studies using DMH or meDIP experiments (since more than 5,000 of the putative promoters fall within CpG islands), or for ChIP-chip experiments using other proteins of interest, such as transcription factors or histone modification signatures. We have demonstrated clear evidence for alternative promoter activities within genes, including the verification of a number of putative promoters. These results suggest that a large fraction of genes in the human genome possess undiscovered promoters and transcription start sites, which agrees with findings based on the mapping of ESTs to the genome [20, 21], and the mapping of 5' oligo cap cDNA libraries to the genome.
Most intriguingly, we discovered that there is a distinct bias for the downstream promoter in E2-sensitive two-promoter genes to be very close to the 3' end of the gene, whereas no such bias exists in E2-insensitive genes. These promoters are very unlikely to produce a functional transcript of any sort, and we therefore speculate that their purpose is merely to regulate the expression of the transcript initiated at the upstream promoter by "blocking" the progression of the RNA polymerase II complex. This "stalling" mechanism has been observed in other contexts. For example, inhibiting DNA replication was recently found to cause RNA polymerase II to stall during the transcription of p21. Similarly, the cofactor of BRCA1 (COBRA1) is known to cause stalling of the RNA polymerase II complex proximal to the promoter [48, 49]. However, we can think of no reason for "blocking" promoters to have a bias towards the 3' end of the gene, since this blocking action could be realized at any point relative to the primary promoter. An alternative possibility is that promoters near the 3' end of the gene drive expression of an interfering RNA that is either antisense to the primary transcript or capable of inhibiting the formation and progression of the RNA polymerase II complex at the primary promoter. Such noncoding, interfering RNAs are known to regulate expression of the DHFR gene in humans, for example, although in this case the interfering RNA is transcribed from a promoter that lies upstream of the primary promoter [51, 52]. Much more work will need to be performed in the future to identify the regulatory action that these 3'-UTR promoters have on their primary transcripts, if any.
We have demonstrated clear evidence of alternative promoter activity for approximately 7,000 human genes, using a 244 K custom microarray that spans 34,000 putative promoters. Our results suggest that a large fraction of genes in the human genome possess undiscovered alternative promoters, which agrees with findings based on the mapping of ESTs and CAGE tags to the human genome. We found that significantly more downstream promoters than upstream promoters were affected by E2 treatment. In addition, there is a distinct bias for the downstream promoter in E2-sensitive two-promoter genes to be very close to the 3' end of the gene, whereas no such bias exists in E2-insensitive genes. The custom microarray can also be used for epigenome analyses, such as methylation-based studies using DMH or meDIP experiments. The present data will aid the discovery of novel promoters and the ongoing annotation of alternative promoters of human genes in different experimental conditions.
We considered three sources of evidence for identifying promoter targets for our microarray. The first was the 5'-end of genes as identified in the UCSC Known Gene track, which is largely based on the alignment of RefSeq mRNAs to the human genome. A second line of evidence was the database of CAGE tags sequenced by the Riken group. These tags capture ~20 bases at the 5' end of messenger RNAs, and have been mapped back to the human genome. We used the UCSC LiftOver tool to convert Riken's hg17 human genome coordinates to the more recent hg18 genome. Our final line of evidence was ab initio promoter predictions generated by the FirstEF program.
Each line of evidence identifies a transcription start site (TSS). We considered TSSs separated by > 500 bp to be distinct promoters - a commonly used criterion. Although there are undoubtedly transcription factor binding sites that extend beyond this region, this distance is great enough for the core promoters of each TSS to be distinct, and we can therefore consider these TSSs to be independently regulated to a large extent. TSSs were clustered using a neighbor-joining algorithm until all clusters were separated by at least 500 bases. The coordinates of these clusters were then extended 200 bases up- and downstream.
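As a rough illustration of the clustering and extension step, the following Python sketch merges TSS coordinates on a single chromosome and strand. It is a simplified stand-in that chains sorted positions greedily, not the neighbor-joining implementation used for the array; the function name `cluster_tss` and its parameters are hypothetical.

```python
def cluster_tss(positions, min_gap=500, flank=200):
    """Merge TSS positions (one chromosome, one strand) into promoter regions.

    Simplified greedy chaining on sorted coordinates (an assumption, not the
    paper's neighbor-joining code): a TSS closer than min_gap to the last
    member of the current cluster joins it, otherwise it starts a new cluster.
    Each cluster is then extended by `flank` bases on both sides.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][1] < min_gap:
            clusters[-1][1] = pos          # extend the current cluster
        else:
            clusters.append([pos, pos])    # open a new cluster
    return [(max(0, start - flank), end + flank) for start, end in clusters]


regions = cluster_tss([1000, 1100, 1700, 5000, 5499])
print(regions)
```

With these example coordinates, 1000 and 1100 merge, 1700 stays separate (600 bases from 1100), and 5000 and 5499 merge because they are only 499 bases apart.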
Each promoter region was aligned to the genome using BLAT in order to discover regions that are not unique. Alignments that were longer than 55 bases (90% of the probe length) were masked, as were 60-mers within the sequence that had > 85% or < 50% G+C. From the remaining unmasked regions of each promoter, probes were selected such that the average spacing would be roughly 100 bases, but that the spacing between two successive probes would be no more than 300 bases. In the end, the true average spacing is 80 bases.
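The G+C filter and the spacing rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the BLAT uniqueness masking is omitted, spacing is measured from the end of one probe to the start of the next, the greedy selection does not enforce the 300-base upper limit, and all function names are hypothetical.

```python
PROBE_LEN = 60  # probe length in bases, as stated in the text


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def allowed_probe_starts(region, min_gc=0.50, max_gc=0.85):
    """Start offsets of 60-mers whose G+C content is within [min_gc, max_gc]."""
    starts = []
    for i in range(len(region) - PROBE_LEN + 1):
        window = region[i:i + PROBE_LEN]
        if min_gc <= gc_fraction(window) <= max_gc:
            starts.append(i)
    return starts


def pick_probes(starts, target_gap=100):
    """Greedily choose starts so each probe begins at least target_gap bases
    after the end of the previously chosen probe."""
    chosen = []
    for s in starts:
        if not chosen or s - (chosen[-1] + PROBE_LEN) >= target_gap:
            chosen.append(s)
    return chosen


region = "GCGAT" * 80  # toy 400-base region with 60% G+C in every 60-mer
print(pick_probes(allowed_probe_starts(region)))
```

In a real design, masked stretches would force some gaps to exceed the target, which is why an explicit upper bound such as the 300-base limit is needed.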
Not all genes could be put on the array, so to prioritize we assigned each gene a score. Three points were awarded for each promoter supported by "known gene" evidence, two points for those supported by CAGE tag evidence, and one point for FirstEF evidence. Genes were then ranked by their total score, and only the best-scoring genes were included on the array. In the end, the roughly 244,000 probes cover 34,486 promoter regions from 6,949 genes, with a median tiling coverage of 5 probes per promoter. The median number of promoters per gene on the array is 3, although the range is from 1 to over 30 (Figure 1B).
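The scoring scheme can be illustrated with a short Python sketch. It assumes that points are additive when a promoter is supported by more than one line of evidence, which the text does not state explicitly; the evidence labels and function names are hypothetical.

```python
# Points per line of evidence, as given in the text.
EVIDENCE_POINTS = {"known_gene": 3, "cage": 2, "firstef": 1}


def gene_score(promoters):
    """Total score for one gene.

    promoters: list of sets, one set of evidence labels per promoter.
    Assumption: a promoter with several lines of evidence earns the sum.
    """
    return sum(EVIDENCE_POINTS[label] for evidence in promoters for label in evidence)


def rank_genes(genes):
    """Gene names sorted from highest to lowest total score."""
    return sorted(genes, key=lambda name: gene_score(genes[name]), reverse=True)


genes = {
    "geneA": [{"known_gene", "cage"}, {"firstef"}],  # 3 + 2 + 1 = 6
    "geneB": [{"cage"}, {"firstef"}],                # 2 + 1 = 3
}
print(rank_genes(genes))
```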
MCF-7 human breast cancer cells (American Type Culture Collection, Manassas, VA) were maintained in growth medium (MEM with 2 mM L-glutamine, 0.1 mM non-essential amino acids, 50 units/ml penicillin, 50 μg/ml streptomycin, 6 ng/ml insulin, and 10% FBS) as described by Fan et al. Prior to all experiments, cells were cultured in hormone-free basal medium (phenol-red free MEM with 2 mM L-glutamine, 0.1 mM non-essential amino acids, 50 units/ml penicillin, 50 μg/ml streptomycin, and 3% charcoal-dextran stripped FBS) for three days.
Five million MCF-7 cells with and without E2 treatment (10 nM, 3 h) were crosslinked with 1% formaldehyde for 10 min, at which point 0.125 M glycine was used to stop the cross-linking. Chromatin immunoprecipitation was performed using a ChIP assay kit (Upstate Biotechnology, Charlottesville, VA) as described. The antibodies, which specifically target the initiation form of Pol II, were purchased from Santa Cruz Biotechnology (sc-899X, Santa Cruz, CA). Ligation-mediated PCR was applied to 20 ng of ChIP DNA and input control as described by Ren et al. Briefly, after cross-linking, cells were lysed and sonication was used to shear the chromatin to fragments of around 500 bp. The cell lysate was then subjected to immunoprecipitation. After immunoprecipitation, part of the supernatant was removed from the lysate as input control. The primers used in ligation-mediated PCR were: oligo JW102, 5'-GCGGTGACCCGGGAGATCTGAATTC-3' and JW103, 5'-GAATTCAGATC-3'. Two μg of amplified ChIP DNA and input control were then labeled with Cy5 and Cy3 fluorescent dyes (Amersham, Buckinghamshire, UK) and cohybridized to the custom alternative promoter array. Technical duplicates were performed for each sample of ChIP DNA. The slides were washed with three wash buffers (Buffer 1, 6× SSPE + 0.005% sarcosine; Buffer 2, 0.06× SSPE; Buffer 3, anti-oxidant mixture in acetonitrile purchased from Agilent) in series at room temperature.
ChIP was conducted in the same manner as in the ChIP-chip experiments, described above. The pooled DNA from ChIP and input control were first measured by spectrophotometer (NanoDrop, Wilmington, DE). Quantitative PCR with SYBR green-based detection (Applied Biosystems, Foster City, CA) was performed as described previously. In brief, primers were designed using Primer Express software (Applied Biosystems, Foster City, CA). Quantitative ChIP-PCR values were normalized against values from a standard curve (50 to 0.08 ng, R-squared > 0.99) constructed by input control with the same primer sets.
The Qiagen RNeasy kit (Valencia, CA) was used to extract total RNA from MCF-7 cells with or without E2 treatment according to the manufacturer's manual. Two μg of RNA was first treated with DNase I (Invitrogen, Carlsbad, CA) to remove potential DNA contamination and was then reverse transcribed with SuperScript II reverse transcriptase (Invitrogen, Carlsbad, CA). Quantitative RT-PCR was performed using SYBR green (Applied Biosystems, Foster City, CA) as a marker for DNA amplification on a 7500 Real-Time PCR System apparatus (Applied Biosystems, Foster City, CA). The relative mRNA level of a given locus was calculated by relative quantification of gene expression (Applied Biosystems, Foster City, CA) with glucose phosphate isomerase mRNA as an internal control.
The washed slides were scanned by a GenePix 4000A scanner (Axon, Union City, CA) and the acquired microarray images were analyzed using GenePix 6.0 software. Briefly, the user-selectable laser power settings for Cy5 (635 nm, red) and Cy3 (532 nm, green) were adjusted so that the overall Cy5 to Cy3 ratios were close to 1 and that the signal intensities spanned the entire spectrum with minimal signal saturation at the high intensity range. When these conditions were satisfied, the microarray was scanned and a grid file was loaded to mark the general location of the scanned image. The GenePix 6.0 software performed a spot finding function and captured intensity-related information in a GPR file.
The complete array dataset can be viewed in the ArrayExpress microarray database (accession number E-MEXP-1644). GPR files were passed through a custom-built quality control filter which flagged all probes that did not meet all of the following criteria in both the green and red channels: (1) percentage of feature pixels with intensity above background plus two standard deviations (% > B + 2SD) greater than 30; (2) median - background > 0; (3) signal-to-noise ratio greater than 1.5. These filtered results were then normalized using the default parameters (plus Lowess normalization) in Agilent's Chip Analytics software version 1.3. A post-normalization MA plot is shown in Figure 2B. We then used a modified version of the mixture model described by Khalili et al. (reducing their gamma+normal+gamma model to a simple gamma+normal model) to classify probes into one of two groups: bound or not bound. Figure 2C shows the fit of the gamma+normal mixture model to our data. One benefit of this type of analysis is that we are able to directly estimate our false positive rates based on a probe's probability of assignment to the "unbound" distribution.
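The three quality control criteria can be expressed as a small Python check. This is only a sketch of the stated thresholds, not the custom filter itself; the `Channel` class and its field names are hypothetical and do not correspond to specific GPR column names.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Per-probe measurements for one dye channel (hypothetical field names)."""
    pct_above_bg_2sd: float  # percent of feature pixels > background + 2 SD
    median: float            # median feature intensity
    background: float        # local background intensity
    snr: float               # signal-to-noise ratio


def channel_ok(ch):
    """True if the channel meets all three criteria from the text."""
    return (
        ch.pct_above_bg_2sd > 30
        and ch.median - ch.background > 0
        and ch.snr > 1.5
    )


def probe_passes(red, green):
    """A probe is kept only if both channels pass; otherwise it is flagged."""
    return channel_ok(red) and channel_ok(green)


good = Channel(pct_above_bg_2sd=85.0, median=1200.0, background=150.0, snr=6.2)
weak = Channel(pct_above_bg_2sd=20.0, median=180.0, background=150.0, snr=1.1)
print(probe_passes(good, good), probe_passes(good, weak))
```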
Since each promoter contains many probes, for each promoter region we chose the probe with the best p-value for inclusion into the "bound" distribution and compared these across the various experimental treatments and replicates. We found that between replicates these "best probes" were within 80 bases of each other (the resolution of the array) 90% of the time. We used the following criteria to classify promoters within individual experiments: strongly bound promoters had probes that were classified in the "bound" distribution with a p-value less than 0.05. Weakly bound promoters were those that did not significantly fall within the "unbound" distribution with a p-value of 0.05. Unbound promoters were those whose probes fell within the "unbound" distribution with a p-value less than 0.05. As Figure 2D illustrates, by combining replicate experiments, we were able to classify each promoter into "highly on" (both replicates were strongly bound), "medium on" (one replicate was strongly bound, the other weakly bound), "low on" (both replicates are weakly bound), "weakly off" (replicates don't agree, so we fall back on the null hypothesis of no binding), or "strongly off" (both replicates show an unbound state).
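The replicate-combination rule can be written as a lookup table. The Python sketch below takes the per-replicate calls as given and assumes that every combination not explicitly named in the text (for example, strongly bound in one replicate and unbound in the other) counts as disagreement and is therefore "weakly off"; the labels and function name are hypothetical.

```python
VALID_CALLS = {"strong", "weak", "unbound"}

# Keys are alphabetically sorted pairs of per-replicate calls.
COMBINED = {
    ("strong", "strong"): "highly on",
    ("strong", "weak"): "medium on",
    ("weak", "weak"): "low on",
    ("unbound", "unbound"): "strongly off",
}


def combine_replicates(call_a, call_b):
    """Combine two per-replicate promoter calls into one activity class.

    Assumption: any pair not listed in COMBINED is treated as disagreement
    between replicates and falls back to "weakly off".
    """
    for call in (call_a, call_b):
        if call not in VALID_CALLS:
            raise ValueError(f"unknown call: {call!r}")
    key = tuple(sorted((call_a, call_b)))
    return COMBINED.get(key, "weakly off")


print(combine_replicates("weak", "strong"))
print(combine_replicates("strong", "unbound"))
```

Sorting the pair before the lookup makes the result independent of replicate order.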
TSS: transcription start site
ChIP-chip: Chromatin Immunoprecipitation (ChIP) followed by microarray analysis
CAGE: cap analysis gene expression
RNA Pol II: RNA polymerase II.
The authors would like to thank Dustin Potter for providing us with source code, and Sandya Liyaranarachchi for her statistical advice. This work was supported by National Human Genome Research Institute grant R01HG003362 to RVD.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.