Diversity of core promoter elements comprising human bidirectional promoters
© Yang and Elnitski. 2008
Published: 16 September 2008
Bidirectional promoters lie between adjacent genes that are transcribed from opposite strands of DNA. The functional mechanisms underlying the activation of bidirectional promoters are currently uncharacterised. To define the core promoter elements of human bidirectional promoters, we mapped motifs for TATA, CCAAT, INR, BRE and DPE, as well as CpG-islands.
We found a consistently high correspondence between C+G content, CpG-island presence and an average expression level exceeding the median level for all genes in bidirectional promoters. These CpG-rich promoters showed discrete initiation patterns rather than the broad regions of transcription initiation that are typically seen for CpG-island promoters. CpG-islands encompass both TSSs within bidirectional promoters, providing an explanation for the symmetrical co-expression patterns of many of these genes. In contrast, TATA motifs appear to be asymmetrically positioned at one TSS or the other.
Our findings demonstrate that bidirectional promoters utilize a variety of core promoter elements to initiate transcription. CpG-islands dominate the regulatory landscape of this group of promoters.
The complexities of promoter regions are slowly being revealed with help from a series of groundbreaking studies on vast collections of promoter sequences. Proximal promoter regions (~500 bp upstream and 100 bp downstream of the TSS) typically contain the features necessary for basal levels of gene expression. Within the proximal promoter region, core promoter elements (CPEs) such as TATA, CCAAT, the initiator element (INR), the TFIIB recognition element (BRE) and the downstream promoter element (DPE) represent distinct functional entities that, along with CpG-islands, are responsible for basal promoter activity. Computational studies of large collections of promoters classify them by these components, either individually or in combination. Thus far, discrete functional mechanisms have not been fully elucidated for each class of promoter. However, patterns of transcription initiation have been defined for CpG-islands, which are typically broad stretches of DNA with numerous start sites, and for TATA box motifs, which have single well-defined start sites.
A relatively new category of promoters comprises bidirectional promoters. These regulatory regions fall between two genes and regulate transcription of the genes in opposite directions from the promoter region; this bidirectional nature contrasts with that of a typical uni-directional promoter. These promoters represent a subclass of the larger set of promoter sequences [2, 3]. Previous studies have shown that bidirectional promoters are enriched in the genome (Adachi and Lieber 2003), tend to be co-expressed and bind ets proteins [2, 6, 7]. One approach to elucidating the molecular mechanisms regulating bidirectional promoters is to map the content of CPEs. Since co-expression of both genes happens more frequently than expected by chance, an explanatory model would suggest symmetry of the promoter elements near the TSSs. This manuscript addresses the distribution of CPEs in human bidirectional promoters using computational analyses of current large-scale experimental datasets as well as motif analyses. We address the issues of C+G content, patterns of transcription initiation and symmetry of CPEs near the TSSs.
The earliest descriptions of functional promoter elements focused on the importance of a TATA motif in recruiting the essential RNA polymerase II molecule to the transcription start site (TSS). We now understand that the TATA-centric view of promoters describes only a minor proportion of promoters in eukaryotic cells.
The INR is a conserved sequence that encompasses the TSS and functions to direct accurate transcription initiation, either by itself or in conjunction with TATA or DPE. We found that 25.3% of bidirectional promoters and 30.8% of non-bidirectional promoters contained the INR motif at the functional position (Fig. 2B). The presence of the INR in both types of promoters was significantly higher than the frequency expected by random chance, which was 9.28% and 14.10%, respectively (p-values < 2.2e-16).
The BRE is located immediately upstream of the TATA box in some TATA-containing promoters. We found that 16.5% of bidirectional and 11.1% of non-bidirectional promoters contained this motif at the functional position (Fig. 2C). The presence of the BRE in both types of promoters was significantly higher than the frequency expected by random chance, which was 5.2% and 2.1%, respectively (p-values < 2.2e-16).
The CCAAT motif represents a consensus sequence that occurs 75–80 bases upstream of the TSS. We found that 12.9% of bidirectional promoters and 6.9% of non-bidirectional promoters contained this motif at the functional position (Fig. 2D). The presence of CCAAT in both types of promoters was significantly higher than the frequency expected by random chance, which was 0.66% and 0.91%, respectively (p-values < 2.2e-16).
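The comparison of an observed motif rate with its chance rate can be illustrated with a one-degree-of-freedom χ2 test on "motif present" versus "motif absent" counts. This is a minimal sketch rather than the authors' exact procedure: the function name is hypothetical, and the use of 1,369 bidirectional promoters as the denominator is an assumption taken from the gene-pair count given in the methods. The INR figures (25.3% observed, 9.28% expected) are used as the worked example.

```python
import math

def chi_square_vs_chance(n_promoters, observed_rate, expected_rate):
    """Chi-square goodness-of-fit test (1 degree of freedom) comparing an
    observed motif rate with the rate expected by chance."""
    # Counts for the two categories: motif present, motif absent.
    observed = [observed_rate * n_promoters, (1 - observed_rate) * n_promoters]
    expected = [expected_rate * n_promoters, (1 - expected_rate) * n_promoters]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Assumed denominator: 1,369 bidirectional promoters (hypothetical choice).
chi2, p = chi_square_vs_chance(1369, 0.253, 0.0928)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

Under these assumptions the statistic is far into the tail of the distribution, which is consistent with the reported p-values below 2.2e-16.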
Increasingly, high-throughput experimental studies are providing a wealth of information that is useful for deducing biologically relevant themes. Assays such as ChIP-chip or ChIP-seq are powerful investigative tools for revealing the presence of a protein bound to DNA. The cost and labor involved in such studies are large; however, the significance of these experimental results far exceeds that of any other method for obtaining binding information at this scale. For example, ChIP-chip data revealed the binding of RNA polymerase II at the collection of active promoters in the cell, providing a snapshot of the inner workings of the cell. We used the ChIP-seq data of Barski et al. for RNA polymerase II to determine which promoters were occupied by the transcription machinery.
Table: Core promoter elements at left and right TSSs. Columns: overlap TSS of left gene; overlap TSS of right gene; present at both TSSs.
Recently, sets of transcripts with precise initiation sites have been produced and mapped onto their positions in the genome. This experimental technique, known as cap-trapping or CAGE, precisely defines TSSs by capturing all transcripts at their first nucleotide (recognized by its methylated cap). This cap is "worn" at the beginning of the transcript, which corresponds to the "head" or beginning of the gene. Data generated by cap-trapping assays promise to significantly advance our knowledge of the transcriptome in any given cell type, refine our knowledge of the start sites of genes, and, by inference, pave the way for promoter analyses that examine the sequences immediately upstream and downstream of the captured TSSs.
Bidirectional promoters comprise a diverse set of core-promoter regulatory elements. A subset of these promoters contain TATA motifs, with notable enrichment at histone promoters. We did not find a balanced representation of TATA at both the left and right TSSs of gene pairs, including the histone genes. This result indicates that bidirectional promoters can employ different methods of regulation within a pair of genes. Furthermore, on the basis of CAGE tags, we found that 45% of genes were co-expressed. This approach excluded signals from downstream alternative promoters, which complicate measurements in microarray analyses, and confirmed that a large proportion of these promoters are co-expressed from the neighbouring TSSs. Bidirectional promoters coincided with CpG-islands more often than non-bidirectional promoters. This genomic feature may play a significant role in marking these regions as promoters as well as participating in Pol II recruitment.
We downloaded 56,722 protein-coding gene annotations from the UCSC Genome Browser hg18 database. These collapsed into 25,147 unique and non-overlapping gene clusters. Of these, 1,369 bidirectional gene pairs were present, defining bidirectional promoters (for 2,738 genes). Each gene in a bidirectional gene pair formed a head-to-head arrangement with its closest neighbour, and the intergenic distance between the TSS of a gene and that of its neighbour had to be within 1,000 bp. After excluding those pairs with too large an intergenic distance and those with anti-sense overlap at the 5' ends of the transcripts, we obtained 13,302 genes that did not form a head-to-head arrangement with the closest neighbour. These were designated non-bidirectional promoters. We also defined a negative control set. When a gene and its closest neighbour were transcribed in convergent directions, ending within 1,000 bp of each other, they were designated as tail-to-tail regions.
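The pair definitions above can be sketched in a few lines of code. The following is an illustrative sketch only, not the pipeline used in this study: the `Gene` record and `classify_pair` function are hypothetical names, genes are assumed to be given as adjacent, non-overlapping neighbours sorted by genomic position, and "within 1,000 bp" is interpreted as a gap of at most 1,000 bp.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"

    @property
    def tss(self):
        # Transcription starts at the left end on "+" and the right end on "-".
        return self.start if self.strand == "+" else self.end

    @property
    def tes(self):
        # The 3' end of the gene is the opposite end from the TSS.
        return self.end if self.strand == "+" else self.start

def classify_pair(left, right, max_gap=1000):
    """Classify two neighbouring genes, where `left` lies upstream of
    `right` in genomic coordinates."""
    if left.chrom != right.chrom or left.end >= right.start:
        return "other"  # different chromosomes or overlapping genes
    if left.strand == "-" and right.strand == "+":
        # Divergent transcription: both TSSs face the intergenic region.
        return "head-to-head" if right.tss - left.tss <= max_gap else "other"
    if left.strand == "+" and right.strand == "-":
        # Convergent transcription: both 3' ends face the intergenic region.
        return "tail-to-tail" if right.tes - left.tes <= max_gap else "other"
    return "other"  # both genes on the same strand

a = Gene("A", "chr1", 1000, 5000, "-")
b = Gene("B", "chr1", 5400, 9000, "+")
print(classify_pair(a, b))
```

In this toy example the two TSSs are 400 bp apart and the genes are divergently transcribed, so the pair is labelled head-to-head.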
For bidirectional promoters, we extracted the intervening DNA sequence between the TSSs, extended 100 bp downstream of the TSS of each gene. For non-bidirectional promoters, we extracted the sequence from 500 bp upstream to 100 bp downstream of the TSS. For tail-to-tail gene pairs, we extracted the region between the 3' ends of the genes plus 100 bp into the genes. We mapped the distributions and frequencies of five regulatory sites (TATA, CCAAT, DPE, INR and BRE) in these three types of genomic regions. Furthermore, we measured the occurrence of these promoter elements within restricted intervals that are known to be functional, leaving a small window on either side to allow for slightly imprecise localization. We searched for TATA[A|T]A[A|G|T] (TATA) between -40 and -20, [A|G][A|G]CCAAT[A|C|G][A|G] (CCAAT) between -108 and +9, [A|G|T][C|G][A|T][C|T][A|C|G][C|T] (DPE) between +24 and +34, [C|T][C|T]AN[A|T][C|T][C|T] (INR) between -15 and +15, and [G|C][G|C][G|A]CGCC (BRE) between -49 and -18. The observed occurrence rate was then calculated for each promoter element. Using the nucleotide frequencies in the promoter sequences, we obtained the probability of finding each CPE by chance per promoter. A χ2 test was performed to determine whether the difference between the occurrence rate expected by chance and the observed rate was significant.
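The windowed motif search and the chance estimate described above can be sketched as follows. This is a hedged illustration, not the code used in this study. Several details are assumptions: the sequence is oriented 5' to 3' along the gene being analysed, the TSS sits at a known index of that sequence and is treated as position 0, a hit is counted when the motif starts inside the window, and the chance probability treats start positions as independent. The names `CPES`, `find_cpes` and `chance_probability` are hypothetical.

```python
import re

# Each element: allowed bases per motif position, then the search window
# (first and last allowed start position relative to the TSS).
CPES = {
    "TATA":  ("T A T A AT A AGT", -40, -20),
    "CCAAT": ("AG AG C C A A T ACG AG", -108, 9),
    "DPE":   ("AGT CG AT CT ACG CT", 24, 34),
    "INR":   ("CT CT A ACGT AT CT CT", -15, 15),
    "BRE":   ("CG CG AG C G C C", -49, -18),
}

def find_cpes(seq, tss_index):
    """Return {element: [start positions relative to the TSS]} for motif
    occurrences that start inside each element's window."""
    seq = seq.upper()
    hits = {}
    for name, (spec, lo, hi) in CPES.items():
        pattern = "".join(f"[{bases}]" for bases in spec.split())
        # A lookahead reports overlapping occurrences as separate matches.
        regex = re.compile(f"(?=({pattern}))")
        positions = [m.start() - tss_index for m in regex.finditer(seq)]
        hits[name] = [p for p in positions if lo <= p <= hi]
    return hits

def chance_probability(name, base_freq, n_starts=None):
    """Probability of at least one chance occurrence in the window, assuming
    independent bases and independent start positions (a simplification)."""
    spec, lo, hi = CPES[name]
    p_site = 1.0
    for bases in spec.split():
        p_site *= sum(base_freq[b] for b in bases)
    if n_starts is None:
        n_starts = hi - lo + 1
    return 1 - (1 - p_site) ** n_starts

# Toy promoter: a TATA box starting 30 bp upstream of a TSS at index 100.
toy = "G" * 70 + "TATAAAA" + "G" * 60
print(find_cpes(toy, tss_index=100))
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(chance_probability("TATA", uniform))
```

For a bidirectional promoter, the same scan would be applied twice: once to the sequence as given for the plus-strand gene, and once to its reverse complement for the minus-strand gene.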
Gene Expression Atlas 2 data were obtained from the UCSC Human Genome Browser. The dataset consists of expression data for 79 human tissues produced by the Genomics Institute of the Novartis Research Foundation (GNF). Expression values were compared to the median: ratios larger than 1 were classified as over-expression and ratios smaller than 1 as under-expression.
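The classification rule can be illustrated with a short sketch. The choice of reference median is an assumption here (the median over all supplied genes), and the function name and the toy values are hypothetical; they are not taken from the GNF dataset.

```python
from statistics import median

def classify_expression(levels):
    """Label each gene by the ratio of its expression level to the median
    of all supplied levels (assumed reference; see the note above)."""
    reference = median(levels.values())
    labels = {}
    for gene, level in levels.items():
        ratio = level / reference
        if ratio > 1:
            labels[gene] = "over-expression"
        elif ratio < 1:
            labels[gene] = "under-expression"
        else:
            labels[gene] = "median"
    return labels

# Hypothetical expression levels for three genes in one tissue.
labels = classify_expression({"geneA": 10.0, "geneB": 20.0, "geneC": 40.0})
print(labels)
```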
Tag density for RNA polymerase II binding sites was calculated as the total number of Pol II tags divided by the number of promoters.
CAGE tags are available at the Riken website http://fantom.gsc.riken.go.jp/. The dataset contains CAGE tags at 1,057,486 positions in the hg17 assembly. After converting the genomic coordinates of bidirectional promoters from the hg18 to the hg17 assembly with liftOver, we mapped the CAGE data to the bidirectional promoters.
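Once tag positions and promoter coordinates are in the same assembly, mapping tags to promoters reduces to counting positions inside intervals. The sketch below illustrates one way to do this with sorted position lists; it is not the authors' implementation. The function names, the 50 bp window around each TSS and the rule that both TSSs need at least one tag are assumptions made for illustration, and the coordinate conversion step itself is not shown.

```python
from bisect import bisect_left, bisect_right

def count_tags(sorted_positions, start, end):
    """Number of tag positions p with start <= p <= end.
    `sorted_positions` must be sorted in ascending order."""
    return bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)

def both_tss_supported(minus_tags, plus_tags, left_tss, right_tss, window=50):
    """True if the left (minus-strand) and right (plus-strand) TSSs each have
    at least one same-strand tag within `window` bp (assumed criterion)."""
    left = count_tags(minus_tags, left_tss - window, left_tss + window)
    right = count_tags(plus_tags, right_tss - window, right_tss + window)
    return left > 0 and right > 0

def tag_density(tag_counts):
    """Total number of tags divided by the number of promoters."""
    return sum(tag_counts) / len(tag_counts)

# Hypothetical tag positions on one chromosome, split by strand.
plus_tags = [100, 1040, 1050, 3000]
minus_tags = [90, 480, 500]
print(both_tss_supported(minus_tags, plus_tags, left_tss=500, right_tss=1000))
print(tag_density([2, 0, 4]))
```

The same interval counting applies to Pol II ChIP-seq tags, from which the tag density per promoter set follows directly.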
This work was supported by the Intramural Program of the National Human Genome Research Institute.
This article has been published as part of BMC Genomics Volume 9 Supplement 2, 2008: IEEE 7th International Conference on Bioinformatics and Bioengineering at Harvard Medical School. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S2
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.