Links between core promoter and basic gene features influence gene expression
© Moshonov et al; licensee BioMed Central Ltd. 2008
Received: 09 October 2007
Accepted: 25 February 2008
Published: 25 February 2008
Diversity in rates of gene expression is essential for basic cell functions and is controlled by a variety of intricate mechanisms. Revealing general mechanisms that control gene expression is important for understanding normal and pathological cell functions and for improving the design of expression systems. Here we analyzed the relationship between general features of genes and their contribution to expression levels.
Genes were divided into four groups according to their core promoter type, and their characteristics were analyzed statistically. Surprisingly, we found that small variations in the TATA box are linked to large differences in gene length. Genes containing a canonical TATA box are generally short, whereas long genes are associated with either non-canonical TATA or TATA-less promoters. These differences in gene length are primarily determined by the size and number of introns. Generally, gene expression was found to be tightly correlated with the strength of the TATA box. However, significant reductions in gene expression levels were linked with long TATA-containing genes (canonical and non-canonical), whereas intron length hardly affected the expression of TATA-less genes. Interestingly, features associated with high translation are prevalent in TATA-containing genes, suggesting that their protein production is also more efficient.
Our results suggest that interplay between core promoter type and gene size can generate significant diversity in gene expression.
The wide range of gene expression levels in the cell is strictly controlled by a variety of complicated mechanisms. Major contributors to gene expression rates are DNA regulatory elements that vary between individual genes. Among these sequences, key roles are played by specific combinations of enhancer elements and their binding factors. Revealing additional features that influence gene expression levels is central to understanding how cells work and what changes in disease states.
A common element important to the overall expression level of a gene is the core promoter. The core promoter is located around the transcription start point and serves as the site of pre-initiation complex formation by RNA polymerase II and general transcription factors. The TATA box is a highly conserved core promoter element occurring in approximately 20–30% of protein-encoding genes in various species [2–6]. The presence or absence of a TATA box in core promoters has been linked in yeast [2, 7–10] and human to two pathways of pre-initiation complex assembly, one that is TFIID dependent (weak TATA or TATA-less) and another that is TFIID independent and SAGA dependent. A number of studies have demonstrated that the presence or absence of a TATA box is also correlated with additional traits. For example, genes with a TATA box are overrepresented among inducible, stress response, developmentally regulated and tissue-specific genes [2, 12], and TATA-containing genes also show increased divergence among species, most likely due to a higher rate of evolution. Most recently, the presence or absence of a TATA box was linked to differential regulation by transcription elongation factors, and the TATA box was found to be associated with higher sensitivity of gene expression to mutations. Thus the core promoter is associated with features that extend far beyond transcription initiation per se. In the present study we investigated possible connections between the core promoter, structural features of genes and levels of gene expression.
Analysis of genes with canonical, weak and TATA-less promoters
[Table: Analysis of DBTSS genes for core promoter type. Column: No. of genes.]
TATA-containing genes are enriched with specific biological functions
[Table: Enrichment of functional categories of genes according to core promoter type. Categories: nucleosome and chromatin assembly; response to wounding; response to external stimulus.]
Gene length is inversely correlated with the strength of the TATA box
Gene length differences are determined by introns
The difference in size between genes with different core promoters could be due to their mature mRNA (exons), their introns or both. To examine this, we compared the lengths of mRNAs and introns in the TATA, TATA-1, TATA-2 and TATA-less groups. The results show small but highly significant differences in median mRNA length between the TATA-containing groups (Fig. 1B and 1D). These differences reflect the extension of the 5' and 3' untranslated regions of the non-canonical TATA and TATA-less groups (see below) rather than of the coding sequences. However, large differences are seen in total intron length, up to 3.2 fold (TATA-2 vs. TATA, p = 1.2 × 10^-45) (Fig. 1C and 1D), and total intron length is negatively correlated with the strength of the TATA box. The total intron length of the TATA-less group is also greater than that of the TATA group and is similar to that of the TATA-2 group. Thus intron length is the main variable contributing to the difference in gene size between the groups.
The influence of core promoter and intron size on gene expression
While the core promoter is well known for its critical role in transcription, much less is known about the impact of intron size on gene expression. A previous analysis revealed that short introns are correlated with high expression levels. However, given the association of small intron size with TATA promoters, the high level of gene expression could result from the presence of a TATA promoter, the short size of introns or both. To assess the relative contributions of core promoter and intron length to gene expression, we retrieved expression data for the different gene groups from the GNF Gene Expression Atlas, which has expression data from 78 human cell types. Since we wished to estimate the impact of intron size on gene expression, intron-less genes were not included in this analysis.
The overall level of expression correlates well with the strength of the TATA box: expression increases gradually as the promoter sequence conforms more closely to the TATA box consensus (Fig. 3A). Thus it appears that the strength of the TATA box is highly significant for gene expression, a result consistent with previous studies showing that most variations of the TATA consensus reduce transcription of reporter genes in vitro and in vivo [16–18].
Median expression values of short (intron length <8000 nt) or long (intron length >8000 nt) genes grouped according to core promoter type. n is the number of genes in each set, Fold is the magnitude of the difference between short and long, and p is the p-value of the difference.
Correlation between gene expression (average and maximum) and intron length for each gene set. Number of genes: TATA 385, TATA-1 476, TATA-2 2581, TATA-less 5663.
In addition, the advantage of a TATA core promoter for expression is greater for genes with short introns than for genes with long introns. For instance, the median expression of short-intron genes in the TATA group is 3–4 fold higher than that of short-intron genes from the TATA-2 or TATA-less groups, but this advantage is reduced to only 1.6 fold in long genes. It therefore appears that for the TATA groups (TATA, TATA-1, TATA-2) both the strength of the TATA box and the intron length strongly influence expression. By contrast, genes with TATA-less promoters are weakly expressed regardless of their intron length. It is likely that these genes have specific features that overcome the negative effect of long introns.
Translational features of gene groups
The statistical analysis of structural and functional features of mammalian genes associated with core promoter variants revealed surprisingly close ties between different steps of gene expression, from transcription initiation and elongation to protein production. These findings further extend the links found in yeast between core promoter type and features that are seemingly not directly associated with transcription initiation, such as gene function, evolution rates and sensitivity of gene expression to mutations.
The most remarkable observation of the statistical analysis of genes divided according to their core promoters is that small variations relative to the TATA box consensus are associated with large differences in gene length as a consequence of intron size and number. Specifically, genes containing a canonical TATA box were found to be significantly shorter than genes bearing non-canonical TATA or TATA-less promoters. Analysis of expression of genes in the different groups highlighted several points. First, the level of expression is highly correlated with the strength of the TATA box, confirming previous gene-specific studies showing that deviations from the TATA consensus reduce transcription [16–18]. Second, the expression level of the genes is affected by the type of core promoter in a length-dependent manner: long genes generally display lower levels of expression than short genes. However, the negative effect of gene length is correlated with the strength of the TATA element, such that genes with a canonical TATA box are strongly suppressed, genes bearing a non-canonical TATA box are moderately inhibited and TATA-less genes are hardly affected by length. Third, it appears that having a TATA box in the core promoter is mostly beneficial for expression of short genes, the advantage of a strong TATA box being diminished in long genes. We therefore propose that substantial variations in gene expression levels can be achieved through different combinations of TATA promoters with varying intron lengths. A TATA-less promoter, by contrast, ensures similar levels of expression regardless of gene length.
Given the high cost of transcription, extending intron size in higher eukaryotes must be beneficial. Varying intron size and core promoter type may be an economical way of using the same cellular constituents to modulate gene expression levels. In addition, large introns with small exons may serve to reduce the mutation rate in coding sequences (dilution effect). This possibility is consistent with the observations from yeast that TATA-containing genes evolve at a higher rate [13, 14].
The mechanistic basis underlying the links between core promoter and gene length remains to be investigated but is likely to involve RNA polymerase II, an entity present both in the core promoter and throughout the gene. Our previous findings that core promoter variants dictate recruitment of different elongation factors upon activation by NF-κB, through formation of distinct pre-initiation complexes, provide an initial molecular basis for the findings reported here.
Selection of genes and criteria for inclusion in specific core promoter group
The 14,728 H. sapiens genes appearing in the DBTSS version 6, with experimentally-determined transcription start site (TSS), were used in the current study. The genes were divided into groups according to their core promoter type. Core promoters with a minimal canonical TATA box (TATA), a TATA-box with one mismatch (TATA-1), a TATA-box with two mismatches (TATA-2) and the remainder of the genes (TATA-less) comprised the four gene groups. Our criteria demanded that the TATA motif (TATAWA) and its alternatives were strictly located between -40 and -15 with respect to the TSS. Using the 'pattern matching' tool of the Regulatory Sequence Analysis Tools (available on the internet) we were able to direct each gene to its appropriate group. The TATA, TATA-1, TATA-2 and TATA-less groups finally comprised 527, 694, 3916 and 9491 genes respectively.
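The grouping rule described above can be illustrated with a minimal Python sketch. It is not the RSAT 'pattern matching' tool used in the study. The function names, the convention that the input sequence ends at position -1, and the choice to count a failure at the degenerate W position as one mismatch are all assumptions made for illustration.

```python
# Hypothetical re-implementation of the grouping rule; not the RSAT tool.
TATA_MOTIF = "TATAWA"  # motif from the text; W is the IUPAC code for A or T
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT"}


def mismatches(site, motif=TATA_MOTIF):
    """Count positions of `site` that do not satisfy the IUPAC motif."""
    return sum(base not in IUPAC[code] for base, code in zip(site, motif))


def classify_promoter(upstream, start=-40, end=-15):
    """Assign a promoter to TATA, TATA-1, TATA-2 or TATA-less.

    `upstream` is assumed to end at position -1, so its last base lies
    immediately 5' of the TSS. Only 6-mers lying entirely between
    `start` and `end` are considered.
    """
    seq = upstream.upper()
    n = len(seq)
    if n < -start:
        raise ValueError("need at least 40 nt upstream of the TSS")
    k = len(TATA_MOTIF)
    best = k
    # Position p (negative) maps to index n + p; the last 6-mer starts at -20.
    for p in range(start, end - k + 2):
        best = min(best, mismatches(seq[n + p:n + p + k]))
    if best == 0:
        return "TATA"
    if best <= 2:
        return f"TATA-{best}"
    return "TATA-less"
```

For example, a 50-nt upstream sequence with TATAAA starting at position -30 falls in the TATA group, whereas the same motif starting at position -10 lies outside the window and the gene is classified as TATA-less.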
Measurements of basic gene features
Genes were classified according to their function using the gene-annotation enrichment analysis (DAVID Bioinformatics Resources 2007). The data for the various gene features were retrieved from the UCSC genome browser March 2006 assembly. In this browser we used the 'Tables' tool to obtain coordinates of different tracks (e.g. gene start, gene end, exon count, UTR exons, etc.), from which the following features were calculated with the Excel program: gene length, mRNA length, intron length, exon number, 5' UTR length and 3' UTR length. The UCSC Table Browser was also used to retrieve DNA sequences of the translation initiation site of each mRNA, which were scanned for the presence of the Kozak motif (RNNAUGG).
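The feature calculations were done in Excel. As a rough equivalent, the Python sketch below derives gene, mRNA and intron lengths from transcript and exon coordinates and tests a start-codon context for the Kozak motif. The function names are hypothetical. The coordinates are assumed to be 0-based and half-open, as in UCSC tables. The input to `has_kozak` is assumed to be exactly the 7 nt from -3 to +4 around the start codon.

```python
import re


def gene_features(tx_start, tx_end, exon_starts, exon_ends):
    """Basic length features from 0-based, half-open coordinates."""
    gene_length = tx_end - tx_start
    # mRNA length is the summed length of all exons.
    mrna_length = sum(e - s for s, e in zip(exon_starts, exon_ends))
    return {
        "gene_length": gene_length,
        "mrna_length": mrna_length,
        # Total intron length is whatever the exons do not cover.
        "intron_length": gene_length - mrna_length,
        "exon_number": len(exon_starts),
    }


# RNNAUGG: R is a purine (A or G), N is any nucleotide.
KOZAK = re.compile(r"[AG][ACGU]{2}AUGG")


def has_kozak(context):
    """True if the 7-nt context (-3 to +4) matches RNNAUGG.

    DNA input is accepted; T is converted to U before matching.
    """
    rna = context.upper().replace("T", "U")
    return KOZAK.fullmatch(rna) is not None
```

A gene spanning 1000 to 6000 with exons of 200, 300 and 500 nt would have a gene length of 5000 nt, an mRNA length of 1000 nt and a total intron length of 4000 nt.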
Gene expression data (gcRMA) for each group of genes were downloaded from SymAtlas v1.2.4 (available on the internet at the Novartis Research Foundation site). For each gene only the major probe set was used. Expression values below 200 were considered background and omitted from the analysis. The average and maximum expression values of each gene were calculated. Genes were then divided into two sets according to their intron size (less or more than 8000 nt), and the median and the 25% and 75% quartiles of average expression of each set were determined and are presented as boxplots.
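The expression processing described above can be sketched in Python as follows. It is an illustration under stated assumptions, not the original workflow. The function names are hypothetical. Background filtering is assumed to apply to individual values rather than whole genes. Genes with exactly 8000 nt of intron are placed in the long set. The quartiles use the 'inclusive' interpolation method, which may differ from the one used for the published boxplots.

```python
from statistics import mean, median, quantiles

BACKGROUND = 200      # values below this are treated as background (from the text)
INTRON_CUTOFF = 8000  # nt; boundary between the short and long sets (from the text)


def summarize_gene(values):
    """Return (average, maximum) of above-background values, or None."""
    kept = [v for v in values if v >= BACKGROUND]
    if not kept:
        return None
    return mean(kept), max(kept)


def split_by_intron_length(genes):
    """Collect average expression per set.

    `genes` is an iterable of (intron_length, expression_values) pairs.
    Intron-less genes and genes with no usable values are skipped.
    """
    sets = {"short": [], "long": []}
    for intron_length, values in genes:
        summary = summarize_gene(values)
        if summary is None or intron_length == 0:
            continue
        key = "short" if intron_length < INTRON_CUTOFF else "long"
        sets[key].append(summary[0])
    return sets


def box_stats(averages):
    """25% quartile, median and 75% quartile (needs at least two values)."""
    q1, q2, q3 = quantiles(averages, n=4, method="inclusive")
    return q1, q2, q3


def fold_difference(short, long_):
    """Ratio of the median expression of the short set to that of the long set."""
    return median(short) / median(long_)
```

Applying `box_stats` to the short and long sets of each promoter group gives the three values a boxplot displays, and `fold_difference` gives a short-versus-long ratio of medians.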
Statistical analyses of gene features
The range of measurements for each gene feature within the groups was wide, making it statistically inappropriate to compare their means, particularly since we did not wish to exclude any gene that did not appear to fit the general observed pattern. We therefore determined the median and the 25% and 75% quartile values for the different features. We measured the significance of the differences between the median values (6 groups) by the Kruskal-Wallis test with the Bonferroni correction using the Statistics Toolbox of the MATLAB program (The MathWorks). Spearman's rank correlation analysis between expression levels and intron size was performed using the MATLAB program. The likelihood of there being significant differences between the groups regarding the absence of introns and the presence of the Kozak motif sequences was calculated using the Chi-square test.
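The analyses themselves were run in MATLAB. For readers who want to follow the logic, the sketch below implements two of the tests in plain Python. It assumes that the Kruskal-Wallis test was applied to each pair of groups, which gives six comparisons among four groups, and that the Bonferroni correction multiplies each p-value by that number. This is an interpretation of the text, not a confirmed detail, and the function names are hypothetical. The Chi-square test is not covered.

```python
import math
from itertools import combinations


def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_two(a, b):
    """Tie-corrected Kruskal-Wallis H and p-value for two groups (df = 1).

    Assumes the pooled values are not all identical.
    """
    pooled = list(a) + list(b)
    n = len(pooled)
    r = rank(pooled)
    total = sum(r[:len(a)]) ** 2 / len(a) + sum(r[len(a):]) ** 2 / len(b)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    h /= 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    h = max(h, 0.0)
    # Chi-square survival function with one degree of freedom.
    return h, math.erfc(math.sqrt(h / 2))


def pairwise_bonferroni(groups):
    """Bonferroni-adjusted p-values for every pair of named groups."""
    pairs = list(combinations(groups, 2))
    return {
        (g1, g2): min(1.0, kruskal_two(groups[g1], groups[g2])[1] * len(pairs))
        for g1, g2 in pairs
    }


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sx = math.sqrt(sum((p - mx) ** 2 for p in rx))
    sy = math.sqrt(sum((q - my) ** 2 for q in ry))
    return cov / (sx * sy)
```

With four promoter groups passed as a dictionary, `pairwise_bonferroni` returns six adjusted p-values, one per pair. `spearman` can be applied to intron length and average or maximum expression within each group.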
We thank Dr. Hillary Voet (Hebrew University, Israel) for her advice and assistance in the statistical analyses, Dr. Shalev Itzkovitz (Weizmann Institute of Science, Israel) for his guidance in use of the MATLAB program, Ofer Rahat (Weizmann Institute of Science, Israel) and Eliezer Dikstein for assistance in programming. This work was supported by a grant from the Israel Science Foundation.
1. Smale ST, Kadonaga JT: The RNA polymerase II core promoter. Annu Rev Biochem. 2003, 72: 449-479. 10.1146/annurev.biochem.72.121801.161520.
2. Basehoar AD, Zanton SJ, Pugh BF: Identification and distinct regulation of yeast TATA box-containing genes. Cell. 2004, 116 (5): 699-709. 10.1016/S0092-8674(04)00205-3.
3. Gershenzon NI, Ioshikhes IP: Synergy of human Pol II core promoter elements revealed by statistical sequence analysis. Bioinformatics. 2005, 21 (8): 1295-1300. 10.1093/bioinformatics/bti172.
4. Kim TH, Barrera LO, Zheng M, Qu C, Singer MA, Richmond TA, Wu Y, Green RD, Ren B: A high-resolution map of active promoters in the human genome. Nature. 2005, 436 (7052): 876-880. 10.1038/nature03877.
5. Ohler U, Liao GC, Niemann H, Rubin GM: Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002, 3 (12): RESEARCH0087. 10.1186/gb-2002-3-12-research0087.
6. Yang C, Bolotin E, Jiang T, Sladek FM, Martinez E: Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene. 2007, 389 (1): 52-65. 10.1016/j.gene.2006.09.029.
7. Cheng JX, Floer M, Ononaji P, Bryant G, Ptashne M: Responses of four yeast genes to changes in the transcriptional machinery are determined by their promoters. Curr Biol. 2002, 12 (21): 1828-1832. 10.1016/S0960-9822(02)01257-5.
8. Huisinga KL, Pugh BF: A genome-wide housekeeping role for TFIID and a highly regulated stress-related role for SAGA in Saccharomyces cerevisiae. Mol Cell. 2004, 13 (4): 573-585. 10.1016/S1097-2765(04)00087-5.
9. Li XY, Bhaumik SR, Zhu X, Li L, Shen WC, Dixit BL, Green MR: Selective recruitment of TAFs by yeast upstream activating sequences. Implications for eukaryotic promoter structure. Curr Biol. 2002, 12 (14): 1240-1244. 10.1016/S0960-9822(02)00932-6.
10. Mencia M, Moqtaderi Z, Geisberg JV, Kuras L, Struhl K: Activator-specific recruitment of TFIID and regulation of ribosomal protein genes in yeast. Mol Cell. 2002, 9 (4): 823-833. 10.1016/S1097-2765(02)00490-2.
11. Amir-Zilberstein L, Ainbinder E, Toube L, Yamaguchi Y, Handa H, Dikstein R: Differential regulation of NF-kappaB by elongation factors is determined by core promoter type. Mol Cell Biol. 2007, 27 (14): 5246-5259. 10.1128/MCB.00586-07.
12. Schug J, Schuller WP, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ: Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol. 2005, 6 (4): R33. 10.1186/gb-2005-6-4-r33.
13. Tirosh I, Weinberger A, Carmi M, Barkai N: A genetic signature of interspecies variations in gene expression. Nat Genet. 2006, 38 (7): 830-834. 10.1038/ng1819.
14. Landry CR, Lemos B, Rifkin SA, Dickinson WJ, Hartl DL: Genetic properties influencing the evolvability of gene expression. Science. 2007, 317 (5834): 118-121. 10.1126/science.1140247.
15. Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA: Selection for short introns in highly expressed genes. Nat Genet. 2002, 31 (4): 415-418.
16. Blake WJ, Balazsi G, Kohanski MA, Isaacs FJ, Murphy KF, Kuang Y, Cantor CR, Walt DR, Collins JJ: Phenotypic consequences of promoter-mediated transcriptional noise. Mol Cell. 2006, 24 (6): 853-865. 10.1016/j.molcel.2006.11.003.
17. Hoopes BC, LeBlanc JF, Hawley DK: Contributions of the TATA box sequence to rate-limiting steps in transcription initiation by RNA polymerase II. J Mol Biol. 1998, 277 (5): 1015-1031. 10.1006/jmbi.1998.1651.
18. Wobbe CR, Struhl K: Yeast and human TATA-binding proteins have nearly identical DNA sequence requirements for transcription in vitro. Mol Cell Biol. 1990, 10 (8): 3859-3867.
19. Futcher B, Latter GI, Monardo P, McLaughlin CS, Garrels JI: A sampling of the yeast proteome. Mol Cell Biol. 1999, 19 (11): 7357-7368.
20. Griffin TJ, Gygi SP, Ideker T, Rist B, Eng J, Hood L, Aebersold R: Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae. Mol Cell Proteomics. 2002, 1 (4): 323-333. 10.1074/mcp.M200001-MCP200.
21. Gygi SP, Rochon Y, Franza BR, Aebersold R: Correlation between protein and mRNA abundance in yeast. Mol Cell Biol. 1999, 19 (3): 1720-1730.
22. Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science. 2001, 292 (5518): 929-934. 10.1126/science.292.5518.929.
23. Tian Q, Stepaniants SB, Mao M, Weng L, Feetham MC, Doyle MJ, Yi EC, Dai H, Thorsson V, Eng J, Goodlett D, Berger JP, Gunter B, Linseley PS, Stoughton RB, Aebersold R, Collins SJ, Hanlon WA, Hood LE: Integrated genomic and proteomic analyses of gene expression in Mammalian cells. Mol Cell Proteomics. 2004, 3 (10): 960-969. 10.1074/mcp.M400055-MCP200.
24. Kozak M: Initiation of translation in prokaryotes and eukaryotes. Gene. 1999, 234 (2): 187-208. 10.1016/S0378-1119(99)00210-3.
25. Kozak M: Determinants of translational fidelity and efficiency in vertebrate mRNAs. Biochimie. 1994, 76 (9): 815-821. 10.1016/0300-9084(94)90182-1.
26. Kozak M: Recognition of AUG and alternative initiator codons is augmented by G in position +4 but is not generally affected by the nucleotides in positions +5 and +6. Embo J. 1997, 16 (9): 2482-2492. 10.1093/emboj/16.9.2482.
27. Kozak M: Structural features in eukaryotic mRNAs that modulate the initiation of translation. J Biol Chem. 1991, 266 (30): 19867-19870.