Eukaryotic genomes may exhibit up to 10 generic classes of gene promoters
© Gagniuc and Ionescu-Tirgoviste; licensee BioMed Central Ltd. 2012
Received: 13 May 2012
Accepted: 13 September 2012
Published: 28 September 2012
The main function of gene promoters appears to be the integration of different gene products in their biological pathways in order to maintain homeostasis. Generally, promoters have been classified in two major classes, namely TATA and CpG. Nevertheless, many genes using the same combinatorial formation of transcription factors have different gene expression patterns. Accordingly, we asked some fundamental questions: Why do certain genes have an overall predisposition for higher gene expression levels than others? What causes such a predisposition? Is there a structural relationship of these sequences in different tissues? Is there a strong phylogenetic relationship between promoters of closely related species?
In order to gain valuable insights into different promoter regions, we obtained a series of image-based patterns which allowed us to identify 10 generic classes of promoters. A comprehensive analysis was undertaken for promoter sequences from Arabidopsis thaliana, Drosophila melanogaster, Homo sapiens and Oryza sativa, together with a more extensive analysis of tissue-specific promoters in humans. We observed a clear preference for these species to use certain classes of promoters for specific biological processes. Moreover, in humans, we found that different tissues use distinct classes of promoters, reflecting an emerging promoter network. Depending on the tissue type, comparisons made between these classes of promoters reveal a complementarity between their patterns, whereas some other classes of promoters have been observed to occur in competition. Furthermore, we also noticed the existence of some transitional states between these classes of promoters that may explain certain evolutionary mechanisms, which suggest a possible predisposition for specific levels of gene expression and perhaps for a different number of factors responsible for triggering gene expression. Our conclusions are based on comprehensive data from three different databases and a new computer model whose core uses the Kappa Index of Coincidence.
To fully understand the connections between gene promoters and gene expression, we analyzed thousands of promoter sequences using our Kappa Index of Coincidence method and a specialized Optical Character Recognition (OCR) neural network. Under our criteria, 10 classes of promoters were detected. In addition, the existence of “transitional” promoters suggests that there is an evolutionary weighted continuum between classes, depending perhaps upon changes in their gene products.
Promoters have guided evolution for millions of years. It appears that they were the main engine responsible for the integration of different mutations favorable for the environmental conditions. Promoters are critical regions for gene regulation in complex genomes and are located upstream of the TSS (Transcription Start Site). A typical promoter region is composed of a core promoter and regulatory domains[2, 3]. The structure of a promoter is recognized by the presence of known promoter elements, such as the TATA box, GC-box, CCAAT-box, BRE and INR box[4–12]. Therefore, accurate recognition of a promoter structure relies on a comprehensive list of promoter elements. Nevertheless, using these promoter elements for classification has proven to be difficult and perhaps even disadvantageous for different functional correlations between promoter sequences. From an evolutionary standpoint, within non-coding regulatory regions, nucleotides can change their order more frequently and these binding sites often become very small and unstable. Previous approaches to promoter classification include motif sequences and other structural parameters, such as DNA curvature, bendability, stability, nucleosome positioning or comparison of various DNA sequences[14–19]. Currently, promoters from vertebrates are classified into two major classes, namely TATA and CpG types, while in mammals there is a subclassification into TATA box–enriched and CpG-rich promoters. In order to investigate possible interactions between different biological processes, we found that an overall correlation between DNA sequence features among promoter regions may be an alternative method. In this context, we have chosen a different approach to classify promoter sequences by using two-dimensional patterns obtained through Kappa Index of Coincidence (Kappa IC) and (C + G)% values[21–24]. This classification is mainly done by considering the shape and density of these promoter patterns.
In this study, we explore the structural properties of these patterns and we search for correlations between promoter sequences of several different species. Genome sequencing has led to the development of many bioinformatic methods for accurate recognition and extraction of promoter sequences. A number of experimental approaches to compile TSSs on a genome-wide scale have been developed, including the Eukaryotic Promoter Database[25, 26] and PlantProm Database. We used these databases and focused our attention on 20,586 promoter sequences from Arabidopsis thaliana, Drosophila melanogaster, Homo sapiens and Oryza sativa. In humans we were also interested in promoters of genes that are expressed preferentially in certain tissues. Several studies converged on characterizing patterns of tissue-specific gene expression, including the TiGER (Tissue-specific Gene Expression and Regulation) database[28–30], which contains comprehensive information about human tissue-specific gene expression profiles. We have used the TiGER database list of tissue-specific genes to determine the proportion of each promoter class in 30 tissues. This allowed us to identify certain relations between promoter sequences and different biological processes.
We first investigated whether some promoter patterns occur more often than others. Secondly, we determined which of these patterns are more common in certain species and whether their distribution may have some evolutionary implications. In the third analysis we examined the distribution of these promoter classes among human tissues.
AT-based promoters. AT-based representative patterns are distinguished by high (A + T)% and Kappa IC values. The left side of the pattern is predominant, while the right side is significantly less pronounced. The shape of this pattern exhibits various different lengths of short poly(dA:dT) homopolymer tracts (Figure 1C). AT-based patterns are characteristic for gene promoters from Drosophila melanogaster and Arabidopsis thaliana and are less common in humans.
CG-based promoters. These promoters are represented by patterns containing a high percentage of C + G and high Kappa IC values. CG-based promoters show a high CpG content. The right side of the pattern is predominant while the left side is significantly less pronounced (Figure 1A). The shape of this pattern exhibits various lengths of short poly(dC:dG) homopolymer tracts. In addition, the average frequency of occurrence between AT-based and CG-based promoters appears to differ completely in these species, but curiously, these promoters tend to be in a relative opposition in each species (Figure 2A,B). This observation suggests that these species have different preferences for allocation of certain fundamental functions. Patterns of this class are particularly characteristic for genes from Homo sapiens.
ATCG-compact promoters. ATCG-compact patterns characterize promoters with centrally disposed clusters, leading to the formation of a round shaped pattern (Figure 1D). The middle-lower region of the pattern contains evenly interspersed nucleotides (A,T,C,G ≈ 25%) and the middle-upper area shows different lengths of short homopolymer tracts (poly(dA), poly(dT), poly(dC), poly(dG)) disposed in tandem in any order. ATCG-compact patterns are characteristic for gene promoters from Arabidopsis thaliana.
ATCG-balanced promoters. Promoter sequences belonging to ATCG-balanced class show an almost balanced G + C and A + T content. The right and the left side of the pattern tend to share a relative 2-fold rotational symmetry. These patterns are generally composed of equally distributed short poly(dA:dT) and poly(dC:dG) homopolymer tracts (Figure 1B). ATCG-balanced and CG-spike promoters tend to occur in the same proportion in each species and appear to have almost similar average frequencies between species (Figure 2A,B). This observation indicates that for some specific functions the same classes of promoters are preferred between species. These patterns are characteristic for gene promoters from Homo sapiens and Oryza sativa.
ATCG-middle promoters. ATCG-middle patterns are characterized mainly by promoter sequences containing A + T and C + G balanced values and higher than average Kappa IC values. The right side and the left side of the pattern are equally distributed. However, the central part is pronounced. They are similar to ATCG-balanced class in that they also have a relative 2-fold rotational symmetry, but contain additional short homopolymer tracts (poly(dA), poly(dT), poly(dC), poly(dG)) disposed in tandem in any order (Figure 1E). These patterns are rare and are almost equally distributed in all four species.
ATCG-less promoters. Promoters from this class are represented by an abrupt transition between two C + G threshold levels. Similar to ATCG-balanced promoters, the right side and the left side of the pattern are equally distributed; however, some sequences around the central region are missing or have a lower density. Typically, these central regions lack tandem short homopolymer tracts and short sequences consisting of equally interspersed nucleotides (A,T,C,G ≈ 25%), or short sequences showing small variations over 50% in favor of A + T or C + G nucleotides (Figure 1F). Based on the promoter sequence features, these promoter patterns seem to be complementary with ATCG-middle promoters. ATCG-less patterns are significantly rare (an overall frequency between species of 0.10% - 0.16%) and are characteristic for promoters from Homo sapiens and Oryza sativa but are almost absent in Drosophila melanogaster and Arabidopsis thaliana.
AT-less promoters. Promoter sequences belonging to AT-less class exhibit a high frequency of short CG-rich sequences. Although both sides of the pattern show a relative 2-fold rotational symmetry, the clusters from the left side of the pattern exhibit a lower density than those on the right. These patterns are characterized by a large number of short poly(dC:dG) tracts and a lower number of short poly(dA:dT) tracts (Figure 1G). Short poly(dA:dT) tracts typically occur as a consequence of an abrupt depletion of C + G nucleotides on short distances (30b–60b) inside the promoter sequence. Such a depletion is accompanied by high Kappa IC values and is typically present near TSS (± 200b), suggesting a regular expression of their genes. AT-less patterns are generally rare and are found equally in all four species, but are slightly more frequent in Homo sapiens.
CG-less promoters. In contrast, CG-less promoters are distinguished by a high frequency of short AT-rich sequences and are more common in Oryza sativa and Arabidopsis thaliana. The right and left side of the pattern tend to be equally distributed; however, the clusters from the right side of the pattern exhibit a lower density than those on the left. AT-less and CG-less promoters seem to be characterized by an imbalance between the number of short poly(dA:dT) tracts and short poly(dC:dG) tracts. Complementary to AT-less promoter characteristics, these patterns are characterized by a large number of short poly(dA:dT) tracts and a much lower number of short poly(dC:dG) tracts (Figure 1I). Compared with AT-less promoters, the overall preference for CG-less promoters is very high between species. However, in Homo sapiens the number of AT-less promoters slightly exceeds the number of CG-less promoters (Figure 2A).
AT-spike promoters. Promoter sequences belonging to AT-spike class are represented by long repetitive sequences with a high content of A or T nucleotides. These patterns exhibit a central part and an elongated left side containing small density clusters. The shape of AT-spike representative patterns is explained by the presence of long poly(dA) or long poly(dT) homopolymer tracts or tandem short poly(dA) or short poly(dT) tracts (Figure 1J). These promoters are prevalent in Arabidopsis thaliana.
CG-spike promoters. In contrast to AT-spike promoter architecture, these promoters are represented by long repetitive sequences with a high content of C or G nucleotides. CG-spike patterns exhibit a central part and an elongated right side containing small density clusters. These patterns contain long poly(dC) or long poly(dG) homopolymer tracts or tandem short poly(dC) or short poly(dG) tracts (Figure 1H). AT-spike and CG-spike promoters seem to be complementary considering the fact that both promoter classes are differentiated by two opposite types of homopolymer tracts. AT-spike and CG-spike classes appear to be equally preferred between species, nevertheless, their promoters tend to be in opposition in each species (Figure 2B). This observation suggests a possible conservation of their antagonist role between these species, yet a different preference for certain functions. These patterns are common in Oryza sativa and Homo sapiens.
TATA-less and TATA-containing correlations
Tissue-specificity in humans
CG-based promoters have the highest percentage of occurrence (37.59%) and appear to be TATA-less class correspondents which tend to be associated with “housekeeping” genes. CG-based promoters are not only the most common but as expected they show the highest levels in all tissues. The first six tissues in which CG-based promoters have the highest percentages are cervix, skin, stomach, ovary, mammary gland and tongue (Additional file 4: Figure S10B online).
AT-based promoters (5.25%) are present in all tissues but are absent from the mammary gland. The first six tissues in which AT-based promoters have the highest percentages are liver, heart, kidney, lymph node, soft tissue and muscle. This order coincides with the first six tissues in which ATCG-compact promoters have the highest percentages, namely in prostate, liver, kidney, muscle, heart and lymph node. Equally curious, the last six tissues in which CG-based promoters have the lowest percentages are liver, uterus, kidney, heart, lung and brain (Additional file 4: Figure S10G and Figure S7B online). This implies a special relationship between CG-based and AT-based promoters because their proportions seem to indicate an almost antagonistic activity, which may suggest an involvement of these promoters in some metabolic processes. Nevertheless, the relationship between CG-based promoters and other classes of promoters in these tissues seems to conceal more than a simplistic association with the housekeeping genes.
AT-less promoters (14.36%) are overestimated in uterus while CG-less and ATCG-balanced promoters are overestimated in testis (Additional file 4: Figure S10E,F,H online).
CG-less promoters have an occurrence of 3.98% and are present in all tissues except spleen (Additional file 4: Figure S10F online).
There was no clear correlation regarding tissue order between AT-less and CG-less promoters. Nevertheless, we noticed that some tissues have a tendency to stay grouped, such as muscle and heart, stomach and soft tissue, larynx and colon, lymph node and liver or bone marrow and peripheral nervous system (Additional file 4: Figure S10E,F online). These groups may suggest a role of these promoters in simple feedback mechanisms among tissues responsible for maintaining homeostasis. Furthermore, the occurrence of short poly(dA:dT) tracts on short distances near TSS could also indicate an involvement of AT-less (and, by association, a complementary role for their CG-less counterpart) promoters in short term non-critical gene expression, which may strengthen our hypothesis regarding their physiological role. Moreover, in different tissues AT-less and CG-less percentages show a combined relationship of complementarity and proportionality (Figure 8C).
AT-spike promoters are found especially in tissues that require high levels of gene expression such as lung, eye, pancreas, uterus, liver, soft tissue, brain, kidney, prostate and blood. This tissue order and the presence of long poly(dA) or long poly(dT) tracts suggest an involvement of these promoters in survival mechanisms, possibly responsible for interactions with the environment.
CG-spike promoters also appear to be involved in survival mechanisms. These promoters are found in large numbers especially in tissues that need a short-term critical gene expression. This is supported by the order of the first seven tissues in which these promoters are most common, such as lung, eye, brain, peripheral nervous system, spleen, heart and blood, which also tend to have a high interaction with the environment (Additional file 4).
The proportions of CG-spike and AT-spike promoters seem to be similar in the first two tissues, namely in lung and eye. The occurrence of long poly(dA:dT) or tandem short poly(dA:dT) tracts on short distances (>30b) near TSS, could also indicate an involvement of AT-spike and CG-spike promoters in short term critical gene expression.
The percentage of occurrences between CG-based and AT-spike promoters appears to be relative and nearly complementary in all tissues (Figure 8A). Interestingly, the last two tissues in which AT-spike promoters have the lowest percentages and the first two tissues in which CG-based promoters have the highest percentages are cervix and skin (Additional file 4: Figure S10C,B).
The proportion of ATCG-compact and AT-less promoters seems to have similar values in tissues from kidney and lymph node whereas ATCG-compact and AT-based promoters appear to have similar values in bladder, skin and uterus (Figure 8B). ATCG-compact promoters tend to exhibit equal values in some tissues such as liver and kidney, brain and bone, heart and muscle. Interestingly, AT-based promoters show also equal values in these tissues but different than those found for ATCG-compact promoters (Additional file 4).
There was no clear correlation regarding the tissue order between ATCG-balanced and ATCG-compact promoters. However, ATCG-balanced and ATCG-compact promoters seem to have almost equal percentages in about 16 tissues. Both of these classes have the closest values in blood, bone, brain, cervix, colon, heart, muscle, skin and uterus (Additional file 4).
ATCG-less promoters are rare (0.03%) and are even more enigmatic since they are mainly represented in cervix and tongue (Additional file 4: Figure S10I online). In humans, from a total of 8,512 promoter sequences the percentage of ATCG-less promoters is close to 1.08%, whereas their frequency among the 2,369 promoters of tissue-specific genes is almost 0.03%. These results are not consistent with the ATCG-less expected frequency of 0.3%, which may suggest that most of their genes are silent (Additional file 4).
ATCG-middle promoters are present only in nine of the thirty tissues, namely in soft tissue, eye, pancreas, liver, placenta, bladder, muscle, larynx and bone marrow (Additional file 4: Figure S10J online). However, in humans, from a total of 8,512 promoter sequences the percentage of ATCG-middle promoters is close to 1.05%. Nevertheless, among the 2,369 promoters of tissue-specific genes the observed frequency is close to 0.22% whereas their expected frequency is 0.29%, which suggests that some of their genes are also silent. The difference between expected and observed frequencies and an overall low occurrence of genes containing ATCG-middle and ATCG-less promoters may suggest their involvement in anatomical development and in some other cell-related cycles. This observation is supported by several tests performed on promoters from the HOX gene family, namely HOXA and HOXB. These genes are represented mostly by patterns showing ATCG-middle characteristics (Additional file 5: Figure S13A-E and Figure S14A-E online). A broader analysis involving expected and observed frequencies for all classes of promoters is presented in our Additional file 6.
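The text does not spell out how the expected frequencies were derived. One reading that is consistent with the quoted numbers is that the overall human percentage of a class is scaled by the fraction of human promoters that belong to tissue-specific genes (2,369 of 8,512). The short Python sketch below checks that arithmetic; the function name `expected_percent` is hypothetical and the scaling rule is an assumption, not a statement of the authors' procedure.

```python
def expected_percent(overall_percent: float, subset_size: int, total_size: int) -> float:
    """Scale a class percentage by the share of promoters in the subset (assumed rule)."""
    return overall_percent * subset_size / total_size


TOTAL_HUMAN_PROMOTERS = 8512      # human promoters analysed (from the text)
TISSUE_SPECIFIC_PROMOTERS = 2369  # promoters of tissue-specific genes (from the text)

# Overall human percentages quoted in the text: ATCG-less 1.08%, ATCG-middle 1.05%.
atcg_less = expected_percent(1.08, TISSUE_SPECIFIC_PROMOTERS, TOTAL_HUMAN_PROMOTERS)
atcg_middle = expected_percent(1.05, TISSUE_SPECIFIC_PROMOTERS, TOTAL_HUMAN_PROMOTERS)

print(round(atcg_less, 2), round(atcg_middle, 2))
```

Under this assumption the two values round to roughly 0.30% and 0.29%, in line with the expected frequencies reported for ATCG-less and ATCG-middle promoters.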
A comparative analysis was undertaken for 20,586 promoters from Arabidopsis thaliana, Drosophila melanogaster, Homo sapiens and Oryza sativa (Additional file 2), together with an analysis based on tissue-specific gene expression profiles in humans (Additional file 4). Following the analysis, 10 general classes of promoters have emerged. We used promoter sequences from two databases - the Eukaryotic Promoter Database and PlantProm Database. We showed that existing methods used in cryptography, such as the Kappa Index of Coincidence, can be adapted for many types of analysis in molecular genetics, perhaps to highlight certain new features of DNA sequences. Our supplemental data files allow re-analysis of our data. We also provide an animation that displays several hundred promoter patterns in succession, ordered according to their class (Additional file 9). We consider a possible subdivision of these promoter patterns into subclasses, with 2 to 4 subclasses for each major class. Furthermore, our observations suggest the existence of a network between these promoter classes. In the near future we wish to merge the information related to these classes of promoters with other available data in gene regulatory networks, in order to form a better understanding of the relationship between some genetic factors and their pathological implications.
The Eukaryotic Promoter Database and PlantProm Database provide a collection of eukaryotic promoters for which the transcription start site (TSS) has been determined experimentally (Additional file 1). We downloaded and tested 20,586 gene promoters from the Eukaryotic Promoter Database (6,649 gene promoters - Oryza sativa, 1,922 gene promoters - Drosophila melanogaster and 8,512 gene promoters - Homo sapiens) and PlantProm Database (3,503 gene promoters - Arabidopsis thaliana). We were mainly interested in the regions flanking the putative TSS. From the Eukaryotic Promoter Database we extracted promoter segments ranging from -499b to 100b, relative to the TSS. From PlantProm DB we used promoter segments ranging from 200 bp upstream to 51 bp downstream of the TSS.
We used a publicly available list of 6,534 tissue-specific gene names (under Tissue-Specific Genes based on Expressed Sequence Tags (ESTs)) from the TiGER database (gene names were sorted and redundancy was removed - Additional file 10) and we searched for their promoters in the Eukaryotic Promoter Database, in which we found 2,369 promoters. We generated 2,369 promoter patterns and we sorted them in order to highlight their proportion in each tissue (Additional file 11).
We used Visual Basic to develop a software program for promoter analysis, called PromKappa (Promoter analysis by Kappa), and a software program for sorting promoter patterns, called PromNN (Promoter analysis by Neural Network). The source code of these programs is attached as Additional file 3. Promoter patterns were generated by the PromKappa program. We used a sliding window approach to extract two types of values: Kappa IC and (C + G)%. A sliding window with a step of 1 and a window size of 30 nt allowed us to detail the structure of known promoters. Kappa Index of Coincidence values were plotted on a graph against (C + G)% values, which form a recognizable pattern composed of clusters of various sizes on the Y-axis (Figure 1A-J). The X-coordinate of each point was represented by a (C + G)% value and the Y-coordinate was represented by the corresponding Kappa IC value. As can be expected, a large window size produced smooth promoter patterns, whereas a small window size generated sharp and distinguishable characteristics of promoters, which were easily categorized.
We conducted three types of analysis. Initially, for each promoter sequence we generated a graph, representing a promoter pattern. In total, we generated 20,586 graphs (Additional file 12). These graphs were saved in BMP (Bitmap Image File) format and were sorted by their shape and density using a neural network. In the second analysis, the center of each pattern was plotted on a graph designed to show the distribution of promoters for each species. We used a color scheme to highlight the denser surfaces. Red areas represent clusters of similar promoters while blue areas represent unique or rare promoters (Figure 3A-D). For the third analysis, we measured the specificity of each promoter class among thirty tissues by using 2,369 promoters (Figure 7A,B).
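The second analysis can be illustrated with a minimal Python sketch. It assumes that the "center" of a pattern is the arithmetic mean of its ((C + G)%, Kappa IC) points and that density is estimated by counting centers per square grid cell; both choices, the cell size, and the names `pattern_center` and `density_grid` are hypothetical and are not taken from PromKappa.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]  # ((C + G)%, Kappa IC)


def pattern_center(points: Sequence[Point]) -> Point:
    """Arithmetic mean of the points of one promoter pattern (assumed definition)."""
    if not points:
        raise ValueError("a pattern needs at least one point")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    return (mean_x, mean_y)


def density_grid(centers: Iterable[Point], cell: float = 5.0) -> Counter:
    """Count pattern centers per grid cell; high counts mark dense ('red') areas."""
    counts: Counter = Counter()
    for x, y in centers:
        counts[(int(x // cell), int(y // cell))] += 1
    return counts


# Toy centers for three hypothetical promoters of one species.
centers = [(50.0, 25.0), (52.0, 26.0), (10.0, 80.0)]
grid = density_grid(centers, cell=5.0)
```

Cells with many centers would correspond to clusters of similar promoters, and cells with a single center to unique or rare promoters; mapping counts to colors is left out of the sketch.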
Pattern recognition and sorting
We have been able to demarcate promoter sequences into ten classes by using the maximum number (≥100) of appearances of similar promoter patterns. To determine the biological characteristics of promoter sequences, we have resorted to machine learning methods. All patterns were analyzed and sorted by PromNN, a pattern recognizer program using 93,264 artificial neurons and a single layer perceptron. It has the ability to learn patterns and classify them into specified classes. We used supervised learning to train the neural network by using 200 input patterns (20 of each class of promoters, 5 from each species - Additional file 13). PromNN recognized ten promoter classes and provided information about the match score and match percentage for each promoter pattern.
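To make the idea of supervised training with a single layer perceptron concrete, here is a minimal multi-class perceptron in Python. It is not PromNN and does not reproduce its 93,264 neurons or its match-percentage output; the 4-pixel toy "bitmaps", the update rule (the classic perceptron rule) and all function names are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence


def class_scores(weights: Dict[str, List[float]], pixels: Sequence[float]) -> Dict[str, float]:
    """One output unit per class: weighted sum of the pixels plus a bias (last weight)."""
    return {
        label: sum(w * p for w, p in zip(w_vec, pixels)) + w_vec[-1]
        for label, w_vec in weights.items()
    }


def classify(weights: Dict[str, List[float]], pixels: Sequence[float]) -> str:
    """Return the class whose output unit gives the highest score."""
    scores = class_scores(weights, pixels)
    return max(scores, key=scores.get)


def train(samples: Sequence[Sequence[float]], labels: Sequence[str],
          epochs: int = 20) -> Dict[str, List[float]]:
    """Perceptron rule: on a mistake, reward the target unit and penalise the guess."""
    n_inputs = len(samples[0])
    weights = {label: [0.0] * (n_inputs + 1) for label in dict.fromkeys(labels)}
    for _ in range(epochs):
        for pixels, target in zip(samples, labels):
            guess = classify(weights, pixels)
            if guess != target:
                for j, p in enumerate(pixels):
                    weights[target][j] += p
                    weights[guess][j] -= p
                weights[target][-1] += 1.0
                weights[guess][-1] -= 1.0
    return weights


# Toy training set: left-heavy vs right-heavy 4-pixel patterns (hypothetical data).
toy_samples = [[1, 1, 0, 0], [0, 0, 1, 1]]
toy_labels = ["AT-based", "CG-based"]
model = train(toy_samples, toy_labels)
```

In a real setting each input would be a flattened pattern bitmap and there would be one output unit for each of the ten classes, trained on labelled example patterns such as the 200 mentioned above.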
Cytosine and guanine content
Where CG_SW represents the percentage of cytosine and guanine from the sliding window. In this stage, the CG_SW value is relative to CG_TOT. The expression (A + T + C + G)_TOT represents the sum of occurrences of A, T, C and G from the sliding window sequence. (C + G)_SW represents the sum of C and G occurrences in the sliding window sequence. Nevertheless, in our implementation we also included the option to extract CG_SW values without considering CG_TOT.
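A minimal Python sketch of the option that extracts (C + G)% per window without considering the total content is given below. It assumes the value is the number of C and G occurrences divided by the number of A, T, C and G occurrences in the window, multiplied by 100; the function names `cg_percent` and `sliding_cg` are hypothetical, and the default window size of 30 with a step of 1 follows the settings reported in this paper.

```python
from typing import List


def cg_percent(window: str) -> float:
    """Percentage of C and G among the A, T, C and G symbols of one window."""
    window = window.upper()
    cg = window.count("C") + window.count("G")
    acgt = cg + window.count("A") + window.count("T")
    if acgt == 0:
        raise ValueError("window contains no A, T, C or G")
    return cg / acgt * 100


def sliding_cg(sequence: str, size: int = 30, step: int = 1) -> List[float]:
    """(C + G)% for every window of `size` nucleotides, moved by `step`."""
    return [
        cg_percent(sequence[start:start + size])
        for start in range(0, len(sequence) - size + 1, step)
    ]


# Short toy sequence with a small window so the values are easy to follow.
values = sliding_cg("AACCGGTT", size=4)
```

Symbols other than A, T, C and G (for example N) are ignored in the denominator here, which is a design choice of this sketch rather than documented behavior of PromKappa.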
Kappa Index of Coincidence
With small changes, the same method for measuring the Index of Coincidence has been applied for only one sequence, in which the sequence was actually compared with itself, as shown below in the algorithm implementation.
T = 0
C = 0
N = length(A) - 1
for u = 1 to N
    B = A[u + 1] … A[length(A)]
    for i = 1 to length(B)
        if A[i] = B[i] then C = C + 1
    next i
    T = T + (C / length(B) × 100)
    C = 0
next u
IC = Round((T / N), 2)
Where A represents the sliding window content, N is the number of shifts applied to A (the length of the sliding window minus one), B is the part of A that starts at position u + 1 (that is, A shifted by u positions), C counts the number of coincidences occurring between sequence B and sequence A for the current shift, and the T variable accumulates the percentage of coincidences found between each sequence B and the sequence A. IC is the average of these percentages over all N shifts, rounded to two decimals.
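The algorithm above translates almost line by line into Python. The sketch below is a re-implementation for illustration, not the PromKappa source; it assumes that B runs to the end of the window for every shift, and the function names `kappa_ic` and `pattern_points` are hypothetical. The second function pairs each window's (C + G)% with its Kappa IC, using the window size of 30 and step of 1 reported in this paper and assuming the window holds only A, T, C and G.

```python
from typing import List, Tuple


def kappa_ic(window: str) -> float:
    """Kappa Index of Coincidence of a sequence compared with shifted copies of itself."""
    n = len(window) - 1  # number of shifts, as in the pseudocode
    if n < 1:
        raise ValueError("window must contain at least two symbols")
    total = 0.0
    for u in range(1, n + 1):
        shifted = window[u:]  # B: the window shifted by u positions
        # C: positions where the window and its shifted copy coincide
        matches = sum(1 for a, b in zip(window, shifted) if a == b)
        total += matches / len(shifted) * 100
    return round(total / n, 2)


def pattern_points(sequence: str, size: int = 30, step: int = 1) -> List[Tuple[float, float]]:
    """One (x, y) point per window: x = (C + G)%, y = Kappa IC."""
    points = []
    for start in range(0, len(sequence) - size + 1, step):
        window = sequence[start:start + size].upper()
        cg = (window.count("C") + window.count("G")) / len(window) * 100
        points.append((cg, kappa_ic(window)))
    return points
```

A homopolymer tract gives the maximum value (`kappa_ic("AAAA")` is 100.0), while a window with no repeated alignment gives the minimum (`kappa_ic("ACGT")` is 0.0), which is why homopolymer-rich promoters sit high on the Y-axis of the patterns.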
This work was supported by a grant of the Romanian National Authority for Scientific Research, CNCS-UEFISCDI, project number PN-II-ID-PCE-2011-3-0429.
- Levine M, Tjian R: Transcription regulation and animal diversity. Nature. 2003, 424: 147-151. 10.1038/nature01763.View ArticlePubMedGoogle Scholar
- Smale ST, Kadonaga JT: The RNA polymerase II core promoter. Annu Rev Biochem. 2003, 72: 449-479. 10.1146/annurev.biochem.72.121801.161520.View ArticlePubMedGoogle Scholar
- Hahn S: Structure and mechanism of the RNA polymerase II transcription machinery. Nat Struct Mol Biol. 2004, 11: 394-403. 10.1038/nsmb763.PubMed CentralView ArticlePubMedGoogle Scholar
- Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol. 1990, 212: 563-578. 10.1016/0022-2836(90)90223-9.View ArticlePubMedGoogle Scholar
- Mantovani R: The molecular biology of the CCAAT-binding factor NF-Y. Gene. 1999, 239: 15-27. 10.1016/S0378-1119(99)00368-6.View ArticlePubMedGoogle Scholar
- Fujimori S, Washio T, Tomita M: GC-compositional strand bias around transcription start sites in plants and fungi. BMC Genomics. 2005, 6: 26-10.1186/1471-2164-6-26.PubMed CentralView ArticlePubMedGoogle Scholar
- Tatarinova T, Brover V, Troukhan M, Alexandrov N: Skew in CG content near the transcription start site in Arabidopsis thaliana. Bioinformatics. 2003, 19 (Suppl. 1): 1313-1314.Google Scholar
- Molina C, Grotewold E: Genome wide analysis of Arabidopsis core promoters. BMC Genomics. 2005, 6: 25-10.1186/1471-2164-6-25.PubMed CentralView ArticlePubMedGoogle Scholar
- Juo ZS, Chiu TK, Leiberman PM, Baikalov I, Berk AJ, Dickerson RE: How proteins recognize the TATA box. J Mol Biol. 1996, 261: 239-254. 10.1006/jmbi.1996.0456.View ArticlePubMedGoogle Scholar
- Kiran K, Ansari SA, Srivastava R, Lodhi N, Chaturvedi CP, Sawant SV, Tuli R: The TATA-box sequence in the basal promoter contributes to determining light-dependent gene expression in plants. Plant Physiol. 2006, 142: 364-376. 10.1104/pp.106.084319.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamamoto YY, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, Abe T: Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics. 2007, 8: 67-10.1186/1471-2164-8-67.PubMed CentralView ArticlePubMedGoogle Scholar
- Ioshikhes IP, Zhang MQ: Large-scale human promoter mapping using CpG islands. Nat Genet. 2000, 26: 61-63. 10.1038/79189.View ArticlePubMedGoogle Scholar
- Ludwig MZ: Functional evolution of noncoding DNA. Curr Opin Genet Dev. 2002, 12: 634-639. 10.1016/S0959-437X(02)00355-6.View ArticlePubMedGoogle Scholar
- Yamamoto YY, Yoshioka Y, Hyakumachi M, Obokata J, Yoshiharu Y: Characteristics of core promoter types with respect to gene structure and expression in Arabidopsis thaliana. DNA Res. 2011, 18: 333-342. 10.1093/dnares/dsr020.PubMed CentralView ArticlePubMedGoogle Scholar
- Fukue Y, Sumida N, Nishikawa J, Ohyama T: Core promoter elements of eukaryotic genes have a highly distinctive mechanical property. Nucleic Acids Res. 2004, 32: 5834-5840. 10.1093/nar/gkh905.PubMed CentralView ArticlePubMedGoogle Scholar
- Florquin K, Saeys Y, Degroeve S, Rouzé P, Van de Peer Y: Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005, 33: 4255-4264. 10.1093/nar/gki737.PubMed CentralView ArticlePubMedGoogle Scholar
- Kanhere A, Bansal M: Structural properties of promoters: similarities and differences between prokaryotes and eukaryotes. Nucleic Acids Res. 2005, 33: 3165-3175. 10.1093/nar/gki627.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamamoto YY, Ichida H, Abe T, Suzuki Y, Sugano S, Obokata J: Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res. 2007, 35: 6219-6226. 10.1093/nar/gkm685.PubMed CentralView ArticlePubMedGoogle Scholar
- Dineen DG, Wilm A, Cunningham P, Higgins DG: High DNA melting temperature predicts transcription start site location in human and mouse. Nucleic Acids Res. 2009, 37: 7360-7367. 10.1093/nar/gkp821.PubMed CentralView ArticlePubMedGoogle Scholar
- Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engström PG, Frith MC, Forrest AR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F, et al: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. 2006, 38: 626-635. 10.1038/ng1789.View ArticlePubMedGoogle Scholar
- Friedman WF: Department of Ciphers. The index of coincidence and its applications in cryptology. 1922, Geneva: Riverbank Laboratories.
- Mountjoy M: The Bar Statistics. 1963, USA: NSA Technical Journal VII (2,4).
- Friedman WF, Callimahos LD: Military Cryptanalytics. Part I, 2. 1985, USA: Reprinted by Aegean Park Press.
- Kahn D: The Codebreakers - The Story of Secret Writing. 1996, New York: Macmillan.
- Schmid CD, Perier R, Praz V, Bucher P: EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 2006, 34 (Database issue): D82-D85.
- Périer RC, Praz V, Junier T, Bonnard C, Bucher P: The eukaryotic promoter database (EPD). Nucleic Acids Res. 2000, 28 (1): 302-303. 10.1093/nar/28.1.302.
- Shahmuradov IA, Gammerman AJ, Hancock JM, Bramley PM, Solovyev VV: PlantProm: a database of plant promoter sequences. Nucleic Acids Res. 2003, 31: 114-117. 10.1093/nar/gkg041.
- Liu X, Yu X, Zack DJ, Zhu H, Qian J: TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinforma. 2008, 9: 271. 10.1186/1471-2105-9-271.
- Yu X, Lin J, Zack DJ, Qian J: Identification of tissue-specific cis-regulatory modules based on interactions between transcription factors. BMC Bioinforma. 2007, 8: 437. 10.1186/1471-2105-8-437.
- Yu X, Lin J, Zack DJ, Qian J: Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Res. 2006, 34: 4925-4936. 10.1093/nar/gkl595.
- Nelson HC, Finch JT, Luisi BF, Klug A: The structure of an oligo(dA).oligo(dT) tract and its biological implications. Nature. 1987, 330: 221-226. 10.1038/330221a0.
- Zhou Y, Bizzaro JW, Marx KA: Homopolymer tract length dependent enrichments in functional regions of 27 eukaryotes and their novel dependence on the organism DNA (G + C)% composition. BMC Genomics. 2004, 5: 95. 10.1186/1471-2164-5-95.
- Gershenzon NI, Ioshikhes IP: Synergy of human Pol II core promoter elements revealed by statistical sequence analysis. Bioinformatics. 2005, 21: 1295-1300. 10.1093/bioinformatics/bti172.
- Suzuki Y, Tsunoda T, Sese J, Taira H, Mizushima-Sugano J, Hata H, Ota T, Isogai T, Tanaka T, Nakamura Y, Suyama A, Sakaki Y, Morishita S, Okubo K, Sugano S: Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001, 11: 677-684. 10.1101/gr.GR-1640R.
- Yang C, Bolotin E, Jiang T, Sladek FM, Martinez E: Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene. 2007, 389: 52-65. 10.1016/j.gene.2006.09.029.
- Cairns BR: The logic of chromatin architecture and remodelling at promoters. Nature. 2009, 461: 193-198. 10.1038/nature08450.
- Ioshikhes IP, Albert I, Zanton SJ, Pugh BF: Nucleosome positions predicted through comparative genomics. Nat Genet. 2006, 38: 1210-1215. 10.1038/ng1878.
- Albert I, Mavrich TN, Tomsho LP, Qi J, Zanton SJ, Schuster SC, Pugh BF: Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature. 2007, 446: 572-576. 10.1038/nature05632.
- Tirosh I, Berman J, Barkai N: The pattern and evolution of yeast promoter bendability. Trends Genet. 2007, 23: 318-321. 10.1016/j.tig.2007.03.015.
- Tirosh I, Barkai N: Two strategies for gene regulation by promoter nucleosomes. Genome Res. 2008, 18: 1084-1091. 10.1101/gr.076059.108.
- Cai S, Han HJ, Kohwi-Shigematsu T: Tissue-specific nuclear architecture and gene expression regulated by SATB1. Nat Genet. 2003, 34: 42-51. 10.1038/ng1146.
- Iyer V, Struhl K: Poly(dA:dT), a ubiquitous promoter element that stimulates transcription via its intrinsic DNA structure. EMBO J. 1995, 14: 2570-2579.
- Suter B, Schnappauf G, Thoma F: Poly(dA:dT) sequences exist as rigid DNA structures in nucleosome-free yeast promoters in vivo. Nucleic Acids Res. 2000, 28: 4083-4089. 10.1093/nar/28.21.4083.
- Filetici P, Aranda C, Gonzàlez A, Ballario P: GCN5, a yeast transcriptional coactivator, induced chromatin reconfiguration of HIS3 promoter in vivo. Biochem Biophys Res Commun. 1998, 242: 84-87. 10.1006/bbrc.1997.7918.
- Koch KA, Thiele DJ: Functional analysis of a homopolymeric (dA-dT) element that provides nucleosome access to yeast and mammalian transcription factors. J Biol Chem. 1999, 274: 23752-23760. 10.1074/jbc.274.34.23752.
- Fashena SJ, Reeves R, Ruddle NH: A poly(dA:dT) upstream activating sequence binds high-mobility group I protein and contributes to lymphotoxin (tumor necrosis factor-β) gene regulation. Mol Cell Biol. 1992, 12: 894-903.
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Fingerman IM, Geer LY, Helmberg W, Kapustin Y, Krasnov S, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Karsch-Mizrachi I, Ostell J, Panchenko A, Phan L, Pruitt KD, Schuler GD, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012, 40: D13-D25. 10.1093/nar/gkr1184.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.