Low nucleosome occupancy is encoded around functional human transcription factor binding sites
© Daenen et al; licensee BioMed Central Ltd. 2008
Received: 11 January 2008
Accepted: 15 July 2008
Published: 15 July 2008
Transcriptional regulation of genes in eukaryotes is achieved by the interactions of multiple transcription factors with arrays of transcription factor binding sites (TFBSs) on DNA and with each other. Identification of these TFBSs is an essential step in our understanding of gene regulatory networks, but computational prediction of TFBSs with either consensus or commonly used stochastic models such as Position-Specific Scoring Matrices (PSSMs) results in an unacceptably high number of hits consisting of a few true functional binding sites and numerous false non-functional binding sites. This is due to the inability of the models to incorporate higher order properties of sequences including sequences surrounding TFBSs and influencing the positioning of nucleosomes and/or the interactions that might occur between transcription factors.
Significant improvement can be expected through the development of a new framework for the modeling and prediction of TFBSs that considers explicitly these higher order sequence properties. It would be particularly interesting to include in the new modeling framework the information present in the nucleosome positioning sequences (NPSs) surrounding TFBSs, as it can be hypothesized that genomes use this information to encode the formation of stable nucleosomes over non-functional sites, while functional sites have a more open chromatin configuration.
In this report we evaluate the usefulness of the latter feature by comparing the nucleosome occupancy probabilities around experimentally verified human TFBSs with the nucleosome occupancy probabilities around false positive TFBSs and in random sequences.
We present evidence that nucleosome occupancy is remarkably lower around true functional human TFBSs as compared to non-functional human TFBSs, which supports the use of this feature to improve current TFBS prediction approaches in higher eukaryotes.
Genomic DNA in eukaryotic cells is highly compacted in the nucleus into several levels of chromatin structures that ultimately make up the chromosomes. This compaction is, at the lowest level, achieved by wrapping the double-stranded DNA (dsDNA) around a histone protein octamer into particles known as nucleosomes. These are molecular spools, resembling beads on a string when observed by electron microscopy. In a single nucleosome, a DNA sequence of exactly 147 bp is wrapped around the histone core complex. The DNA needs to be sharply bent so that it can be tightly wrapped around the core proteins. This sharp bending occurs at every helical repeat, which is roughly every 10 bp. Nucleosomes are separated from each other by linker DNA of up to 50 bp. Two recently published, independent studies, one by Segal et al.  and one by Ioshikhes et al. , present evidence that the primary DNA sequence can facilitate the bending of the helix around the histone octamer by presenting AA dinucleotides at those positions where the phosphodiester backbone of the helix faces towards the histone core. A sequence favouring nucleosome wrapping is therefore composed of AA dinucleotides spaced approximately 10 bp apart. TT dinucleotides are observed approximately 5 bp in either direction of the AA dinucleotides, as this is the position where the complementary helix faces the core complex. This particular arrangement and periodicity of dinucleotides along a stretch of DNA is referred to as a nucleosome positioning sequence or NPS.
The binding of nucleosomes is highly dependent on the individual DNA sequences. The affinity of different DNA sequences for nucleosomes may vary by 1000-fold or more . The genome thereby appears to encode, at least partly, its own packaging by positioning nucleosomes using these NPSs . Unfortunately, this 'genomic code' is subtle and diffuse and it is often difficult to distinguish it from random noise . This is the main reason why its discovery arrived so many years after the initial deciphering of the regular genetic information stored in DNA. The genomic code for nucleosome positioning could be even more diffuse than previously thought, as the structure of the histone core particle reveals that tetranucleotides or possibly even longer sequences are involved, rather than the dinucleotide patterns that were recently discovered . Despite the diffusely encoded rules, the influence of the DNA sequence has nevertheless been established. However, the picture is somewhat more complex than this. The DNA-nucleosome interaction is dynamic, and the nucleosome occupancy is considered to be in a state of thermodynamic equilibrium. While nucleosomes can bind to almost any DNA sequence, the binding affinity is highly dependent on the specific sequence. Hence, the probability that a certain sequence is occupied by a nucleosome depends partly on the DNA sequence itself, and partly on other factors. These factors include epigenetic profiles and energy-dependent remodeling complexes [5, 6], steric hindrance and competition with other DNA binding molecules, such as transcription factors , all of which play roles in shaping the actual nucleosome occupancy patterns found along a DNA sequence in vivo.
In this article, we present evidence that nucleosomes could play an important role in the recognition of transcription factor binding sites (TFBS), guiding transcription factors to their target sites on the DNA. The nucleosomes appear to be positioned in vivo in such a way that functional binding sites are exposed through a relaxed, open chromatin structure, readily accessible to any transcription factor. On the other hand, the overwhelmingly large number of cryptic (i.e. false positive) binding sites that occur throughout the eukaryotic genome tend to be masked and inaccessibly wrapped in nucleosome particles, which are themselves stacked into a higher-order, dense chromatin configuration. This intriguing hypothesis has been proposed before by others , but to the best of our knowledge it has never been fully investigated in higher eukaryotes. Segal et al. presented evidence supporting the hypothesis in yeast . In the present paper, we present further evidence based on human data, confirming the hypothesis in higher eukaryotes.
To discriminate between true transcription factor binding sites (TFBSs) and false positives, the predicted nucleosome occupancy patterns of the respective surrounding genomic regions were examined.
Construction of the data sets
Nucleosome positions were predicted for four sets of genomic sequences. The first set includes only sequences that contain a TFBS the existence of which has been experimentally verified, not merely a motif presumed or predicted to be one. This set will be referred to as the 'true positives' set. The second set includes only reference or control sequences, which do not contain, or at least are not very likely to contain, a true TFBS, but that do contain a motif matching that of a true TFBS instead. Hence this set will be referred to as the 'false positives' set. A set of experimentally verified TFBSs was obtained from the Open Regulatory Annotation Database, which is also known as ORegAnno . A simple query for human TFBSs whose in vivo functionality is supported by experimental evidence resulted in 232 usable true positive TFBSs. Using the genome mappings provided by OregAnno, each TFBS was extracted along with its flanking sequence from the finished human genome assembly (hg18 – March 2006) , thus constructing the true positives data set. For this task an extractor script was written in Perl. For each TFBS, 3 kb of flanking sequence was included in both up- and downstream directions. Consequently, all the TFBSs in the data set are aligned: the starting position of the TFBSs is positioned precisely at the 3000th nucleotide in every sequence. The second set, containing false positives, was constructed in a comparable fashion. Using 'findMotif', a tool using an exact string matching algorithm that was written by Jim Kent and included in the UCSC Genome Browser source code , we located exact matches of TFBS motifs from the true positives dataset in the human promoters database, 'upstream1000.zip', which contains the coordinates and sequences of the regions 1000 bp upstream to the TSS of human genes and can be downloaded from . The overwhelming majority of these matches are false positive TFBS hits. This assumption is based on the experience that algorithmic TFBS prediction results in an unacceptable high number of hits consisting of a few true functional binding sites and numerous false non-functional binding sites . As demonstrated by Tronche et al.  the high number of false predictions is not simply a result of inadequate model frameworks: the predicted sites are effectively bound by transcription factors in vitro. In fact the methods do detect potential binding sites, albeit not necessarily functional ones. In a real cell due to the context and the combinatorial nature of regulation only 0.1% of the predicted putative TFBSs will be functional . From the returned matches, five were randomly sampled (unless only five matches or less were returned). Only motifs 5–16 bp long were searched for, as this is the limit imposed by the findMotif program. Next, the absolute chromosomal coordinates (genome mappings) were calculated for each hit. Again, as with the true positives set that was described above, the new set of false positive sequences, containing 3 kb of flanking sequence on each side, was extracted from the hg18 (March 2006) human genome assembly using the genome mappings that were obtained from the wrapper. A total of 694 false positive reference sequences were generated this way. Potential overlap of the false positives data set with the true positives data set was avoided by filtering the false positives data set against the true positives data set and the chip-chip fragment database contained in the Transfac Professional Rel. 11.4 database , reducing the number of false positive reference sequences from 694 to 575. Finally, two data sets of 232 random DNA sequences with a length of 6 kb were generated using the random-seq application included in the RSA-tools from van Helden . As both our data sets of experimentally verified TFBSs and false positives are derived from non-coding upstream sequences, we selected this available option in the random-seq tool. One data set contained sequences generated with a 0th order Markov chain model calibrated to obtain sequences with nucleotide frequencies similar to those observed in human non-coding upstream sequences. The other data set contained sequences generated with a 1st order Markov model calibrated to obtain sequences with similar dinucleotide frequencies as human non-coding upstream sequences. This is of importance as dinucleotides are known to be important for nucleosome positioning.
The nucleosome arrangement along all of these sequences was predicted using the thermodynamic model published by Segal et al. , who also provided a 32-bit Linux binary executable, wrapped by Perl scripts and released under the GNU LGPL, especially for this purpose . A regular Linux box running Fedora Core 6 (×86 architecture) was used to carry out the nucleosome prediction calculations. Automation of the entire process was achieved using shell scripts and customized Perl wrapper scripts to process input and output of the binary executable. The binary executable itself was used to calculate the nucleosome occupancy using a nucleosome-DNA interaction model that was developed by Segal et al.  using yeast sequence data. For each base pair position, Pstart and Poccupied probabilities are returned by the executable. Pstart is the probability that a particular position is a nucleosome start site. The Poccupied is the probability that it is occupied by a nucleosome at any point in time. According to Segal et al. 'stably positioned nucleosomes' are defined by Pstart > 0.2 . Only the stably positioned nucleosomes were extracted from the executable's output, the probabilities at other positions in the sequence were reduced to zero. This is of particular importance, because once several different sequences are examined to find average probabilities at each sequence position, then the less stably positioned nucleosomes will start to constitute noise.
Evaluation of the nucleosome occupancy
To examine the global nucleosome occupancy distribution in each data set, the nucleosome occupancy probabilities at every sequence position were summed across all sequences in the data set, resulting in the cumulative occupancy probability for every sequence position. Next, the cumulative probability was normalized against the true positives data set. This normalization is necessary to account for the different numbers of sequences in the data sets. Finally, the resulting cumulative probabilities were plotted for each data set using gnuplot . The region surrounding the TFBS could then be examined for a global nucleosome occupancy distribution. The above procedure was repeated iteratively, each time using a different Pstart threshold value for extracting the stable nucleosomes. Finally, for the Pstart threshold value that yielded the biggest difference between the two data sets, a Welch two sample t-test was carried out in R  to assess the statistical significance of the observed differences between the true positives data set and the false positives data set. The Welch two sample t-test was used because the populations of Pstart threshold values of the true positives and false positives had unequal variances.
Nucleosome occupancies around true and false positive transcription factor binding sites
Determination of the optimal Pstart threshold value
In addition, the most optimal Pstart threshold value was determined. The difference between the average Poccupied of the false positives set and true positives set is plotted in blue in Figure 3. The maximal difference (0.2) is observed for α = 0.01. For each data point in the graph, a Welch two sample t-test was performed on the data sets to assess the statistical significance of their difference in mean Poccupied values. The obtained p-values are in red. The difference in mean between the true and false positive data sets is most significant for α = 0.01, with a p-value as low as 8.46e-9.
Effect of motif lengths and sampling region on nucleosome occupancy probability
There is little difference between the average nucleosome occupancy probabilities of the two reference sets. Each set was sampled from a region spanning a different length upstream of the TSS. The reference set sampled up to 1 kb upstream of the TSS and has a lower standard deviation for most data points, indicating that in a smaller region there is less choice in motif matches to randomly sample from. This is supported by the difference in sample sizes for the longer motif lengths. However, the larger standard deviation of the reference set that was sampled up to 5 kb upstream of the TSS can also be ascribed to the possibility that this reference set exceeds the boundaries of the promoter region, and as such it could also contain regions with a different bias in nucleosome occupancy. Either way, although the average occupancies between the two data sets do not differ greatly, sampling from a region up to 1 kb upstream of the TSS is preferential because it yields a more 'stable' false positives data set.
The computational prediction of TFBSs from a nucleotide sequence alone results in an unacceptable high number of false positives in the predicted sites. This has been termed the futility theorem by Wasserman et al.  as nearly every predicted TFBS has no function in vivo. To improve the accuracy of the predictions, new approaches take into account information in and beyond the TFBSs, such as the preferences of binding site locations and the co-occurrences of other motifs in promoter regions [20, 21].
In the present study, we evaluate whether nucleosome occupancy around functional human TFBSs is a powerful enough feature to include in TFBS prediction algorithms as otherwise its inclusion would result in the addition of extra noise to the binding site detection system without the yield of more prediction accuracy. For that purpose, the nucleosome occupancy (density) around true TFBSs was compared to that around cryptic or false positive TFBSs using large human data sets and the significance of the difference between the two data sets was then statistically evaluated.
Comparison of true positives data sets with false positives data sets should be a reliable, genuine and fair. Therefore it is of particular importance to construct a correct reference (false positives) data set. Thus, when constructing a reference data set by collecting reference sequences, care has to be taken not to introduce a bias in the nucleosome occupancy. The human genome contains many regions that could easily result in such a bias: for instance, intergenic, intronic, exonic, and promoter regions probably have different nucleosome occupancy distributions due to differences in frequency and in organization of NPSs in each of these regions. Ultimately, it would be impossible to tell whether a significant difference from the reference data set with the true positives set was due to the true TFBS itself, to the bias of the region it is contained in, or to a bias in the reference set. To eliminate this uncertainty, we ensured from the beginning that the reference set and true positives set showed the same bias. Since most TFBSs are located in gene promoters, only promoter regions were used in the reference set. These promoter regions are located immediately upstream of the transcription start sites (TSS) of genes. How far upstream of the TSS they actually start is vague and differs from gene to gene. A reference set containing promoter regions could, at best, contain the first × kb upstream of the TSSs of annotated genes, where × is a safe number. Traditionally, a region up to 1.5 kb upstream of the TSS is scanned for putative TFBSs, with most TFBSs located within the first 0.5 kb. For that reason, a region spanning the first 1 kb upstream of the TSS was used to construct a false positives data set. Unfortunately, there are no databases for false TFBSs. To construct a false positives data set, random hits of each TFBS motif from the true positives data set were located in promoter regions, and using the returned genome coordinates, the flanking sequences were excised from the human genome assembly. This approach relies on the fact that by chance alone a short sequence motif, such as that of a TFBS, comprising only a few base pairs in length, occurs in a random genomic region at a very high frequency. This is caused by the limitation of the nucleic acids alphabet to only four bases. Typically, most of these hits are not true TFBSs but cryptic ones, i.e. the motifs occurs by chance. Randomly sequences generated in silico were also used as a reference. A randomly generated sequence, i.e. a non genomic sequence, is not likely to contain the recurring dinucleotide patterns, every 10 bp or so, that are required for a tight nucleosome-DNA interaction. It is therefore useful to establish the basal nucleosome occupancy in the biological system, or at least the basal level predicted by the thermodynamic model. In a genomic sequence, a lower nucleosome occupancy level will not be predicted.
Our results (Figure 1) suggest that the genomic sequence immediately surrounding a true TFBS does not contain any 'instructions' in the form of patterns for binding nucleosomes – just like in silico generated noise does not and cannot contain these either. Note, however, that this does not mean that these regions should be free of nucleosomes in vivo. The nucleosome-DNA interaction is in thermodynamic equilibrium. Several factors affect the binding of the nucleosomes: steric effects, competing molecules (such as transcription factors), nucleosome remodeling complexes, and genomic instructions. The genomic instructions merely help to push the equilibrium in the direction of higher nucleosome occupancy. In the absence of these genomic instructions, nucleosomes will still be able to bind to the DNA, but with reduced affinity, and hence with both a reduced likelihood to occupy the particular region and an increased susceptibility towards shifting positions. Considering the results depicted in Figures 1, the genomic instructions seem to be remarkably absent around true TFBSs, making them more accessible to transcription factors.
As discussed above, the region where false positive hits are sampled is of particular importance. The observed drop in nucleosome occupancy where a true TFBS is positioned could be, for instance, due to generally reduced nucleosome occupancy specific to the region proximal to the TSS. The first 1 kb upstream of the TSS most often contains several cis-regulatory elements (CREs) and would thus benefit from a generally open chromatin structure. As a result, the observed drop in nucleosome occupancy would not be specific for the particular (functional) TFBS anymore, but for the whole region. If false positive hits were to be sampled from a very large promoter region (e.g. up to 5 kb upstream of TSS), the majority of the hits would not originate from the first 1 kb, where most true TFBSs are positioned, but from regions more upstream, where the chromatin structure is likely to be more closed. Thus, false positive hits must be sampled from the region in which the functional TFBSs are located. If true TFBSs were enjoying an open chromatin structure specific merely to the region they are located in, then the false positive hits would also exhibit lower nucleosome occupancy. This was not the case, as demonstrated in Figure 4, indicating that the lower nucleosome occupancy was specific for the true TFBS and not for the region it was contained in.
We present evidence in higher eukaryotes supporting the hypothesis that transcription factors are guided to their target sites on the DNA by the nucleosome configuration. The positioning of these nucleosomes is intrinsically encoded by the genome itself, through recurring dinucleotide patterns that interact with the histone core of a nucleosome particle. Investigation of these nucleosome positioning sequences (NPSs) revealed that they were remarkably absent around functional TFBSs compared to the reference sequences (false-positive TFBSs). The absence of these patterns around true TFBSs does not necessarily mean the absence of nucleosomes. A reduced probability that the particular sites will be occupied by nucleosomes can be expected, because the absence of these positioning patterns will weaken the nucleosome-DNA interaction, but it will not prevent it. As a result, functional TFBSs are less likely to be blocked by dense chromatin configurations. Put differently, they are more likely to be exposed and thus accessible to their transcription factors. This is of particular importance, as it allows the in silico discrimination between functional TFBSs and the far more numerous false positive TFBSs that are found in the human genome. It could thereby substantially improve current TFBS prediction techniques, which are based on sequence conservation and tend to yield too many false-positive hits.
This research was supported by the Fund for Scientific Research (FWO)-Flanders, Interuniversity Attraction Poles Programme of the Belgian Science Policy, Belgian Federation against Cancer, and FP6 of the European Union (BRECOSM). We acknowledge Dr. Amin Bredan for critical reading of the manuscript.
- Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J: A genomic code for nucleosome positioning. Nature. 2006, 442 (7104): 772-778. 10.1038/nature04979.PubMedPubMed CentralView ArticleGoogle Scholar
- Ioshikhes IP, Albert I, Zanton SJ, Pugh BF: Nucleosome positions predicted through comparative genomics. Nat Genet. 2006, 38 (10): 1210-1215. 10.1038/ng1878.PubMedView ArticleGoogle Scholar
- Ercan S, Lieb JD: New evidence that DNA encodes its packaging. Nat Genet. 2006, 38 (10): 1104-1105. 10.1038/ng1006-1104.PubMedView ArticleGoogle Scholar
- Richmond TJ, Davey CA: The structure of DNA in the nucleosome core. Nature. 2003, 423 (6936): 145-150. 10.1038/nature01595.PubMedView ArticleGoogle Scholar
- Becker PB: Gene regulation: a finger on the mark. Nature. 2006, 442 (7098): 31-32. 10.1038/442031a.PubMedView ArticleGoogle Scholar
- Qiu J: Epigenetics: unfinished symphony. Nature. 2006, 441 (7090): 143-145. 10.1038/441143a.PubMedView ArticleGoogle Scholar
- Richmond TJ: Genomics: predictable packaging. Nature. 2006, 442 (7104): 750-752. 10.1038/442750a.PubMedView ArticleGoogle Scholar
- Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJ: ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics (Oxford, England). 2006, 22 (5): 637-640. 10.1093/bioinformatics/btk027.View ArticleGoogle Scholar
- The UCSC Genome Annotation Database. [http://genome.ucsc.edu]
- UCSC Genome Browser source. [http://hgdownload.cse.ucsc.edu/admin/jksrc.zip]
- UCSC Genome Browser FTP. [http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/upstream1000.zip]
- Fickett JW: Quantitative discrimination of MEF2 sites. Mol Cell Biol. 1996, 16 (1): 437-441.PubMedPubMed CentralView ArticleGoogle Scholar
- Tronche F, Ringeisen F, Blumenfeld M, Yaniv M, Pontoglio M: Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome. Journal of molecular biology. 1997, 266 (2): 231-245. 10.1006/jmbi.1996.0760.PubMedView ArticleGoogle Scholar
- Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nature reviews. 2004, 5 (4): 276-287. 10.1038/nrg1315.PubMedView ArticleGoogle Scholar
- Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic acids research. 1996, 24 (1): 238-241. 10.1093/nar/24.1.238.PubMedPubMed CentralView ArticleGoogle Scholar
- van Helden J: Regulatory sequence analysis tools. Nucleic acids research. 2003, 31 (13): 3593-3596. 10.1093/nar/gkg567.PubMedPubMed CentralView ArticleGoogle Scholar
- The Segal Lab. [http://genie.weizmann.ac.il/]
- Gnuplot. [http://www.gnuplot.info/]
- The R Project for Statistical Computing. [http://www.r-project.org/]
- Holloway DT, Kon M, DeLisi C: Integrating genomic data to predict transcription factor binding. Genome informatics. 2005, 16 (1): 83-94.PubMedGoogle Scholar
- Jiang B, Zhang MQ, Zhang X: OSCAR: one-class SVM for accurate recognition of cis-elements. Bioinformatics (Oxford, England). 2007, 23 (21): 2823-2828. 10.1093/bioinformatics/btm473.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.