Identifying Positioned Nucleosomes with Epigenetic Marks in Human from ChIP-Seq
© Zhang et al. 2008
Received: 09 July 2008
Accepted: 13 November 2008
Published: 13 November 2008
Skip to main content
© Zhang et al. 2008
Received: 09 July 2008
Accepted: 13 November 2008
Published: 13 November 2008
In vivo positioning and covalent modifications of nucleosomes play an important role in epigenetic regulation, but genome-wide studies of positioned nucleosomes and their modifications in human still remain limited.
This paper describes a novel computational framework to efficiently identify positioned nucleosomes and their histone modification profiles from nucleosome-resolution histone modification ChIP-Seq data. We applied the algorithm to histone methylation ChIP-Seq data in human CD4+ T cells and identified over 438,000 positioned nucleosomes, which appear predominantly at functionally important regions such as genes, promoters, DNase I hypersensitive regions, and transcription factor binding sites. Our analysis shows the identified nucleosomes play a key role in epigenetic gene regulation within those functionally important regions via their positioning and histone modifications.
Our method provides an effective framework for studying nucleosome positioning and epigenetic marks in mammalian genomes. The algorithm is open source and available at http://liulab.dfci.harvard.edu/NPS/.
Chromatin structure widely manifests itself in various aspects of mammalian development and disease. The key structural element of chromatin is the nucleosome, which consists of an octameric histone core wrapped by 146 bps of DNA . Nucleosomes play two major roles in epigenetic regulation of gene expression. The first is to limit DNA accessibility to cellular machinery [2–5] through specific positioning of nucleosome core particles, which can be remodeled in an ATP-dependent manner. The second is to regulate transcriptional activities through covalent modifications (e.g. methylation, acetylation and phosphorylation) of the tails of four core histone types H2A, H2B, H3 and H4 [6–9]. Therefore, characterizing the global locations and modification marks of positioned nucleosomes is a crucial step towards unraveling the mechanism of epigenetic regulation in eukaryotes.
High-throughput mapping of positioned nucleosomes has been conducted in yeast [10, 11] and selected human promoters  using high resolution tiling microarrays. Several studies have also profiled genome-scale histone modification marks using Chromatin ImmunoPrecipitation (ChIP) coupled with microarrays (ChIP-chip) [8, 13–15], although only one study in yeast  mapped the modifications at nucleosome resolution. Recent developments in high-throughput sequencing techniques offer a promising alternative to microarrays in analyzing genome-wide nucleosome positioning and histone modifications. Barski et al.  used Solexa to conduct nucleosome-resolution ChIP-Seq of twenty different histone methylation marks as well as H2A.Z in human CD4+ T cells, although the analysis in that study was performed at a much lower resolution than offered by the data. Recently, the same group also sequenced MNase-digested nucleosomes and performed ChIP-Seq of eighteen histone acetylation marks [17, 18]. As an extension of their histone methylation study, they attempted to reveal the effects of nucleosome positioning  and combinatorial patterns of histone modifications on gene activity  separately. However, genome-wide studies of positioned nucleosomes coupled with histone modifications in human still remain limited.
We thus developed a novel computational framework for analyzing histone modification ChIP-Seq data at nucleosome resolution in a global scale. By combining the sequenced tags from all histone methylation ChIP-Seq data  and employing signal processing techniques, we comprehensively identified modified positioned nucleosomes in functionally important regions and revealed their distribution throughout the human genome. Through subsequent statistical modeling, we were then able to assign histone methylation marks to each positioned nucleosome in CD4+ T cells, revealing the key roles of nucleosome positioning and modifications in epigenetic gene regulation.
The dataset from Barski et al. contains one ChIP-Seq profile for each histone modification studied in CD4+ human T cells, totaling 185.7 million 25 nt tags that are uniquely mapped to the human genome . Despite Solexa technology's unique ability to sequence millions of tags from each sample, most locations in the human genome still lack enough coverage from a single sample alone to allow an accurate identification of positioned nucleosomes. To overcome this difficulty, we combined all 185 million ChIP-Seq tags in our search for positioned nucleosomes. Each Solexa sequencing tag of 25 nt represents one end of a ChIP-DNA fragment, corresponding to approximately 150 nt long mono-nucleosomal DNA resulting from MNase digestion. Therefore, we extended each sequenced tag to 150 nt in the 3' direction to represent a whole mono-nucleosome. This method of combining ChIP-Seq tags from 21 different histone modifications and extending each tag increased the original sequence data from individual sample by approximately 100-fold (approximately 21- and 6-folds from tag combination and extension, respectively) and provided over 10-fold non-redundant genome coverage (since all Solexa tags are uniquely mapped to the genome), sufficient to identify positioned nucleosomes.
Most detected peaks had similar numbers of tags from plus and minus strands, since double stranded DNA fragments are equally likely to be sequenced from either direction. However, some peaks had significantly unbalanced tag counts (i.e., the number of tags from one strand was more than 4-fold higher than that from the other strand). Examples of this phenomenon are shown in Figure 1B, where hundreds of tags were aligned to very narrow regions A (27 nt) and B (33 nt). Extending the tags in the 3' direction to 150 nt and taking the middle 75 nt created two separate peaks flanking A (peaks 1 and 2) and B (peaks 3 and 4), thus creating unbalanced tag counts between plus and minus strands for those peaks. It is worth noting that both regions A ('ggtctagaatggaatggaaagaatgga') and B ('atacaatcgattggaatcgaatggaatggaagg') are near the centromere and both contain the repeated sequence 'tggaa', which is a type of common sequence repeats found near human centromeres . Although the sequences of both regions A and B are uniquely mapped to the current version of human genome assembly (NCBI Build 36.1 or UCSC Hg18), it is possible that they occur in multiple copies near or within the centromeres, which may not have been yet fully sequenced and annotated. To avoid any complications arising from repeat sequences, we thus regarded the detected peaks with unbalanced tag counts as unreliable and discarded them from further analysis. This procedure removed 5% of the peaks, and the remaining 438,652 peaks were retained as confident positioned nucleosomes in CD4+ T cells.
Based on the ChIP-Seq tags for individual histone methylations, we assigned histone methylation marks to each identified nucleosome (see Methods). 99.97% of the positioned nucleosomes had one or more kinds of histone modifications, with an average of 4.4 modifications per nucleosome. Table S1 in Additional file 1 listed the numbers of modified positioned nucleosomes for each histone modification. In Table S1 in Additional file 1, we also classified the 21 ChIP-Seq histone modifications into three types: active marks (related to active genes  or active enhancers ), repressive marks (related to inactive genes  or heterochromatin ), and moderate marks (no clear preference between active and inactive genes ). The genome locations of identified positioned nucleosomes and their histone modification profiles are available at http://liulab.dfci.harvard.edu/NPS/Result/.
Sequences in nucleosome free regions have been found to be more evolutionarily conserved than nucleosome-occupied regions in both yeast genome  and human promoters . We calculated the average phastCons conservation scores  for all detected positioned nucleosomes in CD4+ T cells and their nearby linker regions. Positioned nucleosomes in genes, promoters and intergenic sequences all have lower sequence conservation than their flanking linker DNA (Figure 2C). The linkers might be conserved because they harbor cis-acting elements to interact with trans-acting factors.
DNase I hypersensitive (HS) sites are regarded as containing a mixture of regulatory cis-elements. A recent study to map DNase I HS regions in CD4+ T cell discovered that only 2% of the genome is HS, with an average size of 634 nt . We observed that 30% of our identified positioned nucleosomes were located in the DNase I HS regions. In addition, modified positioned nucleosomes were detected in 45% of all DNase HS regions, and in 99% of high confidence HS regions (DNase I HS quality score > 3, Figure S3 in Additional file 1). These observations suggest an epigenetic mechanism of modified positioned nucleosomes in establishing and maintaining functionally important regulatory regions in the human genome.
DNase I HS sites are regarded as containing a mixture of regulatory cis-elements and are thus expected to exhibit nucleosome depletion similar to trans-acting factor binding sites. We examined the nucleosome profiles around DNase I HS sites  by aligning those regions by their peak summits, and we indeed found nucleosome depletion at the DNase I HS sites, flanked by two strong nucleosome positioning peaks (Figure 3C and 3D). Furthermore, while the nucleosome occupancy on both sides of DNase I HS sites is relatively lower than that of CTCF binding sites, the flanking nucleosomes showed similar histone modification profiles as in the case of CTCF binding regions.
There have been several studies using tiling arrays to investigate nucleosome positioning in yeast and human promoters [7, 10–12]. Similarly, a recent study  revealed the promoter nucleosome architecture by separating the + and - strands of H3K4me3 and H2A.Z ChIP-Seq tags . Here we examined the average number of positioned nucleosomes in the promoters of 4,517 expressed genes (Figure 3E) and 8,906 unexpressed genes (Figure 3F), obtaining much stronger nucleosome patterns than all previous studies in humans. For both expressed and unexpressed genes, the average number of positioned nucleosomes showed strong depletion approximately 50 nt upstream of TSS, although the depletion for unexpressed genes was weaker. The nucleosome signal gradually decreased as it oscillated away from TSS, but we could still clearly delineate five peaks downstream of TSS and at least two peaks upstream.
Detailed histone modification profiles on positioned nucleosomes near TSSs revealed interesting patterns, some of which could not be observed previously without the map of nucleosome-resolution modifications (Figure 3E). First of all, despite the fact that a decreasing number of genes contained a positioned nucleosome as the distance to TSS increased, the proportion of positioned nucleosomes carrying the H3K4me3 mark was constantly high. Three active marks (H3K4me1, H3K4me2, and H3K9me1) were rarely present at +1 nucleosome, but increased both upstream and downstream. Two other active marks H4K20me1 and H2BK5me1 were absent from upstream positioned nucleosomes, but showed increasing presence downstream from TSS. Since the function of H2BK5me1 mark was previously unknown and H4K20me1 was known as a transcription elongation mark in mammalian cells , we speculate that H2BK5me1 may also be a transcription elongation mark. In contrast, the active variant H2A.Z had high occupancy upstream and decreasing presence downstream. No H3K27me1, H3K36me3, repressive, or moderate marks appeared in positioned nucleosomes near TSS.
Surprisingly, these modification patterns were almost identical between expressed and unexpressed genes (Figure 3E-F). This observation suggests that a subset of genes with undetectable transcripts not only contain positioned nucleosomes near TSS, but also carry the identical histone modifications as expressed genes. Since the elongation mark H4K20me1 was present downstream of TSS and so was the proposed elongation mark H2BK5me1, despite being slightly weaker than in expressed genes, it was likely that these genes were actually transcribed but quickly degraded.
Several studies reported that certain histone modifications tend to occur in groups [7, 9, 14, 31]. We also examined the histone modification co-occurrence patterns at nucleosome resolution, and we observed that active and repressive marks occurred mutually exclusively in the same nucleosomes (Figure S4 in Additional File 1). In order to test the robustness of our observations, we used different p-value cutoffs for assigning histone modifications. Although the number of nucleosomes with histone modifications varies, the general trend of correlation between histone modification profiles is not affected (Figure S5 in Additional File 1). We also found that adjacent nucleosomes avoided sharp transitions between active and repressive histone modifications and, instead, tended to carry the same or similar modifications (Figure S6 in Additional File 1). Our observations are supported by two recent publications [18, 32]. Wang et al. combined the histone acetylation and methylation ChIP-Seq data to observe that certain histone modification marks tend to be grouped for activating gene expression . Furthermore, Yu et al. used a Bayesian network to infer the relation between histone methylation marks and gene expression .
With the availability of commercial high-resolution tiling microarrays and high-throughput sequencing technologies, tremendous progress has been made towards understanding epigenetic regulation of gene expression. Nonetheless, studies on positioned nucleosomes with histone modification profiles at nucleosome resolution are still limited in human. We developed a signal-processing approach to identify positioned nucleosomes from existing histone methylation ChIP-Seq data and determined the histone modification states of each identified positioned nucleosome in CD4+ human T cells.
Second, ChIP-Seq tags that were not uniquely mapped to the genome were discarded, so our approach will miss some regions with segmental duplication and repeat elements. Repeat sequences currently pose serious challenges to understanding the human genome and will require further novel advances in experimental and computational techniques. Third, the ChIP-Seq data  used in this study lacked a proper control sample, which ideally should be prepared from naked genomic DNA digested with MNase, size-selected and sequenced. Such a control sample could greatly benefit our analysis and improve the accuracy of our identified nucleosomes. For example, we could use it to eliminate false positives that might arise from unannotated repetitive sequences in the genome and Solexa sequencing bias (e.g. more tags in GC-rich regions). In addition, although we computationally calculated the false discovery rate (FDR) for positioned nucleosome identification from Poisson-based p-values, we could empirically estimate a more reasonable FDR of our approach that considers potential ChIP-Seq bias by comparing the numbers of peaks identified from ChIP and control.
We developed a novel computational framework and efficiently identified over 438,000 positioned nucleosomes coupled with histone modifications in functionally important regions of the human genome. These nucleosomes are not evenly distributed across the genome: more than 75% are near genes, and the remaining ones far away from TSSs are likely around distal cis-regulatory elements such as enhancers and insulators. Our analysis of nucleosome positions and modifications near TSS, CTCF binding sites, and DNase I HS sites strongly support the dual role of nucleosomes in epigenetic regulation. First, chromatin structure at cis-regulatory elements may be controlled by histone modifications to recruit chromatin-associated proteins and maintain an open state, as supported by our analytical result that active and repressive marks do not occur together in the same or adjacent nucleosomes. Second, nucleosomes may be remodeled, exposing DNA binding sites to a variety of cellular machineries for transcription.
The work flow is divided into two major components: nucleosome position detection and histone modification assignment to positioned nucleosomes (Figure 1).
Nucleosome-level ChIP-Seq tags from all histone modifications were combined to provide enough genome coverage. Each 25 nt tag was extended to 150 nt in the 3' direction, and the middle 75 nt was used for nucleosome position detection. We then modified an approach that was applied to nucleosomes on tiling microarrays  to analyze the pre-processed data from ChIP-Seq. First, genomic regions with sparse tags (defined as fewer than 3 tags per kilobase (kb)) were discarded. The remaining regions were subjected to wavelet denoising to reduce the background noise in ChIP-Seq data. In this study, we chose coiflet4 as wavelet and scaling functions and performed decomposition at level two. Coiflet4 is suitable because its morphological characteristics are similar to the nucleosome peak shape . Denoising at level two removed most high frequency noises, and further denoising appeared to deform the true peak shapes. We selected minimization of Stein's Unbiased Risk Estimate (SURE) and soft thresholding for estimating the threshold and filtering the signal, respectively [35, 36]. SURE combined with soft thresholding is known to be suitable for denoising data containing edges (discontinuities), and matches well with our peak detection algorithm which perceives the edges as boundaries of positioned nucleosomes.
The peak detection algorithm calls peaks using LoG method, a combination of Gaussian window smoothing and Laplacian edge detection. Gaussian smoothing with an envelope size of 30 nt (which enabled us to achieve the best performance in peak detection in this study) was first adopted to further reduce the noise, then Laplacian edge detection was applied to find the inflection points of peaks in the data. The peaks detected by LoG were filtered based on the following criteria. First, the peak width should be between 80 and 250 nt. Second, the p-value of each peak was estimated as the probability that the same or larger number of tags can accumulate within the peak region by chance using Poisson approximation, and peaks with p-value greater than 10-5 were discarded, which corresponds to 0.014% of FDR estimated by using a bootstrapping-based method  (Figure S7 in Additional file 1). Finally, for each peak, the number of tags from one strand should be less than four times that from the other strand.
After positioned nucleosomes were detected as described above, the sequence tags were regrouped into different types of histone modification. The p-value of a particular histone modification at a positioned nucleosome was calculated based on the tag count of that histone modification in the nucleosome region using Poisson distribution, similar to the method mentioned above. A histone modification was assigned to the nucleosome if its tag count p-value in the region was less than 10-3.
The nucleosome-resolution (MNase digestion) ChIP-Seq Solexa tags for 20 types of histone methylation and histone variant H2A.Z as well as the ChIP-resolution (sonication) ChIP-Seq data for CTCF and Pol II in human CD4+ T cell  were downloaded from http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/hgtcell.html. The gene expression profiles of CD4+ T cell were obtained from SymAtlas . CTCF ChIP peaks were detected using MACS .
The software for positioned nucleosome identification from ChIP-Seq data is implemented in Python and publicly available at http://liulab.dfci.harvard.edu/NPS/. The algorithm also works to detect positioned nucleosomes from nucleosome sequencing experiments (without ChIP). It requires two input files: a ChIP-Seq tags file (BED format) and a user-defined parameter file (TXT format). The users can easily set their own parameters other than the ones suggested above by editing the parameter file. The output of the software is a tab-delimited format file containing the chromosomal coordinates and p-values of the detected nucleosomes. In this work, the software processed 185 million Solexa tags within five hours on a Linux computer (CPU speed: 2 GHz).
Laplacian of Gaussian
transcription start sites
false discovery rate
Stein's Unbiased Risk Estimate.
We thank Myles Brown, Kevin Struhl, Keji Zhao, Jérôme Eeckhoute, Mathieu Lupien, Dustin Schones and Wei Li for their insightful discussions and critical readings of this manuscript. The project was partially funded by NIH grant HG004069-02. JSS was supported in part by the Martin A. and Helen Chooljian Membership in Biology.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.