MMDiff: quantitative testing for shape changes in ChIP-Seq data sets
© Schweikert et al.; licensee BioMed Central Ltd. 2013
Received: 22 April 2013
Accepted: 15 November 2013
Published: 24 November 2013
Cell-specific gene expression is controlled by epigenetic modifications and transcription factor binding. While genome-wide maps for these protein-DNA interactions have become widely available, quantitative comparison of the resulting ChIP-Seq data sets remains challenging. Current approaches to detect differentially bound or modified regions are mainly borrowed from RNA-Seq data analysis, thus focusing on total counts of fragments mapped to a region, ignoring any information encoded in the shape of the peaks.
Here, we present MMDiff, a robust, broadly applicable method for detecting differences between sequence count data sets. Based on quantifying shape changes in signal profiles, it overcomes challenges imposed by the highly structured nature of the data and the paucity of replicates.
We first use a simulated data set to compare the performance of MMDiff with results obtained by four alternative methods. We demonstrate that MMDiff excels when peak profiles change between samples. We next use MMDiff to re-analyse a recent data set of the histone modification H3K4me3 elucidating the establishment of this prominent epigenomic marker. Our empirical analysis shows that the method yields reproducible results across experiments, and is able to detect functional important changes in histone modifications. To further explore the broader applicability of MMDiff, we apply it to two ENCODE data sets: one investigating the histone modification H3K27ac and one measuring the genome-wide binding of the transcription factor CTCF. In both cases, MMDiff proves to be complementary to count-based methods. In addition, we can show that MMDiff is capable of directly detecting changes of homotypic binding events at neighbouring binding sites. MMDiff is readily available as a Bioconductor package.
Our results demonstrate that higher order features of ChIP-Seq peaks carry relevant and often complementary information to total counts, and hence are important in assessing differential histone modifications and transcription factor binding. We have developed a new computational method, MMDiff, that is capable of exploring these features and therefore closes an existing gap in the analysis of ChIP-Seq data sets.
KeywordsChip-Seq Differential peak detection Kernel methods Maximum mean discrepancy Histone modifications H3K4me3 Cfp1
Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is rapidly becoming the main experimental technique in functional genomic and epigenomic studies. ChIP-Seq’s ability to profile genome-wide patterns of transcription factor binding and histone modifications has led to its extensive use by the ENCODE consortium  in an endeavour to identify and characterise all functional elements encoded in the human genome.
Despite the widespread use of ChIP-Seq, data analysis is still a challenging task  and a typical computational pipeline includes a number of steps, each posing its own difficulties. An initial crucial step is the identification of regions with significant signal enrichment relative to a control sample in a process called peak calling. Over the last years, several tools for this task have been suggested and they have recently been compared in . As a result of peak calling, genome-wide catalogues are obtained, which provide valuable snapshots of protein binding or histone modifications in a given cell or tissue.
However, to understand the dynamics of histone modifications and TF binding and their effects on cell-specific gene regulation it is necessary to quantitatively compare different ChIP-Seq samples. This is a surprisingly difficult task as the statistical assessment of differences is hindered by a number of factors: on the one hand, the data is digital, consisting of counts of DNA fragments (reads) mapped onto regions of the genome. This feature, common to all sequencing-based methods, raises the immediate issue of choosing a suitable noise model for both technical and biological noise. On the other hand, in most studies, only a very small number of replicate experiments are performed, making statistical testing an intrinsically difficult task. To compound both of these problems, ChIP-Seq produces spatially distributed patterns of binding or histone modifications localised to specific regions of the genome (peaks); this feature, in particular, renders standard differential testing methods unsuited for the comparison of ChIP-Seq data sets.
Currently, two strategies are predominantly followed for the differential analysis of ChIP-Seq data sets: The most naive approach is to identify overlaps in the sets of genomic peak intervals detected in the different samples, e.g., [4–6]. This simplifies the problem to a basic occupancy analysis which is insensitive to changes in the affinity of TF binding or in the prevalence of histone modifications. In addition, the results are strongly dependent on the thresholds which are set heuristically in the peak calling step and differences in the noise background may further confound the outcome of this analysis. An alternative strategy is to compute the total number of reads mapping to each peak in each data set and to test for significant fold-changes across multiple tissues or conditions, e.g., . These count-based approaches have mostly advocated the adaptation of methods for RNA-Seq data analysis to the more structured ChIP-Seq data. For example, the frequently used methods DBChIP  and DiffBind  are based on the RNA-Seq methods DESeq  and EdgeR . They employ a negative binomial distribution to model both biological and technical noise in the total counts of expressed genes. To circumvent the problems of low experimental replication, they apply an elegant approach in which information is shared across genes, effectively pooling together genes with similar total counts. An immediate problem arising for count-based methods is finding the right normalisation. Initially, data sets were rescaled according to the observed library size, which corresponds to the total number of reads in the whole data set [11–13]. However, it has been shown that this strategy is inadequate in most situations, and a number of alternatives have been suggested, including rescaling to the median of the ratios of observed counts [9, 14], locally weighted regression (LOWESS)  and more recently rescaling using common peaks across data sets (MANorm, ). All these methods make strong a priori assumptions about the relationship of the data sets that are to be compared. The choice of the normalisation method can therefore greatly influences the results of count-based differential analysis [14, 17, 18].
Perhaps a more severe limitation of count-based methods is the information loss inherent in representing a peak by a single integer (the total counts of reads mapping into the given peak region). Any higher order information that is conveyed in the peaks is ignored. However, a spatial structure of the ChIP-Seq signal is particularly evident in the case of peaks associated with epigenomic marks. For example, trimethylation of lysine 4 on histone H3 (H3K4me3) is known to form distinct bimodal peaks at transcription start sites (TSS), e.g. . Interestingly, at a given genomic location the shape of observed enrichment peaks tend to be highly reproducible across biological replicates and increasing evidence hints towards a functional role of these profile structures [1, 20]. Focusing exclusively on total counts of reads associated with a peak might therefore be insufficient when investigating differences of epigenomic modifications between different samples and higher order features associated with the shape of an enrichment peak should also be taken into account.
We illustrate and compare our method on a simulation study, and on three independent ChIP-seq data sets of both transcription factor binding and epigenomic modifications. Our results show that MMDiff can capture biologically meaningful changes and is highly complementary to count-based approaches. We propose that MMDiff provides an important new tool for bioinformaticians and biologists interested in epigenomic data analysis, conveniently available as a Bioconductor tool.
The rest of the paper is organised as follows: we start with a discription of the statistical foundations of our method and a discussion on how the MMD statistic of  is modified to account for biological variability. We complete the Methods section with a thorough simulation study which compares the results of our method to four different competitors in a controlled environment. This enables us to discuss the strengths and weaknesses of the various methods, and in particular highlights the complementarity of the MMDiff approach w.r.t. count-based methods. We then present results on three different data sets: we start with an in-depth analysis on the H3K4me3 data set of . As this study constitutes our main biological motivation, we present multiple complementary analyses that demonstrate the functional significance of our results. To establish the broad applicability of our method, we also present results on two ENCODE data sets: a comparison of the histone mark H3K27ac across different human cell lines (K562 and GM12878), and a comparison of binding patterns of the transcription factor CTCF across different mouse tissues (cortex, cerebellum and liver). We conclude the paper with a broader discussion of the method in the context of NGS data analysis.
Kernel-based statistical tests
In order to incorporate shape features in a statistical testing procedure, we adopt a kernel-based non-parametric test, which allows us to retain information of any order within the testing procedure [21, 22]. We briefly review here the mathematical foundations of this procedure.
The statistical testing question we wish to address is the following: Suppose for a peak l we are given observations (i.e. reads) in data set s, and observations in data set s′, , where x s and are random variables with respective probability measures p and p′, and X s and are independently and identically distributed (i.i.d.) from p and p′, respectively. Can we decide at a given significance level to reject the null hypothesis p = p′?
A modelling issue of central importance is the choice of the features and the kernel function k. In our case, we wish to preserve the spatial information contained in the peak profile. We therefore used the estimated mid positions of the mapped reads as observed features and the radial basis function (RBF) as kernel k(x,x′) = exp[-(x - x′)2/(2σ2)]. The (hyper)-parameter σ controls the length scale of the kernel, i.e. the distance (along the genome) at which fragment counts decorrelate. In our experiments, we used a heuristic suggested in  such that , where is the median distance of all observations in X s and .
Accounting for biological variability
The bootstrap procedure for computing MMD statistics proposed in  has strong theoretical guarantees for discriminating between different distributions, given sufficient number of samples (i.e. reads mapped to a peak). A simulation study shows that the procedure appears to be well calibrated when comparing technical replicates of ChIP-Seq data (see Additional file 1). However, biological variability implies that the histogram distributions of the same peak in different biological replicates will be more different than expected. This turns out to be true, and the testing procedure of  rejects the null hypothesis in almost all comparisons between biological replicates (see Additional file 1).
In order to avoid this problem, we adopt a data-driven method to estimate biological variability from biological replicates. In general, this is a difficult task, as for most experiments only very few replicates are available; for example the ENCODE consortium set a standard of two independent biological replicates per ChIP-Seq measurement . A reliable estimate of biological variability on a peak by peak basis is therefore rarely possible. To obviate this problem, we pool peaks with similar total counts to generate robust estimates of p-values (this information sharing is similar in spirit to the regression approach of DESeq, ). Specifically, for each peak l we determine the number of reads mapping to it averaged across all considered samples. Peaks are then binned into quantiles determined on the averaged counts per peak. To obtain empirical p-values we compute the probability of observing an MMD value between biological replicates in the given bin, which is at least as large as the one observed for a given peak in the comparison between conditions. Raw p-values are subsequently corrected for multiple testing using the method of Benjamini and Hochberg .
Differential peak calling on simulated data
0 + 0.0
0.7 + 0.8
100 + 0.0
9799.3 + 0.8
0 + 0.0
100 + 0
50 + 0
0.1 + 0.3
2.4 + 1.7
99.9 + 0.3
9797.6 + 1.7
0.1 + 0.3
100 + 0
50 + 0
0 + 0.0
1.4 + 1.0
100 + 0.0
9798.6 + 1.0
0 + 0
100 + 0
50 + 0
17.8 + 9.1
5.5 + 2.9
82.2 + 9.1
9794.5 + 2.9
26 + 10
17.8 + 9.1
99.9 + 0
83 + 0
34.6 + 4.1
0.7 + 0.8
65.4 + 4.1
9799.3 + 0.8
2 + 0
34.6 + 4.1
100 + 0
83 + 0
27.0 + 5.6
0.7 + 0.8
73.0 + 5.6
9799.3 + 0.8
2 + 0
27.0 + 5.6
100 + 0
81 + 0
50.1 + 3.8
2.4 + 1.7
49.9 + 3.8
9797.6 + 1.7
4 + 0
50.1 + 3.8
100 + 0
94 + 0
45.3 + 4.1
1.4 + 1.0
54.7 + 4.1
9798.6 + 1.0
3 + 0
45.3 + 4.1
100 + 0
95 + 0
2.1 + 2.1
5.5 + 2.9
97.9 + 2.1
9794.5 + 2.9
73 + 30
2.1 + 2.1
99.9 + 0
60 + 10
2.5 + 1.5
0.7 + 0.8
97.5 + 1.5
9799.3 + 0.8
2.5 + 1.5
100 + 0
70 + 0
Results and discussion
Application 1: H3K4me3 data set
We first used our method MMDiff to examine a ChIP-Seq data set investigating the epigenetic mark H3K4me3 . This study particularly focused on the question of how profiles of this mark are shaped by Cfp1, which is known to be a conserved DNA-binding subunit of the H3K4 histone methyltransferase (HMT) Set1 complex. The experiment presented consists of ChIP-Seq measurements from three different cell lines: (1) a wild-type mouse ES cell line (WT), (2) a mutant ES line lacking Cfp1 (Cfp1-/-) [30, 31], and (3) a rescue cell line obtained by stable transfection of a human Cfp1 cDNA into Cfp1-/- ES cells (Resc) [32, 33]. We expected that H3K4me3 is reduced in the Cfp1-/- cells. However, as the H3K4 specific HMT activity is redundantly encoded in at least six different complexes in mammals, the precise target regions of Cfp1 were unknown . In addition, under the assumption that the different enzymes potentially act cooperatively at the same target regions, we expected that this histone modification would not be completely abolished at these regions but rather reduced, potentially leading to altered peak profiles. In , it was confirmed that Cfp1 is expressed at near endogenous levels in the rescue cell line and that the H3K4me3 levels are comparable to the levels observed in WT. To detect changes that are primarily due to the absence of Cfp1, we will thus contrast the variability between WT and Resc with the observed changes between WT and Cfp1-/-. Effectively using the Resc sample as a biological replicate for the control group will lead to a potential over-estimation of biological variation resulting in a conservative estimate of differential H3K4me3 patterns.
Clouaire et al. repeated the complete experiment on biological replicates . The antibodies used (here abbreviated with AB.1 and AB.2) have slightly different specificities, plausibly resulting in different signal to noise ratios and the two experiments were therefore analysed independently as two repeat experiments. We report results obtained by MMDiff and compare them with results obtained using DESeq as it is the most widely used count-based methods.
We started our analysis by identifying genomic regions that are significantly enriched for H3K4me3 modifications. We used the software package MACS on each of the data sets  and subsequently created a set of 67,035 MACS consensus peaks from regions overlapping in at least three data sets. We found that only 24% of these peaks overlapped with 4kb windows around TSSs. However, around 70% of reads mapping to peaks in WT were found in these promoter proximal peaks. This is in good agreement with the fact that H3K4me3 is known to localise around transcription start sites . We conclude that in addition to the promoter proximal regions, MACS calls a large number of narrower, low coverage peaks, which are potentially more likely to be spurious. We therefore complement our analysis by investigating 27,807 promoter regions defined by known annotated genes. Note that in WT about half of these promoters show only small enrichment for H3K4me3.
To ensure comparability of the data sets, we corrected for different sampling depths using the normalisation method suggested in . For simplicity, we will refer to the normalised number of reads mapping to peak l in sample s as the total counts, . The full pre-processing pipeline is described in detail in the Additional file 1 which also contains further initial analysis demonstrating that the data sets are only weakly affected by input biases and other biases such as GC content .
Resulting ChIP-Seq signals at three promoter regions are shown in Figure 1. The shapes of the peaks are remarkably well conserved between WT and Resc and also between the two experiments, confirming our motivation to exploit shape conservation between replicates to increase the sensitivity of differential tests. In general, we see a signal decrease in the Cfp1-/- cells as compared to WT/Resc, as expected. However, these changes often appear to be highly spatially dependent: for example the profiles in Figure 1A and C are only affected downstream of the promoter. Interestingly, the profiles in Figure 1B show similar total counts in WT/Resc and Cfp1-/- ES cells, as the massive decrease in the region downstream of the promoter is partially compensated for by an increase in the upstream part of the peak. This highlights the importance of considering shape based features when testing for statistically significant differences as all three promoter regions are consistently called by MMDiff in both experiments, but none is called by DESeq in any of the experiments.
Differential peak calling
The fact that most DMPs appear to have large mean total counts may partly be due to the fact that most changes appear at promoters that are strongly enriched in H3K4me3 under normal conditions , and partly because the peaks with low total counts are more dominated by noise and do not exhibit a conserved shape between WT and Resc. While total counts are not used as a discriminating feature by MMDiff, it is also interesting to see that most DMPs exhibit a change in total counts as is elucidated in an MA-plot (Figure 3D), where fold change is plotted versus mean total counts. As expected, the great majority of DMPs lose H3K4me3 as a consequence of Cfp1 depletion.
In the absence of a ground truth it is particularly difficult to evaluate and compare the results obtained from different methods. To approach an answer to the question whether the called DMPs are genuine or false positives, we are particularly interested in two aspects: the reproducibility of the differentially called regions, and their biological significance. To test the first aspect, we run independent analyses on the data sets obtained with the two different antibodies, and report the overlap of peaks called between the two experiments. Figure 3F shows bar charts of promoters and MACS consensus peaks called by DESeq and MMDiff in the two experiments; MMDiff is seen to call more promoters than DESeq, and also to have a larger fraction of promoters called consistently in both experiments. In the case of MACS consensus peaks, the numbers of regions called consistently in both experiments appear to be relatively low for both methods (15% for DESeq and 26% for MMDiff). However, MMDiff again is more consistent across experiments than DESeq.
This analysis demonstrates that the outcome of differential peak calling can vary when experimental data sets obtained with different antibodies are considered. Also, uncertainties introduced in the peak calling step can propagate to the differential peak calling procedure. To increase both, sensitivity and specificity, it is highly advisable to increase the number of considered replicates. The analysis also shows that employing shape features as done with MMDiff can lead to improved robustness of the results.
Changes of Pol II occupancy at Cfp1 target genes
In order to assess the biological significance of the observed changes, we analysed a Pol II ChIP-Seq data set from the same Cfp1 study . We now restrict our analysis to the promoter regions, in order to avoid the ambiguous assignment of peaks to genes. Using the pipeline described above, we investigated whether there are changes in Pol II binding - and thus gene transcription - associated with the called H3K4me3 DMPs.
As previously reported, changes in Pol II binding following Cfp1 deletion appear to be modest  and only very few promoters are detected to be differentially bound by Pol II (9 and 24 called by DESeq and MMDiff, respectively). This is surprising given the widely accepted role of H3K4me3 as epigenetic mark at active promoters. A possible explanation is that at most promoters residual levels of H3K4me3 remain and these basal levels may be sufficient to partially retain Pol II binding, so that changes are difficult to detect (see Figure 4B). A remarkable exception is shown in Figure 4C where the H3K4me3 signal is completely lost at the promoter of Jade-1 which is accompanied with the complete removal of Pol II binding. We next investigated whether there was a small but systematic shift of Pol II binding associated with other H3K4me3 DMPs. Figure 4D and E show MA-plots for the Pol II data set, with DMPs determined on the H3K4me3 set shown in red, and Figure 4F shows the distribution of fold changes in Pol II binding between Cfp1-/- and WT/Resc. We see a clear down-regulation of genes associated with DMPs called by MMDiff (p < 10-10, Wilcoxon rank sum testa). In the case of DMPs called by DESeq, the distribution also has a mean significantly different from zero, but appears highly non-Gaussian. This is consistent with the finding that DESeq calls a number of small 'ectopic’ promoters which are bare of H3K4me3 in WT but gain H3K4me3 in the absence of Cfp1 which is accompanied with very low levels of Pol II binding in Cfp1-/- cells (see Figure 4A). This analysis demonstrates that differences in H3K4me3 detected by MMDiff correlate well and consistently with subtle changes of Pol II binding, lending further evidence to the high quality of MMDiff results. It also shows that the relationship between H3K4me3 modifications and Pol II binding is more complex than expected, showing highly non-linear behaviour.
Functional annotation of Cfp1 target genes
Co-occurring transcription factor binding sites
We next examined the sequence composition of promoters with H3K4me3 profile changes in order to improve our understanding of the Cfp1 binding mechanisms. We used the MEME suite to find overrepresented sequence motifs in putative Cfp1 target promoters as detected by MMDiff . Again, we used the subset of 11,458 promoters with significant H3K4me3 enrichment to create a background model (Markov model of order 6). Among the top ten discovered motifs we found four binding motifs of the activating E2F family transcription factors, E2F2 and E2F3, with p-values < 10-80 (see Figure 5C). This finding is in good agreement with recent data suggesting that the HMTs MLL2 and Set1 directly associate with E2F transcription factors [39, 40] and indirect DNA binding of Cfp1 via E2F TFs might be the explanation to why a DNA binding deficient Cfp1 mutant has been shown to be able to rescue reduced levels of H3K4me3 at most affected promoter regions . We conclude that MMDiff is a powerful tool to promote the identification of transcription factor motifs and potential co-factors which play important roles in targeting HMTs to gene promoters.
Cluster analysis of peaks
Next, we set out to identify common patterns in H3K4me3 profiles and asked whether promoters with similar profiles were also affected in a comparable way by Cfp1 depletion. This approach is motivated by the idea that different clusters encoding different shapes might reflect different binding mechanisms or different control functions. In addition, we asked if Cfp1 depletion had a homogeneous effect on all TSS sites, or if differences might depend on the shape observed in WT itself. Similar to , we performed a cluster analysis on the peak histograms derived from the WT sample, using a Gaussian Mixture Model (GMM) with covariances constrained to be diagonal in order not to overfitb. We ran GMM multiple times for different cluster numbers and used the Bayesian Information Criterion (BIC) to determine the appropriate number of clusters. We observed a minimum of BIC at k = 18 clusters, which proved to be robust against different initialisations of the algorithm, and the same minimum was found both in the WT and Resc data sets.
DMPs by cluster membership
Pol II log2 FC
Pol II log2 FC
Application 2: H3K27ac
Application 3: CTCF binding
Finally, we tested MMDiff on a ChIP-Seq data set measuring the genome-wide binding of the transcription factor CTCF. CTCF is a transcriptional repressor which also plays a fundamental role in regulating the 3-D structure of chromatin . As such, it has been widely studied in recent years, with several ChIP-Seq experiments identifying thousands of binding sites across the genome.
Here we used an ENCODE CTCF ChIP-Seq data set consisting of two replicates from three mouse tissues; cortex, cerebellum and liver . The choice of tissues was deliberately heterogeneous to check the ability of MMDiff to identify both subtle changes (as expected between cortex and cerebellum) and more marked changes between brain tissues and liver. We used the provided broadPeaks files and after merging overlapping peaks we obtained 49,762 sites for further analysis. Once again, we compared the results of MMDiff and DESeq across pairwise comparisons between tissue types: cortex vs liver (CL) and cortex vs cerebellum (CC). Using a threshold of p < 0.05 for differential peak calling, MMDiff identified 2145 differential peaks in CL and 442 in CC, with DESeq identifying 2052 in CL and 46 in CC respectively. The overlap between peaks called by the two methods is limited, with 606 peaks called by both in CL and only 15 in CC, further demonstrating the complementarity of the two methods. As expected, fewer differences were called by both methods in CC as opposed to CL; Figure 7D shows an example of a peak detected by DESeq in both CL and CC and not called by MMDiff. Again, the peak has a very similar profile in all samples while the total counts vary greatly between tissues. In contrast, two example peaks called by MMDiff and not by DESeq are shown in Figure 7E and F. In Figure 7E, CTCF seems to be bound at two distinct binding sites in the cortex and liver. However, in the cerebellum, one of these sites appears to be vacant. It is noteworthy, that this change might have been detected by count-based methods if more stringent regions around the two binding sites had been considered. These methods are therefore more depending on the peak calling and peak merging processesc. Figure 7F shows two binding sites which are less than 200bp apart. In this case, CTCF is bound at both sites in cerebellum and liver and occupies only one binding site in the cortex sample. As illustrated by this example, MMDiff is capable of directly detecting changes at homotypic binding events at neighbouring binding sites.
The R package MMDiff
These applications show that our method is generic enough to be used in the analysis of a wide range of ChIP-Seq data sets which capture other epigenomic marks or (broad) binding patterns of DNA-associated proteins. It is now available as a Bioconductor R package (package MMDiff), with complete documentation and examples. Additional updates are also available from the project webpage [http://homepages.inf.ed.ac.uk/gschweik/MMDiff.html].
ChIP-Seq is one of the most widely employed experimental techniques in functional genomic and epigenomic studies, yet statistical analysis of ChIP-Seq data still poses many challenges. In this paper, we address the problem of statistical testing in ChIP-Seq data sets, and propose a non-parametric methodology which is capable of accounting for the highly structured nature of this type of data. Compared with techniques based on total counts, MMDiff can identify localised changes which alter the shape of a peak. The identification of such changes is particularly relevant in the light of recent findings that suggest a functional significance of the shape of histone modifications. For example, an analysis of H3K27me3 patterns around CTCF peaks, carried out as part of the ENCODE project, reported that the observed asymmetric shapes of H3K27me3 support the role of CTCF sites in delimiting active and polycomb-silenced domains . Furthermore, chromatin signatures have recently been associated with other biologically relevant features such as first exon length . MMDiff’s ability to capture shape changes in peaks may therefore enable the analysts to capture functionally significant changes in patterns of histone modifications or transcription factor binding which would not be retained by methods which only use total counts for testing. From the practical point of view, focusing on peak shape largely circumvents problems arising from choosing the right normalisation, and MMDiff is also independent of the definition of a suitable noise model.
Methodologically, MMDiff belongs to the family of Kernel based methods; these have a long history in bioinformatics, and have had a considerable influence in the analysis of high throughput sequencing data. An approach which is related to ours has been recently proposed for the purpose of alternative isoform detection from RNA-Seq data . While the methodology proposed in that paper also relies on MMD, the application domain is significantly different, as is the treatment of biological noise.
In the context of ChIP-Seq data, our empirical results, both on simulations and on three independent data sets, demonstrate that our approach is complementary to count-based methods such as DESeq. A practically advisable strategy may be to couple the two methods within an analysis pipeline, allowing analysts to detect both peaks that change in shape and peaks that only exhibit changes in total counts of reads, while maintaining the overall shape of the peak. As for all statistical testing methods, it is worthwhile to emphasize that multiple biological replicates are necessary to get a reliable estimate of the biological variance.
To strengthen our claim that our approach can provide a different perspective in the analysis of ChIP-Seq data, and can be an effective tool for hypothesis generation, we have carried out an in-depth analysis of results of using MMDiff on the data presented in . We demonstrated that MMDiff reproducibly yields biologically meaningful results. We were able to suggest mechanisms that link molecular observations of altered H3K4me3 patterns to phenotypes observed in Cfp1-/- ES cells . In particular, we find that a large number of genes playing a functional role in protein synthesis are potentially targeted by Cfp1. Effects on Pol II binding - and thus potentially transcription - at each individual affected gene seem to be very small; however, taking all affected genes together, we find a significant decrease of Pol II binding at these genes which is in agreement with the observation that Cfp1-/- ES cells show a reduction in translation initiation. Furthermore, the mild effect of Cfp1 deletion on Pol II binding at most promoters is in strong contrast to the observation at the promoter of Jade-1. Here, the lack of H3K4me3 in the Cfp1 depleted cell leads to a complete abolishment of Pol II binding. In this specific case, H3K4me3 seems to act as a switch directly regulating primary transcriptional mechanisms. Jade-1 is of particular interest as it is a key player in H4 acetylation at active genes . It was earlier shown that in the presence of the human tumour suppressor proteins ING4 and ING5, Jade-1 targets the chromatin through interaction with H3K4me3 modifications . Our finding may therefore hint to an epigenomic feed-forward loop based on cross-talk between H4 acetylation and H3K4 methylation.
Our results demonstrate the potential of non-parametric kernel methods to lead to novel biological insights from the analysis of ChIP-Seq data. It is an interesting direction for further research to investigate how the structured nature of NGS data can be exploited in predictive models for more general tasks than statistical testing.
a alternative hypothesis "true location ≠ to 0".
b About half of the TSSs were discarded prior to the analysis due to the absence of H3K4me3 enrichment. Additionally, regions overlapping with more than one TSS were excluded resulting in a set of 4148 promoter regions.
c Also note, that the high spatial resolution of the peaks is achieved by showing histograms of the corrected midpoints of the reads as opposed to coverage plots. Corresponding UCSC Genome Browser views are shown in the Additional file 1.
Chromatin immunoprecipitation followed by massively parallel DNA sequencing
- ES cells:
Embryonic stem cells
Transcription start site
Trimethylation of Lysine 4 on histone 3
Histone methyltransferase, FDR: False discovery rate
Differentially modified promoters (as detected on a H3K4me3 ChIP-Seq data set)
Gaussian Mixture Model
Bayesian Information Criterion
We would like to thank Arthur Gretton and Gunnar Rätsch for helpful discussions. Shaun Webb is thanked for computing support.
GSch acknowledges support from the Wellcome Trust, from the University of Edinburgh through its iDEA lab program and from the EU FP7 Marie Curie Actions. GS is funded by European Research Council through grant MLCS306999. BC is supported by BBSRC under the Sysmo SUMO2 project.
- ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489 (7414): 57-74. 10.1038/nature11247.View ArticleGoogle Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009, 10 (10): 669-680. 10.1038/nrg2641.PubMed CentralView ArticlePubMedGoogle Scholar
- Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010, 5 (7): e11471-10.1371/journal.pone.0011471.PubMed CentralView ArticlePubMedGoogle Scholar
- Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, Talianidis I, Flicek P, Odom DT: Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010, 328 (5981): 1036-1040. 10.1126/science.1186176.PubMed CentralView ArticlePubMedGoogle Scholar
- Chikina MD, Troyanskaya OG: An effective statistical evaluation of ChIPseq dataset similarity. Bioinformatics. 2012, 28 (5): 607-613. 10.1093/bioinformatics/bts009.View ArticlePubMedGoogle Scholar
- Van Nostrand EL, Kim SK: Integrative analysis of C. elegans modENCODE ChIP-seq datasets to infer transcription factor-responsive targets and upstream regulators of differentially-expressed genes from expression profiling experiments. Genome Res. 2013, 23 (6): 941-953. 10.1101/gr.152876.112.PubMed CentralView ArticlePubMedGoogle Scholar
- Ross-Innes CS, Stark R, Teschendorff AE, Holmes KA, Ali HR, Dunning MJ, Brown GD, Gojis O, Ellis IO, Green AR, Ali S, Chin SF, Palmieri C, Caldas C, Carroll JS: Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature. 2012, 481 (7381): 389-393.PubMed CentralPubMedGoogle Scholar
- Liang K, Keles S: Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics. 2012, 28: 121-122. 10.1093/bioinformatics/btr605.PubMed CentralView ArticlePubMedGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106-10.1186/gb-2010-11-10-r106.PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.PubMed CentralView ArticlePubMedGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008, 26 (11): 1293-1300. 10.1038/nbt.1505.PubMed CentralView ArticlePubMedGoogle Scholar
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26 (12): 1351-1359. 10.1038/nbt.1508.PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11 (3): R25-10.1186/gb-2010-11-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Taslim C, Wu J, Yan P, Singer G, Parvin J, Huang T, Lin S, Huang K: Comparative study on ChIP-seq data: normalization and binding pattern characterization. Bioinformatics. 2009, 25 (18): 2334-2340. 10.1093/bioinformatics/btp384.PubMed CentralView ArticlePubMedGoogle Scholar
- Shao Z, Zhang Y, Yuan GC, Orkin SH, Waxman DJ: MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets. Genome Biol. 2012, 13 (3): R16-10.1186/gb-2012-13-3-r16.PubMed CentralView ArticlePubMedGoogle Scholar
- Lovén J, Orlando DA, Sigova AA, Lin CY, Rahl PB, Burge CB, Levens DL, Lee TI, Young RA: Revisiting global gene expression analysis. Cell. 2012, 151 (3): 476-482. 10.1016/j.cell.2012.10.012.PubMed CentralView ArticlePubMedGoogle Scholar
- Li Y, Wang H, Muffat J, Cheng AW, Orlando DA, Lovén J, Kwok SM, Feldman DA, Bateup HS, Gao Q, Hockemeyer D, Mitalipova M, Lewis CA, Vander Heiden MG, Sur M, Young RA, Jaenisch R: Global transcriptional and translational repression in human-embryonic-stem-cell-derived Rett Syndrome Neurons. Cell Stem Cell. 2013, 13 (4): 446-458. 10.1016/j.stem.2013.09.001.PubMed CentralView ArticlePubMedGoogle Scholar
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129 (4): 823-837. 10.1016/j.cell.2007.05.009.View ArticlePubMedGoogle Scholar
- Bieberstein NI, Oesterreich FC, Straube K, Neugebauer KM: First exon length controls active chromatin signatures and transcription. Cell Rep. 2012, 2: 62-68. 10.1016/j.celrep.2012.05.019.View ArticlePubMedGoogle Scholar
- Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola AJ: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference. Edited by: Schölkopf B, Platt JC, Hoffman T. 2006, Cambridge, MA, USA: MIT Press, 513-520.Google Scholar
- Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola AJ: A kernel two-sample test. J Mach Learn Res. 2012, 13: 723-773.Google Scholar
- Drewe P, Stegle O, Hartmann L, Kahles A, Bohnert R, Wachter A, Borgwardt K, Rätsch G: Accurate detection of differential RNA processing. Nucleic Acids Res. 2013, 41 (10): 5189-5198. 10.1093/nar/gkt211.PubMed CentralView ArticlePubMedGoogle Scholar
- Clouaire T, Webb S, Skene P, Illingworth R, Kerr A, Andrews R, Lee JH, Skalnik D, Bird A: Cfp1 integrates both CpG content and gene activity for accurate H3K4me3 deposition in embryonic stem cells. Genes Dev. 2012, 26 (15): 1714-1728. 10.1101/gad.194209.112.PubMed CentralView ArticlePubMedGoogle Scholar
- Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, Chen Y, DeSalvo G, Epstein C, Fisher-Aylor KI, Euskirchen G, Gerstein M, Gertz J, Hartemink AJ, Hoffman MM, Iyer VR, Jung YL, Karmakar S, Kellis M, Kharchenko PV, Li Q, Liu T, Liu XS, Ma L, Milosavljevic A, Myers RM, et al: ChIP-seq guidelines and practices of the, ENCODE and modENCODE consortia. Genome Res. 2012, 22 (9): 1813-1831. 10.1101/gr.136184.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing; journal of the royal statistical society. Series A, statistics in society. J R Stat Soc Series A, Stat Soc. 1995, 57: 289-Google Scholar
- Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23 (21): 2881-2817. 10.1093/bioinformatics/btm453.View ArticlePubMedGoogle Scholar
- ENCODE ChIP-Seq Data Describing Transcription Factor Binding. [http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeLicrTfbs/],
- Zhao X, Valen E, Parker BJ, Sandelin A: Systematic clustering of transcription start site landscapes. PLoS One. 2011, 6 (8): e23409-10.1371/journal.pone.0023409.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee JH, Skalnik DG: CpG-binding protein (CXXC finger protein 1) is a component of the mammalian Set1 histone H3-Lys4 methyltransferase complex, the analogue of the yeast Set1/COMPASS complex. J Biol Chem. 2005, 280 (50): 41725-41731. 10.1074/jbc.M508312200.View ArticlePubMedGoogle Scholar
- Thomson JP, Skene PJ, Selfridge J, Clouaire T, Guy J, Webb S, Kerr ARW, Deaton A, Andrews R, James KD, Turner DJ, Illingworth R, Bird A: CpG islands influence chromatin structure via the CpG-binding protein Cfp1. Nature. 2010, 464 (7291): 1082-1086. 10.1038/nature08924.PubMed CentralView ArticlePubMedGoogle Scholar
- Carlone DL, Lee JH, Young SRL, Dobrota E, Butler JS, Ruiz J, Skalnik DG: Reduced genomic cytosine methylation and defective cellular differentiation in embryonic stem cells lacking CpG binding protein. Mol Cell Biol. 2005, 25 (12): 4881-4891. 10.1128/MCB.25.12.4881-4891.2005.PubMed CentralView ArticlePubMedGoogle Scholar
- Tate CM, Lee JH, Skalnik DG: CXXC finger protein 1 contains redundant functional domains that support embryonic stem cell cytosine methylation, histone methylation, and differentiation. Mol Cell Biol. 2009, 29 (14): 3817-3831. 10.1128/MCB.00243-09.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruthenburg AJ, Allis CD, Wysocka J: Methylation of lysine 4 on histone H3: intricacy of writing and reading a single epigenetic mark. Mol Cell. 2007, 25: 15-30. 10.1016/j.molcel.2006.12.014.View ArticlePubMedGoogle Scholar
- Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012, 13 (2): 204-216. 10.1093/biostatistics/kxr054.PubMed CentralView ArticlePubMedGoogle Scholar
- Bauer S, Grossmann S, Vingron M, Robinson PN: Ontologizer 2.0–a multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics. 2008, 24 (14): 1650-1651. 10.1093/bioinformatics/btn250.View ArticlePubMedGoogle Scholar
- Butler JS, Palam LR, Tate CM, Sanford JR, Wek RC, Skalnik DG: DNA Methyltransferase protein synthesis is reduced in CXXC finger protein 1-deficient embryonic stem cells. DNA Cell Biol. 2009, 28 (5): 223-231. 10.1089/dna.2009.0854.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009, 37 (Web Server issue): W202-W208.PubMed CentralView ArticlePubMedGoogle Scholar
- Schuettengruber B, Martinez AM, Iovino N, Cavalli G: Trithorax group proteins: switching genes on and keeping them active. Nat Rev Mol Cell Biol. 2011, 12 (12): 799-814. 10.1038/nrm3230.View ArticlePubMedGoogle Scholar
- Tyagi S, Chabes AL, Wysocka J, Herr W: E2F activation of S phase promoters via association with HCF-1 and the MLL family of histone H3K4 methyltransferases. Mol Cell. 2007, 27: 107-119. 10.1016/j.molcel.2007.05.030.View ArticlePubMedGoogle Scholar
- Shulha HP, Cheung I, Whittle C, Wang J, Virgil D, Lin CL, Guo Y, Lessard A, Akbarian S, Weng Z: Epigenetic signatures of autism: Trimethylated H3K4 landscapes in prefrontal neurons. Arch Gen Psychiatry. 2012, 69 (3): 314-324. 10.1001/archgenpsychiatry.2011.151.View ArticlePubMedGoogle Scholar
- Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA, Boyer LA, Young RA, Jaenisch R: Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci USA. 2010, 107 (50): 21931-21936. 10.1073/pnas.1016071107.PubMed CentralView ArticlePubMedGoogle Scholar
- ENCODE ChIP-Seq data describing histone modifications. [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/],
- Botta M, Haider S, Leung IXY, Lio P, Mozziconacci J: Intra- and inter-chromosomal interactions correlate with CTCF binding genome wide. Mol Syst Biol. 2010, 6: 426-PubMed CentralView ArticlePubMedGoogle Scholar
- Foy RL, Song IY, Chitalia VC, Cohen HT, Saksouk N, Cayrou C, Vaziri C, Côté J, Panchenko MV: Role of Jade-1 in the histone acetyltransferase (HAT) HBO1 complex. J Biol Chem. 2008, 283 (43): 28817-28826. 10.1074/jbc.M801407200.PubMed CentralView ArticlePubMedGoogle Scholar
- Saksouk N, Avvakumov N, Champagne KS, Hung T, Doyon Y, Cayrou C, Paquet E, Ullah M, Landry AJ, Côté V, Yang XJ, Gozani O, Kutateladze TG, Côté J: HBO1 HAT complexes target chromatin throughout gene coding regions via multiple PHD finger interactions with histone H3 tail. Mol Cell. 2009, 33 (2): 257-265. 10.1016/j.molcel.2009.01.007.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.