In this study, the performance of MeDiChISeq was evaluated with a large number of publicly available ChIP-seq datasets. These include the TFs SRF, MAX and NRSF [17] and the sequence-specific insulator protein CTCF [18], all of which present sharp peaks in their ChIP-seq profiles. Broad patterns characteristic of some histone modifications, such as histone H3 lysine 4 trimethylation (H3K4me3) and lysine 9 (H3K9ac) or lysine 27 (H3K27ac) acetylation, were also included using published datasets [18]. Importantly, MeDiChISeq's performance was compared to that of three other peak callers, which are representative of the different methodologies implemented over the years: MACS models the background according to a Poisson distribution [7], BayesPeak takes advantage of a fully Bayesian hidden Markov model to identify binding events [10], and PeakRanger applies, in addition to background modelling, a second-round "summit-valley-alternator" algorithm to scan for significant summits [12]. The parameters with which each peak caller was run are provided in Additional file 7. As illustrated in Figure 1A, all four peak callers predict a variable number of significant peaks when default confidence thresholds are used (MACS: p-value < 1×10⁻⁵; BayesPeak: posterior probability or PP > 0.5; PeakRanger: p-value < 1×10⁻⁴, FDR < 0.01; MeDiChISeq: no confidence cutoff; instead, the number of peaks complying with the kernel fit is reported), suggesting a priori that default parameters may have to be optimized for comparative studies (see also [19]). In general, we observed that PeakRanger and MACS display a more conservative behaviour than MeDiChISeq and BayesPeak when comparing the total number of predicted peaks. Note that the number of MeDiChISeq sites corresponds to those complying with the kernel fitting and has not been filtered by any other threshold criterion. Even more importantly, differences in the number of sites identified by each peak caller are also observed between biological replicates, likely reflecting inherent differences in the characteristics of each of these datasets. Note that in the present study we considered only ChIP-seq profiles that were published as biological replicates.
To compare commonly identified sites we used the predicted peak summits ±50 nt of flanking sequence; as BayesPeak does not specify summits, the centre of the predicted peak was used instead. This comparison revealed that MeDiChISeq identified the majority of sites predicted by the other methods (Figure 1B, Additional file 8). Notably, when comparing the fraction of peaks shared among peak callers, MeDiChISeq performs best for both sharp and broad binding patterns (CTCF and H3K4me3), while most of the other peak callers present significantly lower fractions of shared peaks, as seen for H3K4me3 (Figure 1B). This observation is consistent with the high number of sites identified by MeDiChISeq relative to the other peak callers, which results from MeDiChISeq's efficient deconvolution. In fact, as illustrated in Figure 1C, MeDiChISeq annotated eight distinct loci of H3K4me3 chromatin modification where the other peak callers identified only one, two or three sites. We noted that these differences in the deconvolution potential of the peak callers were less pronounced for sharp binding patterns (Figure 1C, left panel).
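The summit-based comparison criterion can be sketched as follows. This is a minimal Python sketch, assuming each caller's output has been reduced to tab-delimited chromosome/summit records; file names and column indices are illustrative, not the actual pipeline used here.

```python
# Sketch of the +/-50 nt summit overlap comparison described above.
# For BayesPeak, the centre of the predicted peak would be supplied in
# place of a summit. File layout (chrom in column 0, summit in column 1)
# is an assumption for illustration.
from collections import defaultdict
import bisect

FLANK = 50  # summits are extended by +/-50 nt before intersection

def load_summits(path, summit_col=1):
    """Read a tab-delimited file into {chrom: sorted list of summit positions}."""
    summits = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            summits[fields[0]].append(int(fields[summit_col]))
    for chrom in summits:
        summits[chrom].sort()
    return summits

def shared_fraction(query, reference, flank=FLANK):
    """Fraction of query summits lying within +/-flank of a reference summit."""
    shared = total = 0
    for chrom, positions in query.items():
        ref = reference.get(chrom, [])
        for pos in positions:
            total += 1
            i = bisect.bisect_left(ref, pos - flank)
            if i < len(ref) and ref[i] <= pos + flank:
                shared += 1
    return shared / total if total else 0.0

# Example (hypothetical file names):
# macs = load_summits("macs_summits.txt")
# medichi = load_summits("medichiseq_peaks.txt")
# print(shared_fraction(macs, medichi))
```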
MeDiChISeq’s sensitivity evaluated by its performance in reproducibility assays
Figure 1 illustrates that a comparison of peak caller performance under default parameters is unsatisfactory. In fact, default confidence thresholds that are too relaxed will increase the number of false positive calls, while overly stringent conditions will produce false negatives. To circumvent this problem, peak caller performance can be evaluated in the context of reproducibility assays by comparing binding site predictions for biological replicates and ranking them according to confidence descriptors. The underlying assumption is that true binding sites will be retrieved in both replicate datasets within a similar confidence ranking, while low-confidence peaks, which are expected to also contain false positives, will show lower consistency in the reproducibility assay. The consistency between ranked peak confidence descriptors was previously formalized based on a copula mixture model, which estimates the probability that each pair of peaks is reproducible. This probability was described as the "irreproducible discovery rate" (IDR) [20] and has been used by the ENCODE consortium to identify the transition from signal to noise when evaluating peak callers' binding site predictions [21].
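As an illustration of the rank-consistency idea only (not the full copula mixture model of [20]), the following Python sketch shows how peaks matched across replicates are reduced to paired confidence ranks, the input from which per-peak IDR values are then estimated; the input lists and identifiers are hypothetical.

```python
# Reduce two replicate peak lists to paired ranks, the preprocessing
# step of an IDR analysis. Inputs are hypothetical lists of
# (peak_id, confidence_score), where peak_id identifies peaks already
# matched across replicates (e.g., by summit proximity).

def rank_by_score(peaks):
    """Map peak_id -> rank (1 = most confident)."""
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    return {pid: rank for rank, (pid, _) in enumerate(ordered, start=1)}

def paired_ranks(rep1, rep2):
    """Ranks of peaks found in both replicates; input to the IDR model."""
    r1, r2 = rank_by_score(rep1), rank_by_score(rep2)
    common = r1.keys() & r2.keys()
    return [(r1[pid], r2[pid]) for pid in sorted(common)]

# True binding sites should yield concordant (rank1, rank2) pairs near
# the top of both lists; discordant pairs accumulate in the noise
# component that the copula mixture model isolates.
```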
Here we have compared peak caller performance in the context of reproducibility across replicate ChIP-seq datasets. Importantly, MeDiChISeq showed the highest number of reproducible peaks in the CTCF and NRSF ChIP-seq datasets (Figure 2A). Also for broader patterns like H3K27ac and H3K4me3, MeDiChISeq identified the highest number of reproducible peaks at acceptable IDR thresholds (e.g., 0.1, i.e., 90% reproducible discovery). Note that the IDR progression curve for the histone modification mark H3K4me3 continues to increase rather slowly above this threshold, suggesting retrieval of a substantial number of irreproducible events among the significant top-ranked peaks in the replicate dataset. That the other peak callers identify fewer than 5,000 H3K4me3 peaks with IDR levels below 10% supports the view that for broader binding patterns the assessment of IDRs by standard approaches becomes suboptimal [20].
An important limitation of the above analysis is the potential variability between the compared biological replicate datasets, as technical differences between the profiles may exist (e.g., differences in sequencing depth). To circumvent this limitation, we treated the predictions of two peak callers on the same dataset as "virtual replicates" for IDR analyses of a number of individual ChIP-seq datasets (Figure 2B). We thus asked whether two peak callers identify binding events/marks in the same ChIP-seq dataset with similar confidence (i.e., whether a top-ranked peak of peak caller A is also top-ranked by peak caller B). This novel type of comparison demonstrated that MeDiChISeq identifies higher numbers of reproducible peaks than the other methods. In fact, in the case of the CTCF datasets, the MeDiChISeq-MACS combination performed best for the first replicate, while MeDiChISeq-PeakRanger performed best for the second (Figure 2B). Importantly, evaluation of the H3K27ac and H3K4me3 ChIP-seq datasets by this approach revealed large differences in the reproducibility performance of the peak callers. PeakRanger and BayesPeak systematically performed worst, while MeDiChISeq versus any other peak caller gave the best scores in either biological replicate. Note that the particular IDR pattern observed for H3K4me3 in the inter-replicate comparison (right panel in Figure 2A) was not seen when the inter-peak caller performance for each replicate dataset was compared (right panels in Figure 2B), suggesting that it results from significant divergence between the biological replicate datasets. Overall, these analyses showed that MeDiChISeq systematically identified the most reproducible events among biological replicates and peak caller annotations, revealing the high sensitivity and reliability of this approach.
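A minimal sketch of how such "virtual replicates" can be assembled, assuming each caller's output has been reduced to (chromosome, summit, score) records and reusing the same ±50 nt matching window as above; the data structures are illustrative.

```python
# Sketch of the "virtual replicate" construction: peaks called by two
# different peak callers on the SAME dataset are matched by summit
# proximity, and the matched score pairs are then used in place of
# replicate scores in a standard IDR analysis.
import bisect

def match_calls(calls_a, calls_b, flank=50):
    """calls_*: lists of (chrom, summit, score); returns matched score pairs."""
    by_chrom = {}
    for chrom, summit, score in calls_b:
        by_chrom.setdefault(chrom, []).append((summit, score))
    for chrom in by_chrom:
        by_chrom[chrom].sort()
    pairs = []
    for chrom, summit, score in calls_a:
        ref = by_chrom.get(chrom, [])
        i = bisect.bisect_left(ref, (summit - flank, float("-inf")))
        if i < len(ref) and ref[i][0] <= summit + flank:
            pairs.append((score, ref[i][1]))  # (score_A, score_B) of the pair
    return pairs

# Concordant high (score_A, score_B) pairs indicate calls both tools
# rank as confident; discordant pairs inflate the IDR, as for true
# replicates.
```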
MeDiChISeq’s specificity in the context of curated benchmark datasets
In addition to identifying the highest number of true binding events (sensitivity), a good peak caller is expected to produce the lowest number of false positives (specificity). As indicated above, IDR studies are expected to identify the transition from signal to noise when evaluating the reproducibility of peak callers' binding site predictions. Accordingly, the highest number of significant binding sites at the lowest IDR, as observed for MeDiChISeq (Figure 2), reflects a high degree of both sensitivity and specificity, at least in the context of reproducible binding site discovery in biological replicates or when comparing the performance of different peak callers per ChIP-seq dataset.
Previous studies have evaluated the capacity of peak callers to distinguish false positives from "true" binding sites by using a manually curated collection of binding regions (and "false" enrichment sites) that covers the typical variation in peak size and appearance for the TFs NRSF, SRF and MAX [14, 17]. We have evaluated MeDiChISeq in the context of this benchmark, demonstrating in all three cases a high percentage of true binding site recovery (>80%) and a low false discovery rate (Figure 3). It is worth mentioning that, while its overall FDR performance is similar to that of MACS, MeDiChISeq generally retrieves more true binding sites. Furthermore, while using a background control dataset affected the false discovery rate of all other evaluated peak callers, MeDiChISeq performed equally well in identifying true binding events in the presence and absence of this control. This is most likely because binding site identification relies on a pre-computed kernel and is thus less affected by artifactual enrichment events. This behaviour is well illustrated by the NRSF datasets: in the absence of a background control dataset, MACS and MeDiChISeq both reach a maximal recovery of 90%, but at high FDR levels (>0.5). Importantly, incorporating the background control dataset into the analysis reduces the FDR levels, but for MACS the percentage of true site recovery is also compromised (less than 80%), while MeDiChISeq maintains recovery levels of up to 90% even under these conditions (FDR < 0.4).
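The recovery and FDR metrics of such a benchmark can be sketched as follows; midpoint-based interval matching is a simplification for illustration, and the actual coordinates would come from the curated benchmark files of [14, 17].

```python
# Sketch of the benchmark evaluation: given predicted peaks and curated
# sets of "true" and "false" regions, compute the percentage of true
# sites recovered and the FDR among predictions that hit a curated
# region. Predictions hitting neither curated set are ignored, as is
# typical for this kind of benchmark.

def evaluate(predictions, true_regions, false_regions):
    """predictions: [(chrom, start, end)]; regions: sets of (chrom, start, end)."""
    def hit(pred, regions):
        chrom, start, end = pred
        mid = (start + end) // 2  # midpoint containment as a proxy for overlap
        return any(c == chrom and s <= mid <= e for c, s, e in regions)

    tp = sum(1 for p in predictions if hit(p, true_regions))
    fp = sum(1 for p in predictions if hit(p, false_regions))
    recovered = {r for r in true_regions
                 if any(hit(p, {r}) for p in predictions)}
    recovery = 100.0 * len(recovered) / len(true_regions)
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return recovery, fdr
```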
Peak caller performance relative to sequencing depth
Rapid technological progress in the field of massively parallel sequencing has provided, over the years, platforms with continuously increasing sequencing depth. In fact, while the first versions of the Illumina Genome Analyzer had sequencing capacities in the range of several million reads, the latest HiSeq2000/2500 instruments provide more than 3 billion reads per flow cell. Following this continuous progress, the number of sequenced reads used per ChIP-seq assay has increased considerably: while early ChIP-seq assays generated <4 million total mapped reads (TMRs), current datasets comprise >20 million TMRs. Importantly, increasing the sequencing depth also increased the number of discovered binding sites [5, 22, 23], but only a few studies have evaluated peak caller performance under varying sequencing depths. Obviously, increasing the sequencing depth will increase both the signal and the noise levels, which could potentially affect peak caller performance.
To address this question, we created a high-density ChIP-seq dataset by combining the datasets of the two biological CTCF replicates. This meta-profile comprised >36 M TMRs and was used for profile reconstruction from subsets generated by random sampling (80%, 60%, 40%, 20%, 10% and 5%; Additional file 9A). To perform IDR evaluation, pseudo replicates were produced by two independent random samplings. As expected, the CTCF profile reconstructed from <2 M TMRs had unacceptably high IDR levels (Figure 4). Under this condition MeDiChISeq and PeakRanger performed worst, followed by MACS and BayesPeak. This is readily explained by the fact that both MeDiChISeq and PeakRanger evaluate peak shapes, which are highly variable at low TMR levels (see pseudo replicates at 1.8 M TMRs in Additional file 9A). Importantly, as TMR levels increase, peak shapes consolidate and the performance of MeDiChISeq improves accordingly (Figure 4A). Indeed, above 14 M TMRs MeDiChISeq outperforms all other peak callers with respect to the number of significant peaks at the lowest IDR levels. Note that above this TMR level all peak callers tend to reach saturation, that is, the sequencing depth beyond which the number of newly discovered sites (here, reproducibly discovered sites) reaches a constant level (between 30,000 and 40,000 sites at IDR levels below 0.1; Figure 4B).
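A minimal sketch of this sampling scheme, assuming the pooled reads are available in memory as a list of mapped-read records; seeds and variable names are illustrative.

```python
# Build the depth series and pseudo replicates: reads are drawn without
# replacement at the indicated fractions, and two independent draws at
# the same fraction serve as one pseudo-replicate pair for IDR analysis.
import random

FRACTIONS = [0.80, 0.60, 0.40, 0.20, 0.10, 0.05]

def sample_reads(reads, fraction, seed):
    """One random subset of the pooled meta-profile reads."""
    rng = random.Random(seed)
    k = int(len(reads) * fraction)
    return rng.sample(reads, k)

def pseudo_replicates(reads, fraction):
    """Two independent draws at the same depth."""
    return (sample_reads(reads, fraction, seed=1),
            sample_reads(reads, fraction, seed=2))

# for frac in FRACTIONS:
#     rep1, rep2 = pseudo_replicates(pooled_reads, frac)
#     ...call peaks on each subset, then run IDR on the two peak lists
```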
This comparative study clearly demonstrates a direct correlation between TMR size and the degree of reproducible peak identification by all of the compared peak callers. In addition, it shows that MeDiChISeq, and to a certain degree also BayesPeak, tends to retrieve more reproducible binding sites than MACS and PeakRanger. This could be a direct consequence of the different sensitivities and/or specificities of the peak callers under these conditions. To assess this issue we evaluated the extent of overlap between the sites retrieved by the different methods and those retrieved by MeDiChISeq. This analysis demonstrated that MeDiChISeq retrieved >88% of the binding sites identified by the other methods, but predicted an additional >10,000 specific sites (Additional file 9B).
These additional sites may originate from the stringent comparison conditions used (summit overlap within ±50 nt). Indeed, a comparison between MACS and MeDiChISeq revealed that >4,000 of the 15,000 MeDiChISeq-specific sites overlapped with MACS calls when the peak comparison settings were relaxed (Additional file 10). The remaining 10,000 binding sites that did not appear in the MACS-predicted site list were further evaluated for the presence of the previously described CTCF motif. Importantly, more than 4,000 of these MeDiChISeq-specific sites (40%) contained a CTCF motif, strongly suggesting that this population corresponds to bona fide CTCF binding sites that were missed by MACS (Figure 4C). Of note, PeakRanger and/or BayesPeak identified nearly 3,000 of these bona fide CTCF binding sites (Additional file 10C, right panel).
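This motif check can be sketched as follows. The 14-mer CTCF core consensus below (CCGCGNGGNGGCAG, as reported by Kim et al.) stands in for the published position weight matrix used in the actual analysis, and site sequences are assumed to be provided as plain strings.

```python
# Scan MeDiChISeq-specific site sequences for a CTCF consensus on both
# strands; the consensus string is illustrative of the published motif.
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus))

def has_motif(sequence, consensus="CCGCGNGGNGGCAG"):
    """True if the consensus matches the sequence on either strand."""
    seq = sequence.upper()
    pattern = consensus_to_regex(consensus)
    revcomp = seq.translate(COMPLEMENT)[::-1]
    return bool(pattern.search(seq) or pattern.search(revcomp))

# fraction_with_motif = (sum(has_motif(s) for s in site_sequences)
#                        / len(site_sequences))
```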