The MeDiChISeq ChIP-seq deconvolution package

Background Chromatin immunoprecipitation combined with massive parallel sequencing (ChIP-seq) is widely used to study protein-chromatin interactions or chromatin modifications at genome-wide level. Sequence reads that accumulate locally at the genome (peaks) reveal loci of selectively modified chromatin or specific sites of chromatin-binding factors. Computational approaches (peak callers) have been developed to identify the global pattern of these sites, most of which assess the deviation from background by applying distribution statistics. Results We have implemented MeDiChISeq, a regression-based approach, which - by following a learning process - defines a representative binding pattern from the investigated ChIP-seq dataset. Using this model MeDiChISeq identifies significant genome-wide patterns of chromatin-bound factors or chromatin modification. MeDiChISeq has been validated for various publicly available ChIP-seq datasets and extensively compared with other peak callers. Conclusions MeDiChI-Seq has a high resolution when identifying binding events, a high degree of peak-assessment reproducibility in biological replicates, a low level of false calls and a high true discovery rate when evaluated in the context of gold-standard benchmark datasets. Importantly, this approach can be applied not only to ‘sharp’ binding patterns - like those retrieved for transcription factors (TFs) - but also to the broad binding patterns seen for several histone modifications. Notably, we show that at high sequencing depths, MeDiChISeq outperforms other algorithms due to its powerful peak shape recognition capacity which facilitates discerning significant binding events from spurious background enrichment patterns that are enhanced with increased sequencing depths. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-14-834) contains supplementary material, which is available to authorized users.

package for parallel computation and packages lars, quadprog and corpcor are required.MeDiChISeq can be loaded as follows: > library(MeDiChISeq)

Input data
The example data set used in this tutorial has been originally published by Ernst et al. (2011) and can be downloaded from the public repository GEO (GSM646314, GSM646332, GSM646424 and GSM646430) and from MeDiChiSeq website together with the peak calling results.We illustrate how to annotate peaks in ChIP-seq profiles such as CTCF, which presents sharp binding patterns, and H3K4me3 that is characterized by broader binding islands.
While in the examples illustrated in this vignette we have used BED format files as input datasets, MeDiChISeq can process mapped read files in following formats: SAM, BAM, BOWTIE, SOAP.

Parallelization
Due to the important computation requirements during the linear regression fitting and the bootstrapping large datasets might require several hours of processing.For this reason both fit.peak.profile.seqand deconv.entire.genome.seqfunctions can be run following a multicore mode.The parallelization schema is based on mclapply function from parallel package, thus users can specify the number of processors in use with the nr.cores parameter (see below).

Clonal reads
It can happen that the ChIP-seq datasets may contain an important fraction of sequenced reads aligned to the exactly same genomic position.Such so called clonal reads are usually considered as artefacts resulting from the PCR amplification step retrieved in most of the library construction protocols.If this is a frequent phenomena (i.e. the fraction of clonal reads is significant relative to the total population) they can have a negative influence on the shape of peaks.That is why we suggest to remove them using the parameter remove.clonal.reads=TRUE,which abrogates the clonal population, but keeps a copy of them defined by the parameter clonal.reads.to.keep.

Learning a representative binding pattern
In the first step MeDiChISeq defines a representative binding pattern from the provided ChIP-Seq dataset.In fact, this step is justified by the fact that several technical aspects in the generation of a given ChIP-seq dataset, among them the chromatin sonication, but also the inherent nature of the factor under study (i.e.sharp versus broad binding pattern profiles) may directly influence the shape of peaks populating the ChIP-seq profile.
Briefly, this is performed by fitting a binding pattern model, described here as kernel, to a reduced number of genomic regions, for instance those retrieved in a single chromosome.The formalism of this procedure is extensively described in Reiss et al. (2007).Such kernel is used in a second step of MeDiChISeq work flow for identifying binding events genome wide.
The shape of binding sites is learned with fit.peak.profile.seqfunction.Starting with a default kernel, peaks are iteratively annotated and refitted and kernel parameters are adjusted in order to find the modeled shape that will fit the best to the observed one.We assume that in general high intensity peaks represent true binding sites, which is why refitting and learning of new parameters takes place only in such regions.The number of regions to consider can be defined with n.peaks and n.skip parameters.Note that the threshold (quant.cutoff= "q1e-7") for annotating putative binding regions can be more stringent than in deconv.entire.genome.seq(quant.cutoff= "q1e-5") because we are interested in annotating a fraction of binding sites for learning the representative binding pattern.
Currently, parameters that are optimized include the shape and scale of the Gamma function used to asses the fragment length distribution.The choice of starting parameters can influence the final fitting, thus it is important to choose the one that could be close to the expected final results.After performing multiple fittings on different profiles we suggest to use start.pars= c(shape = 10, scale = 30) which correspond to the average DNA fragment length of 300 bp.
In the example below short reads are first elongated by reads.elong=150bp, then the kernel fitting is performed.This elongation parameter should correspond more or less to the sonication fragment length.If we have no feeling about value to set up, we can define reads.elong=NULL,then the elongation distance will be estimated from forward and reverse read profiles.This is performed as following: kernel is fitted separately to the forward and reversed intensity profiles, which are created from reads of length reads.length.Then reads.elongequals to estimated.fragment.lengthF orward + estimated.fragment.lengthReverse − reads.length.
Nevetheless we suggest to use the first approach since it is faster and gives optimal results for most of the evaluated profiles.
This is an example of code that will find the kernel fitting to the CTCF profile, figure 1.
• frag.length:estimated fragment length.From the kernel model, the base of peaks corresponds to 2 * frag.length.
Moreover, if kernel=TRUE, verbose.console=TRUE,keep.wigs=TRUE,than in the directory output.dirall the above outputs will be saved together with generated WIG file, console display of work progress and the PDF files with progress of kernel fitting and final kernel plots.

Whole genome deconvolution
In this part, we will deconvolve a small chunk of chromosome 19 from CTCF data set.In general we can specify any genome piece of interest with parameters chrom.list,limL, limU, potential.peaks.Such region is divided into overlapping windows which are further deconvolved with adjusted version of chip.deconvfunction from MeDiChI package.For the whole genome deconvolution set chrom.list=NULL, limL=0, limU=Inf, potential.peaks=NULL,which is the default.
When required the potential.peaksparameter makes reference to an output in BED format from any other peak caller, for instance MACS, whose results would be validated using MeDiChISeq.Alternatively, it could be used to target MeDiChISeq analysis to different chromatin regions, for instance promoter loci, by providing their coordinates in BED format.
Remember to set up reads.elong to the value that had been used for elongation when the chosen kernel was produced by fit.peak.profile.seqfunction.
When nr.boots ≥ 2, for each peak global and local p-values are calculated.Using local.windowsparameter define the number and the size of areas surrounding each peak that we consider in calculations of local confidence.Final p-values are a combination of all local and global p-values.
In this example we also introduce the control sample which is deconvolved in parallel with IP and used for correcting the confidence of peaks in case when the control sample presents enriched patterns in the same chromatin loci.Parameter Control.corr.paramdefines the minimal percentage of overlap between peaks from IP and control to consider them in the same loci and apply correction.
When verbose.console=TRUEwe can follow progress of deconvolution.
> out.CTCF <-read.This object is saved in output.dir.Moreover you can find there a BED file with annotated peaks and their intensities and WIG files which can be loaded to any genome browser for visual verification of obtained results.In order to facilitate the choice of p-values cutoff a plot of intensities versus p-values is generated, for more details go to section 5.The console output of deconv.entire.genome.seq is also saved.

Single window deconvolution
Here we present the use of the core function chip.deconv.seqwhich deconvolves the subset of ChIP-seq intensity profile and allows to get a visual representation of deconvolution in a given window, see figure 2. This function and its adjusted version are the slight modifications of chip.deconvdesigned by (Reiss et al., 2007).
chip.deconv.seqworks over any WIG intensity file that can be generated with write.wigs.parallelfunction.
Deconvolution of one (centered in 461582) of the windows produced by deconv.entire.genome.seqfor CTCF -sharp pattern profile.
> data <-read.table("chr19_IP_GSM646314_GM12878_CTCF_rep1_res-10_dist-150_both.wig",+ skip=2) > kernel <-read.table("MeDiChISeq_CTCF_kernel.One can also use scatterplots of p-values versus intensity(figure 3) in order to see the distribution of those values in a given profile and choose an adequate cutoff.In the illustrated scatterplot points, which correspond to annotated binding sites, follow an S-shape distribution.This distribution is a consequence of the presence of three kinds of peaks: those with low confidence that have intensity comparable with the background; sites presenting a correlative increase between intensity and confidence and a small subset of peaks for which the confidence gets constant level though their intensity increase.Based on such scatterplots we suggest to remove the low confidence population by selecting a -log10(p-value) threshold in the range of 2 − 3.

Figure 1 :
Figure 1: Example of highest intensity peaks used to learn the kernel shape and the final fitting for CTCF (left) and H3K4me3 (right) profiles.

Figure 2 :
Figure 2: Single window deconvolution of CTCF (left) and H3K4me3 (right) profiles.Top panels present the results for the entire window.Bottom panels are zoomed on one of the deconvolved regiones.

Figure 3 :
Figure 3: Logarithm of intensity versus negative logarithm of combined local and global p-values of putative peaks for CTCF (left) and H3K4me3 (right) profiles.

table (
At this stage an output of deconv.entire.genome.seqconsists of a list of all putative peaks even those with low confidence and final elimination of those false positives should be applied.On the zoomed deconvolution window for CTCF (figure2, lower panel) we can see that two peaks were called including one (position 458091) with very low intensity (6.271881) which is a result of overfitting and looks rather like a shoulder of the peak annotated in position 457891.As shown further this low intensity peak gets higher p-value (6.767360e − 03) and its control correction coefficient, defined as − log 10(combined.local.global.p.values IP ) × log 10(intensity IP )− (− log 10(combined.local.global.p.valuesControl ) × log 10(intensity Control )), is much smaller (13.60735) than for the "real" peaks.