Preprocessing of raw FASTQ reads
We implemented an efficient Q-nexus preprocessing application, flexcat (that is based on flexbar [26]), using the SeqAn C++ programming library [27]. flexcat removes the random and the fixed barcodes, inserts the randomized barcode into the ID field of the sequence (for instance, TL:ATGCC would be added to the description line of a sequence with the random barcode ATGCC), and clips adapter sequences. In ChIP-nexus reads, the random barcode is followed directly by a fixed four-nucleotide barcode. Reads that display no fixed barcode or more than one different nucleotide within the fixed barcode are discarded.
Read mapping
In principle, Q-nexus can be used with any read mapper, whereby only uniquely mappable reads should be used for downstream analysis. In the experiments described in this work, we used bowtie [16] version 1.1.2 with the settings —chunkmbs 512, -k 1, -m 1, —best, —strata.
Processing of aligned reads
We implemented an efficient tool called nexcat that scans BAM files in order to identify sets of PCR duplicates. The random barcode from the FASTQ description line is carried over into the BAM file, and can thus be used to distinguish PCR duplicates from IMUB reads. PCR duplicates are identified as sets of reads with an identical random barcode whose 5’ terminus is located at the same genomic position and all but one of those reads are discarded. Since Q-nexus peak calling utilizes only the 5’ end of each read, it does not matter which read is retained.
Assessment of Q-nexus duplication levels
Our method has three different ways of defining “duplication”. Identically mapped (IM) reads, are simply reads whose 5’ end maps to the identical chromosomal location. The information in the random barcodes is not relevant for the determination of IM reads. Identically mapped with identical barcode (IMIB) reads are defined as IM reads that additionally have an identical random barcode. The ChIP-nexus analysis pipeline [11] removes all IMIB reads except one at each position. We name the remaining reads identically mapped with unique barcode (IMUB) reads (in the original publication, these reads were named “usable reads”). The duplication plots shown in Fig. 2
b–c provide an overview of IM, IMIB, and IMUB reads according to the proportion of reads that have a given duplication level (Fig. 2
a and Additional file 1: Figure S1). We calculate an overall duplication level as the proportion of reads with a duplication level of two or more among all reads. Table 2 shows the overall duplication levels according to each of the three definitions of “duplication”.
Binding characteristics: qfrag length distribution and pseudo control
We refer to 5’ end positions of reads that map either to the forward or the reverse strand as hits. The outcome of ChIP-nexus experiment is modeled as a set of hits:
$${} T=\{\ h=(\text{pos},\text{strand})\ |\ \text{pos} \in \{1,\ldots,l\} \wedge \text{strand} \in \{f,r\}\}, $$
where l is the length of the chromosome. A qfrag is defined to be the genomic interval between an ordered pair of hits (h
i
,h
j
), such that h
i
is on the forward strand, h
j
is on the reverse strand. For the distribution of qfrag-lengths qfrags of fixed lengths are considered and for each length δ=2,…,Δ the number of qfrags is determined:
$$Q_{t}(\delta)=|\{(h_{i},h_{j})\ |\ h_{i} \in T_{f}\ \land\ h_{j} \in T_{r}\ \land h_{j}-h_{i}=\delta\}|. $$
The pseudo-control is derived from the original data by inverting the strand information for each given hit, i.e
$$\begin{array}{@{}rcl@{}} h^{\prime}.strand:=\left\{ \begin{array}{ll} f,& \text{if}\ h.strand=r\\ r, & \text{otherwise} \end{array}\right. \end{array} $$
and subsequently shifting the (strand inverted) hit by one read length rl towards 5’ end, i.e.
$$\begin{array}{@{}rcl@{}} h^{\prime}.pos:=\left\{ \begin{array}{ll} h.pos+rl-1,& \text{if}\ h^{\prime}.strand=r\\ h.pos-rl+1, & \text{otherwise} \end{array}\right. \end{array} $$
The distribution of qfrag-length in the pseudo-control is defined as before:
$$Q_{p}(\delta)=|\{(h^{\prime}_{i},h^{\prime}_{j})\ |\ h^{\prime}_{i} \in T^{\prime}_{f}\ \land\ h^{\prime}_{j} \in T^{\prime}_{r}\ \land h^{\prime}_{j}-h^{\prime}_{i}=\delta\}|. $$
We use the difference between Q
t
(δ) and Q
p
(δ) as signature and the maximum value as estimate for the protected-region width ℓ
′′′, i.e.
$$\ell^{\prime\prime\prime}={\underset{\delta}{\text{arg max}}}\, Q_{t}(\delta)-Q_{p}(\delta). $$
Evaluation: Binding characteristics
5’ end coverage around motif centered binding sites
For each dataset summits were derived using Q [19] with the following parameter settings: —fragment- length-average 15, —fragment-length- deviation 10, —keep-dup and sorted by significance. The top 2,000 peaks (summit position ± 40) were used for a de novo motif analysis with DREME [28] using default settings. The top 30,000 peaks were filtered for those with an occurrence of the top motif in a distance of at most the length of the motif. Finally, the selected summits were centered to the center of the motif occurrence. Around the motif filtered and centered sites the integrated distribution of 5’ ends were determined using Q with the following parameter settings: —bed-hit-dist <CENTERED_SITES_BED>,—keep-dup, —pseudo-control.
Cross-correlation analysis
We performed cross-correlation analysis using the function get.binding.characteristics of the SPP package (version 1.11) with the following parameter settings: srange=c(2,110),bin=1.
MACS2
The predicted fragment lengths were derived in the course of peak calling with parameter settings as stated below.
MACE
The optimal border pair sizes of MACE were derived in the course of peak calling with parameter settings as stated below.
qfrag-length distribution with pseudo-control
We derived qfrag-length distributions using Q with the following parameter settings: —qfrag-length-distribution, —step-num 110, —keep-dup.
ChIP-nexus peak calling
We use q
min
=ℓ
′′′−x and q
max
=ℓ
′′′+x to form qfrags from all hits on the forward and reverse strand that satisfy q
min
≤hj.pos−hi.pos≤q
max
. We used a value of x=5 by default, which is a parameter for the deviation from estimated protected-region width.
We calculate the depth of qfrags at any given position and search the qfrag coverage profile along the genome for free-standing local maxima which we refer to as summits that correspond to predicted binding positions, where free-standing means that there is no position with a higher qfrag depth within a radius of q
min
. For each summit s
i
the number of 5’ ends within the range s
i
−q
max
,…,s
i
+q
max
, denoted as k, is determined. Assuming a null model in which reads are evenly distributed across the genome, P-values are calculated using the Poisson distribution.
$$P(x \geq k)=1-\sum_{i=0}^{k-1}\text{Pois}(i,\lambda) $$
where
$$\lambda=2 \cdot q_{max} \cdot \frac{|T_{f}|+|T_{r}|}{l} $$
All regions covered by at least one qfrag are tested. P-values are corrected for multiple testing using the Benjamini-Hochberg procedure [22].
Evaluation: reproducibility of peak calling
IDR analyses were performed with parameter settings recommended for pairs of biological replicates. Peak lists were derived for Q-nexus, MACS2 and MACE as stated below.
Peak calling parameters for Q-nexus
We used version 1.3.0 of Q with the following parameter settings: —nexus-mode, —top-n 200,000. Q-nexus predicts single binding positions or summits. The summits were extended by 2 bp in upstream and downstream direction, sorted by signal score, i.e. the number of 5’ ends that map for a given summit s
i
into the range s
i
−q
max
,…,s
i
−q
max
, and only the top 100,000 were kept.
Peak calling parameters for MACS2
We used version 2.1.0.20150731 of MACS2 in the callpeak mode with the following parameter settings —keep-dup all, —pvalue 5e-01, —call-
summits. Furthermore, we used —gsize dm for Drosophila and —gsize hs for Human to specify the size of the genome. MACS2 tends to combine multiple adjacent summits into broader broader peaks and only reports the highest summit position, but the option —call-summits causes MACS2 to report all summits. The summits were extended by 2 bp in upstream and downstream direction, sorted by P-value, and only the top 100,000 were kept.
Peak calling parameters for MACE
We used version 1.2 of MACE. The python script for preprocessing was used with the following parameter settings: —kmerSize 0, which turns off the nucleotide bias correction. We did this, because according to the implementation the length of each read has to be greater than three times the kmer-size. Discarding all (clipped) reads shorter than 19 leads to a significant loss of information. The python script for peak calling was used with the following parameter settings: —pvalue 0.99. The MACE algorithm does not report peaks, but border pairs, i.e. pairs of peaks on the forward and reverse strand with a distance that approximates to a optimal border pair size that is estimated from the data. We defined the centers between the border pairs as summits. The summits were extended by 2 bp in upstream and downstream direction, sorted by P-value, and only the top 100,000 were kept.