NGS-Integrator combines NGS data tracks from experiments that identify genomic regulatory elements and chromatin modifications (e.g. ChIP-Seq, ATAC-Seq, and Bisulfite-Seq), but not RNA-Seq data. A genome-wide NGS track is a vector R = [ri, i] where ri represents a measured value at nucleotide base position i along the length of DNA representing the genome. Usually, but not always, ri represents a discrete value, e.g. the number of sequencing reads that overlap the position. In general, values of ri can range from 0 to infinity. Conceptionally, the value ri gives information about the likelihood of a given feature being present, e.g. binding of a particular transcription factor (TF) from ChIP-Seq data, when considered in the context of the intrinsic background noise, ni, in the signal. To transform the data in terms of probability of a feature, we can use the complement of the minimum Bayes factor,
$$ {\mathrm{p}}_i=1-\exp \left[-\frac{{\mathrm{z}}_i^2}{2}\right], $$
where zi equals ri/ni. With two data tracks k = 1,2, a probability vector that combines information from both can be obtained by calculating the joint probability at each i (Pi) as
$$ {\mathrm{P}}_i={\Pi}_{k=1}^2{p}_k $$
Implementation
Algorithm
Computationally, NGS-Integrator consists of two elements (Fig. 1a). The Calculator element calculates the complement of the minimum Bayes factor (cMBF) at each position i, estimating ni as the median of the ri values across a window of i values straddling the position at which pi is being calculated. This calculation assumes that the features being detected are relatively sparsely represented along the genome, allowing the median to be representative of the true background noise level. The Integrator element calculates the joint probability between the two NGS tracks Pi based on the respective pi vectors. When integrating multiple tracks, sequential dual-input calculations are performed, based on the commutativity of the joint probability operation. Also, it is obvious that the computational time complexity of the algorithms is O(kw) where k is the number of nucleotides in the genome and w is the size of the window. The memory-complexity of the algorithm is a linear function of k.
Input and output
For calculation of genome-wide minimum Bayes’ factors from the read depth, NGS-Integrator requires a bed file containing coverage (number of the sequence reads at each nucleotide) generated from standard tools, such as samtools (depth) and bedtools (coverage, genomecov). The input file should contain 4 fields, chromosome number, start region, end region and coverage value. The output of NGS-Integrator as a BED file contains cMBF values at each nucleotide position across the whole genome. The output BED file can be easily modified to bedGraph or BigWig file format to be visualized on genome browsers. The output of NGS-Integrator could also be used to identify overlapping regions across multiple genome-wide NGS data types (Fig. 1a).
Datasets for integration analyses using NGS-Integrator
Three replicates of ChIP-Seq for the transcription factor ELF1 (unpublished) using an ELF1 antibody (sc-631, Santa Cruz Biotechnology) done in mouse kidney cortical collecting duct cell mpkCCD cells as described previously [6] were used to generate input BED files with genome-wide coverage (10 bp-scale bin). Coverage across the whole genome was estimated by bedtools (coverage) from BAM files. To compare with our ELF1 ChIP-Seq data, public data for ELF1 ChIP-Seq data in erythroleukemia cells (ENCSR033OW) with fold change of signal over control track and its optimal irreproducible discover rate threshold peaks was downloaded from the mouse ENCODE database (https://www.encodeproject.org/).
Four replicates of ATAC-Seq data (GSE108786) were generated in mpkCCD cells as described previously [6]. Datasets of ChIP-Seq data for histone H3K27Ac (GSE95009) and RNA Polymerase II (GSE79584) were used from previous published data [7, 8]. Genome-wide coverage of every 10 bp-scale bin as an input bed file of each replicate was generated by bedtools (coverage) from BAM files.
Multiple ChIP-Seq datasets (GSE29218) published as part of the mouse ENCODE project were used to examine integration of multiple types of ChIP-Seq data. The ChIP-Seq datasets for P300 (GSM723018), RNA Polymerase II (GSM723019) and H3K27Ac (GSM851278) were generated in Bruce4 mouse embryonic stem cells (mESCs) by Dr. Bing Ren’s laboratory [9]. Two replicates from each ChIP-Seq dataset were used for integration using NGS-Integrator.