NGS-Integrator: An efficient tool for combining multiple NGS data tracks using minimum Bayes’ factors

Wen, Bronte; Jung, Hyun Jun; Chen, Lihe; Saeed, Fahad; Knepper, Mark A.

doi:10.1186/s12864-020-07220-7

Software
Open access
Published: 19 November 2020

NGS-Integrator: An efficient tool for combining multiple NGS data tracks using minimum Bayes’ factors

Bronte Wen¹^na1,
Hyun Jun Jung^1,2^na1,
Lihe Chen¹,
Fahad Saeed³ &
…
Mark A. Knepper ORCID: orcid.org/0000-0002-2276-8091¹

BMC Genomics volume 21, Article number: 806 (2020) Cite this article

1934 Accesses
3 Citations
Metrics details

This article has been updated

Abstract

Background

Next-generation sequencing (NGS) is widely used for genome-wide identification and quantification of DNA elements involved in the regulation of gene transcription. Studies that generate multiple high-throughput NGS datasets require data integration methods for two general tasks: 1) generation of genome-wide data tracks representing an aggregate of multiple replicates of the same experiment; and 2) combination of tracks from different experimental types that provide complementary information regarding the location of genomic features such as enhancers.

Results

NGS-Integrator is a Java-based command line application, facilitating efficient integration of multiple genome-wide NGS datasets. NGS-Integrator first transforms all input data tracks using the complement of the minimum Bayes’ factor so that all values are expressed in the range [0,1] representing the probability of a true signal given the background noise. Then, NGS-Integrator calculates the joint probability for every genomic position to create an integrated track. We provide examples using real NGS data generated in our laboratory and from the mouse ENCODE database.

Conclusions

Our results show that NGS-Integrator is both time- and memory-efficient. Our examples show that NGS-Integrator can integrate information to facilitate downstream analyses that identify functional regulatory domains along the genome.

Background

Genome-wide next-generation sequencing (NGS) can provide information about binding of proteins to DNA, chromatin modifications, and chromatin accessibility. These factors are under intense study because they determine what genes are expressed in a given cell type and how gene expression is regulated in different physiological or pathophysiological states. In general, as with any experimental data, genome-wide NGS data has background noise, random and systematic errors that can obscure the features being investigated. Therefore, multiple replicates are needed to maximize confidence and consensus in feature identification. It is essential to replicate multiple NGS experiments for confidence in the produced data. However, approaches to utilization of multiple data tracks for feature identification has been limited largely to visualization of superimposed [1] or stacked data tracks. Peak-calling tools [2,3,4,5] are typically applied to individual replicates and not the aggregate data from multiple tracks.

In this paper, we propose an efficient method for integrating multiple tracks for data averaging of multiple replicates. Our proposed method, called NGS-Integrator, transforms genome-wide NGS data for identification of genomic regulatory elements and chromatin modifications (e.g. ChIP-Seq, ATAC-Seq, and Bisulfite-Seq) to a probability vector using the complement of the minimum Bayes’ factor for each track, followed by calculation of joint probabilities for the tracks as a function of position along the genome. The integrated tracks can then be used as input to peak calling programs to identify specific features with greater specificity.

Methods

NGS-Integrator combines NGS data tracks from experiments that identify genomic regulatory elements and chromatin modifications (e.g. ChIP-Seq, ATAC-Seq, and Bisulfite-Seq), but not RNA-Seq data. A genome-wide NGS track is a vector R = [r_i, i] where r_i represents a measured value at nucleotide base position i along the length of DNA representing the genome. Usually, but not always, r_i represents a discrete value, e.g. the number of sequencing reads that overlap the position. In general, values of r_i can range from 0 to infinity. Conceptionally, the value r_i gives information about the likelihood of a given feature being present, e.g. binding of a particular transcription factor (TF) from ChIP-Seq data, when considered in the context of the intrinsic background noise, n_i, in the signal. To transform the data in terms of probability of a feature, we can use the complement of the minimum Bayes factor,

$$ {\mathrm{p}}_i=1-\exp \left[-\frac{{\mathrm{z}}_i^2}{2}\right], $$

where z_i equals r_i/n_i. With two data tracks k = 1,2, a probability vector that combines information from both can be obtained by calculating the joint probability at each i (P_i) as

$$ {\mathrm{P}}_i={\Pi}_{k=1}^2{p}_k $$

Implementation

Algorithm

Computationally, NGS-Integrator consists of two elements (Fig. 1a). The Calculator element calculates the complement of the minimum Bayes factor (cMBF) at each position i, estimating n_i as the median of the r_i values across a window of i values straddling the position at which p_i is being calculated. This calculation assumes that the features being detected are relatively sparsely represented along the genome, allowing the median to be representative of the true background noise level. The Integrator element calculates the joint probability between the two NGS tracks P_i based on the respective p_i vectors. When integrating multiple tracks, sequential dual-input calculations are performed, based on the commutativity of the joint probability operation. Also, it is obvious that the computational time complexity of the algorithms is O(kw) where k is the number of nucleotides in the genome and w is the size of the window. The memory-complexity of the algorithm is a linear function of k.

Input and output

For calculation of genome-wide minimum Bayes’ factors from the read depth, NGS-Integrator requires a bed file containing coverage (number of the sequence reads at each nucleotide) generated from standard tools, such as samtools (depth) and bedtools (coverage, genomecov). The input file should contain 4 fields, chromosome number, start region, end region and coverage value. The output of NGS-Integrator as a BED file contains cMBF values at each nucleotide position across the whole genome. The output BED file can be easily modified to bedGraph or BigWig file format to be visualized on genome browsers. The output of NGS-Integrator could also be used to identify overlapping regions across multiple genome-wide NGS data types (Fig. 1a).

Datasets for integration analyses using NGS-Integrator

Three replicates of ChIP-Seq for the transcription factor ELF1 (unpublished) using an ELF1 antibody (sc-631, Santa Cruz Biotechnology) done in mouse kidney cortical collecting duct cell mpkCCD cells as described previously [6] were used to generate input BED files with genome-wide coverage (10 bp-scale bin). Coverage across the whole genome was estimated by bedtools (coverage) from BAM files. To compare with our ELF1 ChIP-Seq data, public data for ELF1 ChIP-Seq data in erythroleukemia cells (ENCSR033OW) with fold change of signal over control track and its optimal irreproducible discover rate threshold peaks was downloaded from the mouse ENCODE database (https://www.encodeproject.org/).

Four replicates of ATAC-Seq data (GSE108786) were generated in mpkCCD cells as described previously [6]. Datasets of ChIP-Seq data for histone H3K27Ac (GSE95009) and RNA Polymerase II (GSE79584) were used from previous published data [7, 8]. Genome-wide coverage of every 10 bp-scale bin as an input bed file of each replicate was generated by bedtools (coverage) from BAM files.

Multiple ChIP-Seq datasets (GSE29218) published as part of the mouse ENCODE project were used to examine integration of multiple types of ChIP-Seq data. The ChIP-Seq datasets for P300 (GSM723018), RNA Polymerase II (GSM723019) and H3K27Ac (GSM851278) were generated in Bruce4 mouse embryonic stem cells (mESCs) by Dr. Bing Ren’s laboratory [9]. Two replicates from each ChIP-Seq dataset were used for integration using NGS-Integrator.

Results

Example 1:Integration of multiple-replicate transcription factor ChIP-Seq data

To provide an example of an application of NGS-Integrator for genome-wide NGS data for multiple replicates of the same experiment, we processed previously unpublished datasets obtained from antibody-based chromatin immunoprecipitation followed by NGS (ChIP-Seq) using an antibody to the transcription factor ELF1 (Fig. 1b). Based on technical limitations, especially in the antibody-based immunoprecipitation step, ChIP-Seq data for TF binding site identification often includes significant background noise that obscures specific binding sites. In this example, we used three replicates of ELF1 ChIP-Seq data generated in mouse kidney cortical collecting duct (mpkCCD) cells [6, 10]. Figure 1 shows a genomic region on the UCSC genome browser that includes the Eef2 gene. The replicates are shown along with peaks identified with MACS2 [11] for each. Only one peak is common to all three replicates. Integration of three replicate datasets using NGS-Integrator is shown as the red track. When this track was used as input to MACS2, additional peaks were called, many of which contain typical ELF1 binding motifs (ANCCGGAAGT) identified with Homer 4.9 (blue lines, Fig. 1). For comparison, we showed ELF1 ChIP-Seq data reported in ENCODE (ENCSR033OWC) for murine erythroleukemia (MEL) cells along with called peaks (IDR peaks) indicated by green bars. ELF1 binding from the present study is consistent with ELF1 binding in the ENCODE data even though the cell types are different. We provide an additional example of the use of NGS-Integrator to integrate ChIP-Seq data in Supplementary Fig. 1, mapping genomic binding sites of the TF SNAIL1 in human LS174T colorectal cancer cells [12].

Example 2:Integration of multiple replicates of ATAC-Seq data

ATAC-Seq (Assay for Transposase-Accessible Chromatin using Sequencing) is a recently developed method to map open chromatin regions across the genome [13] and is able to identify regulatory DNA elements where transcriptional regulators bind. We previously applied ATAC-Seq in mpkCCD cells to identify active regulatory DNA elements near the Aqp2 gene, which codes for the water channel protein aquaporin-2 [6]. Here, four replicates of ATAC-Seq data from mpkCCD cells were integrated (Fig. 2a) as another example of homogeneous data integration. Data integration using NGS-Integrator allows the data to be summarized in a single track with improved overall signal-to-noise discrimination (red track, Fig. 2a).

Example 3:Combining data from RNA polymerase II ChIP-Seq, histone H3K27Ac ChIP-Seq, and ATAC-Seq to predict locations of enhancers

As a third example, heterogeneous data integration was done using three types of data expected to overlap at active enhancer sites, viz. ATAC-Seq, ChIP-Seq for histone H3 acetylation at Lysine 27 (H3K27Ac) and ChIP-Seq for RNA polymerase II (RNA Pol II), all from mouse mpkCCD cells (Fig. 2b). Although these three features are known to overlap at enhancers [14], none alone are adequate to precisely identify the boundaries of active enhancers. Typically, sequences within identified enhancers are analyzed to identify putative TF binding sites. By sharpening enhancer boundaries, NGS-Integrator can narrow the list of putative TFs that need to be studied further to understand the regulation of a particular gene.

Example 4:Combining data from P300, RNA polymerase II, histone H3K27Ac ChIP-Seq datasets to predict locations of active regulatory elements

To test integration of a large number of datasets, NGS-Integrator was applied to ChIP-Seq datasets published by Dr. Bing Ren’s laboratory. Total 6 ChIP-Seq datasets for the histone acetyltransferase P300, RNA Pol II and Histone H3K27Ac (two replicates per target protein) generated from mouse embryonic stem cell line, Bruce4, were used to calculate cMBFs from each dataset and combine all 6 datasets into one track with NGS-Integrator (Fig. 3), revealing active regions surrounding Eef2 including one at the transcriptional start site.

Performance characteristics

Figure 4 shows execution time and memory usage for a range of input file sizes for the Calculator module (Fig. 4a) and the Integrator module of NGS-Integrator (Fig. 4b). To vary input file sizes, multiple scale bins (1000 bp, 500 bp, 100 bp, 50 bp, 10 bp and 5 bp) were generated in four different mouse chromosomes (1, 10, 17, and 19) through the coverage calculation using bedtools prior to application of NGS-Integrator. As shown in Fig. 4a, the execution time increases with a linear-trend with increasing size of the input file. The memory consumption for given set of files was also noted to be consistent with increasing size of the data. Figure 4b shows that decreasing the file size for integrating output files from the Calculation module corresponds to a sharp decrease in the execution time, as expected. Complexity analysis of the algorithms suggests that it can operate in linear time and space with respect to the length of the genome, consistent with the results shown here.

Conclusions

In this paper we have proposed, designed and implemented easy-to-use software called NGS-Integrator which allows integration of multiple genome-wide NGS data tracks. Our software is implemented in Java and allows integration of data obtained either from multiple datasets of the same technique or data sets derived from different, but complementary, techniques. As shown in the examples, NGS-Integrator can integrate information to facilitate downstream analyses that identify functional domains along the genome. In the Calculator module, the NGS data are transformed to range [0,1] through use of the complement of the minimum Bayes’ factor. Fundamentally, any data transformation that yields values in the range [0,1] and has a positive monotonic relation to the input data would work. The main reason for the choice of the minimum Bayes’ factor is the direct use of the background noise for the calculation, which is easily estimated for sparse data as described. We expect that NGS-Integrator will be very useful for a wide variety of experimental studies where multiple replicates of genome-wide data need to be integrated into a single track.

Availability and requirements

Project name: NGS-Integrator.

Project home page: https://hpcwebapps.cit.nih.gov/ESBL/NGS-Integrator/.

Archived version: GitHub (https://github.com/ESBL/NGS-Integrator).

Operating system(s): Platform independent.

Programming language: Java.

Other requirements: Java 1.7 or higher.

License: Open source.

Availability of data and materials

The datasets analyzed during the current study are available in the GEO: ATAC-Seq data (GSE108786), ChIP-Seq data for histone H3K27Ac (GSE95009), and RNA Polymerase II (GSE79584). The datasets of ELF1 ChIP-seq analyzed during the current study are available from the corresponding author on request.

Change history

08 December 2020
The funding information in the metadata of the original publication was incorrect. The article has been updated to rectify the errors.

Abbreviations

NGS:: Next-generation sequencing
TF:: Transcription factor
cMBF:: Complement of the minimum Bayes factor
ChIP-seq:: Chromatin immunoprecipitation followed by DNA sequencing
ATAC-seq:: Assay for Transposase-Accessible Chromatin using sequencing
H3K27Ac:: Histone H3 acetylation at lysine 27
RNA Pol II:: RNA polymerase II

References

An J, Lai J, Wood DL, Sajjanhar A, Wang C, Tevz G, et al. RNASeqBrowser: a genome browser for simultaneous visualization of raw strand specific RNAseq reads and UCSC genome browser custom tracks. BMC Genomics. 2015;16:145.
Article Google Scholar
Feng J, Liu T, Qin B, Zhang Y, Liu XS. Identifying ChIP-seq enrichment using MACS. Nat Protoc. 2012;7(9):1728–40.
Article CAS Google Scholar
Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009;27(1):66–75.
Article CAS Google Scholar
Xu J, Zhang Y. A generalized linear model for peak calling in ChIP-Seq data. J Comput Biol. 2012;19(6):826–38.
Article CAS Google Scholar
Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008;26(11):1293–300.
Article CAS Google Scholar
Jung HJ, Raghuram V, Lee JW, Knepper MA. Genome-wide mapping of DNA accessibility and binding sites for CREB and C/EBPbeta in vasopressin-sensitive collecting duct cells. J Am Soc Nephrol. 2018;29(5):1490–500.
Article CAS Google Scholar
Sandoval PC, Claxton JS, Lee JW, Saeed F, Hoffert JD, Knepper MA. Systems-level analysis reveals selective regulation of Aqp2 gene expression by vasopressin. Sci Rep. 2016;6:34863.
Article CAS Google Scholar
Isobe K, Jung HJ, Yang CR, Claxton J, Sandoval P, Burg MB, et al. Systems-level identification of PKA-dependent signaling in epithelial cells. Proc Natl Acad Sci U S A. 2017;114(42):E8875–84.
Article CAS Google Scholar
Shen Y, Yue F, McCleary DF, Ye Z, Edsall L, Kuan S, et al. A map of the cis-regulatory sequences in the mouse genome. Nature. 2012;488(7409):116–20.
Article CAS Google Scholar
Yu MJ, Miller RL, Uawithya P, Rinschen MM, Khositseth S, Braucht DW, et al. Systems-level analysis of cell-specific AQP2 gene expression in renal collecting duct. Proc Natl Acad Sci U S A. 2009;106(7):2441–6.
Article CAS Google Scholar
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.
Article Google Scholar
Beyes S, Andrieux G, Schrempp M, Aicher D, Wenzel J, Anton-Garcia P, et al. Genome-wide mapping of DNA-binding sites identifies stemness-related genes as directly repressed targets of SNAIL1 in colorectal cancer cells. Oncogene. 2019;38(40):6647–61.
Article CAS Google Scholar
Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10(12):1213–8.
Article CAS Google Scholar
Spicuglia S, Vanhille L. Chromatin signatures of active enhancers. Nucleus. 2012;3(2):126–31.
Article Google Scholar

Download references

Acknowledgements

We carried out next-generation sequencing in the National Heart, Lung, and Blood Institute DNA Sequencing Core Facility (Dr. Yuesheng Li, Director).

Funding

The work was supported by the Division of Intramural Research, National Heart, Lung, and Blood Institute projects ZIA-HL001285 (to MAK) and ZIA-HL006129 (to MAK). BW was supported by the NHLBI Summer Internship Program (June–August 2018). Fahad Saeed was supported by the NIGMS of the National Institutes of Health (NIH) under award number: R01GM134384, and by the National Science Foundations (NSF) under the award number: NSF CAREER OAC-1925960.

Author information

Bronte Wen and Hyun Jun Jung contributed equally to this work.

Authors and Affiliations

Epithelial Systems Biology Laboratory, Systems Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, USA
Bronte Wen, Hyun Jun Jung, Lihe Chen & Mark A. Knepper
Division of Nephrology, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
Hyun Jun Jung
School of Computing and Information Sciences, Florida International University, Miami, Florida, USA
Fahad Saeed

Authors

Bronte Wen
View author publications
You can also search for this author in PubMed Google Scholar
Hyun Jun Jung
View author publications
You can also search for this author in PubMed Google Scholar
Lihe Chen
View author publications
You can also search for this author in PubMed Google Scholar
Fahad Saeed
View author publications
You can also search for this author in PubMed Google Scholar
Mark A. Knepper
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

BW, LC, HJJ and MAK designed the algorithm. BW and HJJ wrote the code and HJJ carried out NGS experiments. BW, HJJ and FS carried out performance testing. BW, LC, HJJ, FS and MAK wrote the manuscript and all authors revised and approved the final manuscript.

Corresponding author

Correspondence to Mark A. Knepper.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Fig. S1. Examples of data integration using NGS-Integrator to identify genomic binding sites for transcription factor SNAIL1-HA in human LS174T colorectal cancer cells. a and b Two replicates of ChIP-Seq data for SNAIL1-HA obtained from GSE127183 were integrated (window size for background noise calculation: 10 kb with 2× median across the window). The NGS-Integrator integrated track (red track) was generated from the two replicates and peaks identified using MACS2 (bdgcallpeak, cutoff > 0.5) was also shown below the NGS-Integrator integrated track (red). All data tracks were displayed on the UCSC Genome Browser with human genome hg19.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Wen, B., Jung, H., Chen, L. et al. NGS-Integrator: An efficient tool for combining multiple NGS data tracks using minimum Bayes’ factors. BMC Genomics 21, 806 (2020). https://doi.org/10.1186/s12864-020-07220-7

Download citation

Received: 30 April 2020
Accepted: 09 November 2020
Published: 19 November 2020
DOI: https://doi.org/10.1186/s12864-020-07220-7

NGS-Integrator: An efficient tool for combining multiple NGS data tracks using minimum Bayes’ factors

Abstract

Background

Results

Conclusions

Background

Methods

Implementation

Algorithm

Input and output

Datasets for integration analyses using NGS-Integrator

Results

Example 1:Integration of multiple-replicate transcription factor ChIP-Seq data

Example 2:Integration of multiple replicates of ATAC-Seq data

Example 3:Combining data from RNA polymerase II ChIP-Seq, histone H3K27Ac ChIP-Seq, and ATAC-Seq to predict locations of enhancers

Example 4:Combining data from P300, RNA polymerase II, histone H3K27Ac ChIP-Seq datasets to predict locations of active regulatory elements

Performance characteristics

Conclusions

Availability and requirements

Availability of data and materials

Change history

08 December 2020

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary Information

Additional file 1:

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us