GenomicDistributions: fast analysis of genomic intervals with Bioconductor
BMC Genomics volume 23, Article number: 299 (2022)
Epigenome analysis relies on defined sets of genomic regions output by widely used assays such as ChIP-seq and ATAC-seq. Statistical analysis and visualization of genomic region sets is essential to answer biological questions in gene regulation. As the epigenomics community continues generating data, there will be an increasing need for software tools that can efficiently deal with more abundant and larger genomic region sets. Here, we introduce GenomicDistributions, an R package for fast and easy summarization and visualization of genomic region data.
GenomicDistributions offers a broad selection of functions to calculate properties of genomic region sets, such as feature distances, genomic partition overlaps, and more. GenomicDistributions functions are meticulously optimized for best-in-class speed and generally outperform comparable functions in existing R packages. GenomicDistributions also offers plotting functions that produce editable ggplot objects. All GenomicDistributions functions follow a uniform naming scheme and can handle either single or multiple region set inputs.
GenomicDistributions offers a fast and scalable tool for exploratory genomic region set analysis and visualization. GenomicDistributions excels in user-friendliness, flexibility of outputs, breadth of functions, and computational performance. GenomicDistributions is available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/GenomicDistributions.html).
Sets of genomic regions are a fundamental data type for biological data analysis. They result from a variety of epigenome analysis experiments, such as ChIP-seq or ATAC-seq. Genomic region sets consist of genomic coordinates that specify regions with a shared property. Unlike genes, whose functions are better defined, the functional importance of non-coding genomic regions has been harder to interpret. To address this issue, multiple tools have been recently developed for a variety of analyses on genomic region sets, such as enrichment analysis (LOLA , LOLAweb , GIGGLE , IGD , GREAT , epiCOLOC ), visualization (chromPlot , karyoploteR ), region set comparison (BEDTools , Bedshift , AIList , regioneR ), or region annotation (Goldmine , annotatr , ChIPpeakAnno , ChIPseeker ). Other tools have been developed to classify and infer biological function of region sets (word2vec-based embedding , Avocado ), identify regulatory activity of regions (MIRA ), or to analyze heterogeneity across samples (COCOA ). Existing R packages provide some region-based analytical approaches, such as visualizing the distribution of genomic regions across chromosomes or annotations (chromPlot , karyoploteR , annotatr ), or for particular types of region sets, (e.g. ChIPpeakAnno , ChIPseeker  for ChIP-seq data), However, there is no general-purpose R package for extensive visualization and statistical analysis of genomic region sets from any source (Additional file 1: Table S1). Furthermore, many existing packages are not optimized to deal with the growing scale of region data, which now exceeds hundreds of thousands of publicly available region sets and hundreds of billions of individual regions [21,22,23].
To this end, we introduce the GenomicDistributions R package. GenomicDistributions specializes in computing basic statistics and visualizing distributions of genomic region sets from any experimental source. GenomicDistributions functions compute a variety of statistics and plot results to explore genomic region data. These include chromosome distribution plots, feature distance plots, neighbor region distance plots, GC content plots, signal summary plots, genomic partition overlap plots, peak width quantile trimmed histogram plots, and dinucleotide frequency plots. GenomicDistributions offers several key advantages over existing approaches (Fig. 1a): First, we carefully designed the package functions to have a simple, uniform, and flexible interface (Fig. 1b). GenomicDistributions functions all take the same input: a Bioconductor GenomicRanges or GenomicRangesList object, creating a unified interface for the user to summarize one or more region sets with the same line of code. We also separated calculation and plotting functions, enabling users to run calculations for reporting statistics without directly summarizing these only as plots. The outputs for every calculation function are the inputs for plotting functions. Furthermore, all plotting functions return ggplot2 objects, making it easier for users to adjust style of images. We also put considerable effort into optimizing performance, so GenomicDistributions scales better with large inputs than other R packages for related computations (Additional file 1: Fig. S1). Finally, GenomicDistributions provides a broad array of available analysis types. Its scope is more universal than many existing packages, targeting genomic interval data from any source (Additional file 1: Table S1).
Results and discussion
GenomicDistributions functions calculate and plot summary statistics for single or multiple genomic region sets. These include functions for calculating distribution of regions across chromosomes, distances between neighboring regions, GC content, overlaps with genomic features, nearest distances from genomic features, widths of regions, and dinucleotide frequency, as well as functions for summary of user-provided signal values from different conditions.
Several functions require reference genome feature annotations. Users can either provide their own feature annotations or use wrapper functions (indicated with a “Ref” function name suffix), which require only a string specifying genome assembly as an input, and which automatically extract the feature annotations from the associated GenomicDistributionsData package. The wrapper functions are available for hg19, hg38, mm9, and mm10 reference genome assemblies; for other reference genomes, users can still use GenomicDistributions functions, but will need to supply the annotation data (Additional file 2: Table S2).
To demonstrate the use of GenomicDistributions, we put together a dataset of 5 genomic region sets of various types: EZH2 (Enhancer Of Zeste 2 Polycomb Repressive Complex 2 Subunit) regions in embryonic hepatocyte cells; FGF2 (Fibroblast Growth Factor 2) differentiation factor regions in iPSC (induced Pluripotent Stem Cells); and three sets of histone marks in B-cells, namely H3K27me3, which is associated with heterochromatin; H3K27ac, which is enriched in active enhancers; and H3K4me3, which is associated with active promoters. For this test set, we applied each GenomicDistributions calculation and plotting function. Here, we show examples of the resulting plots.
First, the signal summary function summarizes external signal data across query regions. This function requires a user-provided matrix with normalized signal values across a genome. GenomicDistributionsData provides a pre-constructed matrix of normalized chromatin accessibility signal values across genomes (hg19, hg38, mm10) of different cell types, which can be used to infer cell-type specificity of chromatin accessibility in the query regions (Fig. 2a). The output represents a summary of chromatin accessibility signal within test regions across different cell types. Next, the chromosome distribution plot (Fig. 2b) helps to visualize how the regions are distributed across chromosomes. The neighbor region distance plot (Fig. 2c) shows the distribution of distances between two consecutive regions in a sorted region set. The GC content plot (Fig. 2d) displays the distribution of GC content percentage within genomic regions of interest. The genomic partition distribution plot (Fig. 2e) shows how regions are distributed across genome annotation classes. Users can either provide features of interest or use the calcPartitionsRef function with pre-defined elements including core promoters, proximal promoters, exons, introns, 5′ UTRs (untranslated regions), 3′ UTRs, and intergenic regions (Additional file 2: Table S2). In addition to raw partition distributions plots (Fig. 2e), GenomicDistributions also offers expected partition distribution plots (Additional file 1: Fig. S2). Expected partition distribution plots correct the raw overlap counts (observed) by dividing those by expected overlaps, which depend on the size of the partition. This correction is particularly important, since the sizes of individual partitions such as exons vs. introns are considerably different, and therefore we expect more regions to overlap introns than exons by chance. The corrected overlap values are presented as the log10(observed/expected) overlap count for each partition. We then use the Chi-square independence test to calculate the p-values inferring the significance of the observed overlaps compared to expected (Additional file 1: Table S3). GenomicDistributions also provides a novel plot type that further extends this concept, called cumulative partition distribution plots (Additional file 1: Fig. S3, Supplementary methods). The cumulative partition distribution plot extends this concept in two ways: 1) it shows not only the total counts, but how they accumulate in total genome coverage when regions are ordered by size; and 2) instead of the fraction of regions in each feature, it shows a combined enrichment score, which is the average of the fraction of regions in each feature and the fraction of the features covered by regions. These two changes make the plots more informative, as they include the total size of bases covered in each fraction, and a more balanced enrichment score that naturally accounts for the different total coverage of each partition (Additional file 1: Fig. S3, Supplementary methods).
Another plot produced by GenomicDistributions is the feature distance plot, which shows how the regions are distributed with respect to the nearest feature of interest. Due to the common use of distances to nearest transcription start site (TSS), we provide the calcTSSDistanceRef function for convenience (Fig. 2f). The width distribution plot (Additional file 1: Fig. S4) shows widths of genomic regions as a histogram with clipped top percentiles, a feature that allows enhanced visual comparison by eliminating long tails. Finally, dinucleotide frequency plots (Additional file 1: Fig. S5) shows the distribution of dinucleotide content within genomic regions of interest.
Our design philosophy for GenomicDistributions included several key concepts that provide advantages over other tools with similar purpose: First, we separated calculation functions from plotting functions (Fig. 1b). While many visualization-based analysis functions run calculations and plotting at the same time (Additional file 1: Table S1), by decoupling them, GenomicDistributions provides the flexibility to use the two independently so the intermediate results can be used for other purposes in downstream tools. Each analysis is thus done in two steps: first, by calling a calc function, which returns a summarized set of computed statistics; and second, by passing this result to a plot function.
We also designed a consistent user interface for GenomicDistributions functions. Each calc function can accept either a GenomicRanges object representing a single set of intervals, or a GenomicRangesList object with multiple sets of intervals (Fig. 1a, b). This provides a single interface for either individual region set exploration or to compare among region sets. The result of all calc functions can then be used as direct input into a corresponding plot function. In turn, each plot function returns a ggplot object which can then be further styled by the user. This consistency across functions makes it simple to learn how to use the package and to run multiple analyses on a single input.
Most of the plot functions offer multiple plotting options. For example, the feature distance distribution plot offers: 1) the default histogram option (Fig. 2f); and 2) a heatmap option (Additional file 1: Fig. S6a). Similarly, the signal summary plot can be visualized as a bar plot with any defined groups included (Fig. 1a), or, for example, as a violin plot showing a subset of predefined groups, such as tissues or cell types (Additional file 1: Fig. S6b). In addition to the carefully designed user interfaces, we also took considerable effort to optimize GenomicDistributions for speed. By leveraging the highly optimized code in the R data.table package  and using fast rolling joins, GenomicDistributions calculation functions are much faster than alternatives in other R packages.
To showcase the speed efficiency and scalability of GenomicDistributions, we carried out a running time benchmark against relevant R packages with related functions (Additional file 1: Table S1). Specifically, we tested the speed of four analyses: 1) distribution of regions across genomic partitions, 2) distance of regions to TSSs 3) distribution of regions across chromosomes and 4) distance of regions to user-defined features. To perform this benchmark, we assembled a collection of six ChIP-seq region sets displaying variability in terms of region number (less than 10,000 to more than 300,000 regions) and region widths. Narrow regions are represented by transcription factor region sets (TCF12: Transcription Factor 12, ATF3: Activating Transcription Factor 3; MEF2C: Myocyte Enhancer Factor 2C), while broader regions originate from histone mark region sets (for region set details see Additional file 1: Table S4, Supplementary methods). The running time of GenomicDistributions calc/plot functions is generally lower, and often much lower, than competing packages (Additional file 1: Fig. S1). Furthermore, GenomicDistributions tends to scale much better with increasing number of regions.
GenomicDistributions is an R package with a broad set of functions available in one place with the purpose to explore genomic regions. While other currently available tools provide some of the summary functions, GenomicDistributions is the most feature-rich, in terms of the number and type of statistics/plots that are produced. GenomicDistributions also takes a step forward on ease-of-use through our modular, consistent programming design. Multiple region sets can be analyzed as easily as a single one, and for common reference genomes, GenomicDistributions can make use of our pre-compiled genome annotations. When more flexibility is needed, GenomicDistributions is also adaptable enough to be used with any reference assembly. Not only can users provide their own annotations, or features of interest, we also give them the freedom to visualize the results in their own way either by using the output of calc functions, or by editing ggplot objects returned by plot functions. Lastly, GenomicDistributions is substantially faster and more scalable than other publicly available R packages. Together, these strengths make GenomicDistributions a fast, flexible, powerful, and easy-to-use package for analysis of genomic region set data.
The package is available from https://bioconductor.org/packages/release/bioc/html/GenomicDistributions.html, or https://github.com/databio/GenomicDistributions. Detailed methods description is available in Additional file 1: Supplementary methods.
Availability of data and materials
The test dataset was obtained from ENCODE  (accession numbers: ENCFF869UBZ, ENCFF539PUL, ENCFF969HEX), and CISTROME Data Browser [21, 22] (GEO accession numbers: GSM2698625, GSM2439179). The benchmark region sets were obtained from ENCODE (accession numbers: ENCFF264OCQ, ENCFF543TAQ, ENCFF145FYQ, ENCFF647QUI, ENCFF720VWD, ENCFF130BPH).
Enhancer of Zeste 2 Polycomb Repressive Complex 2 Subunit
Fibroblast Growth Factor 2
Histone H3 lysine 27 trimethylation
Histone H3 lysine 27 acetylation
Histone H3 lysine 4 trimethylation
Histone H3 lysine 4 monomethylation
Induced Pluripotent Stem Cells
Transcription start site
Transcription Factor 12
Activating Transcription Factor 3
Myocyte Enhancer Factor 2C
Human Embryonic Stem Cell
Sheffield NC, Bock C. LOLA: enrichment analysis for genomic region sets and regulatory elements in R and bioconductor. Bioinformatics. 2016;32(4):587–9.
Nagraj VP, Magee NE, Sheffield NC. LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis. Nucleic Acids Res. 2018;46(W1):W194–9.
Layer RM, Pedersen BS, Disera T, Marth GT, Gertz J, Quinlan AR. GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods. 2018;15(2):123–6 [cited 2021 Jun 25]. Available from: https://www.nature.com/articles/nmeth.4556.
Feng J, Sheffield NC. IGD: high-performance search for large-scale genomic interval datasets. Bioinformatics. 2021;37(1):118–20.
McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010;28(5):495–501.
Zhou Y, Sun Y, Huang D, Li MJ. epiCOLOC: integrating large-scale and context-dependent Epigenomics features for comprehensive Colocalization analysis. Front Genet. 2020;11:53.
Oróstica KY, Verdugo RA. chromPlot: visualization of genomic data in chromosomal context. Bioinformatics. 2016;32(15):2366–8.
Gel B, Serra E. karyoploteR: an R/bioconductor package to plot customizable genomes displaying arbitrary data. Hancock J, editor. Bioinformatics. 2017;33(19):3088–90.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.
Gu A, Cho HJ, Sheffield NC. SHORT REPORT Bedshift: perturbation of genomic interval sets. bioRxiv. 2020;12:2020.11.11.378554.
Feng J, Ratan A, Sheffield NC. Augmented interval list: a novel data structure for efficient genomic interval search. Bioinformatics. 2019;35(23):4907–11.
Gel B, Díez-Villanueva A, Serra E, Buschbeck M, Peinado MA, Malinverni R. regioneR: an R/bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics. 2016;32(2):289–91.
Bhasin JM, Ting AH. Goldmine integrates information placing genomic ranges into meaningful biological contexts. Nucleic Acids Res. 2016;44(12):5550–6.
Cavalcante RG, Sartor MA. Annotatr: genomic regions in context. Valencia A, editor. Bioinformatics. 2017;33(15):2381–3.
Zhu LJ, Gazin C, Lawson ND, Pagès H, Lin SM, Lapointe DS, et al. ChIPpeakAnno: a bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics. 2010;11(1):237.
Yu G, Wang L-G, He Q-Y. ChIPseeker: an R/bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics. 2015;31(14):2382–3.
Gharavi E, Gu A, Zheng G, Smith JP, Zhang A, Brown DE, et al. Embeddings of genomic region sets capture rich biological associations in lower dimensions. Bioinformatics. 2021;37(23):4299–306. https://doi.org/10.1093/bioinformatics/btab439.
Schreiber J, Durham T, Bilmes J, Noble WS. Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome. Genome Biol. 2020;21(1):1–18.
Lawson JT, Tomazou EM, Bock C, Sheffield NC. MIRA: an R package for DNA methylation-based inference of regulatory activity. Bioinformatics. 2018;34(15):2649–50.
Lawson JT, Smith JP, Bekiranov S, Garrett-Bakelman FE, Sheffield NC. COCOA: coordinate covariation analysis of epigenetic heterogeneity. Genome Biol. 2020;21(1):1–23.
Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, et al. Cistrome data browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res. 2017;45(D1):D658–62.
Zheng R, Wan C, Mei S, Qin Q, Wu Q, Sun H, et al. Cistrome data browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res. 2019;47(D1):D729–35.
Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, et al. The encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 2018;46(D1):D794–801.
Dowle M, Srinivasan A, Gorecki J, Chirico M, Stetsenko P, Short T, et al. data.table: Extension of `data.frame [Internet]. 2021. Available from: https://rdatatable.gitlab.io/data.table/
We acknowledge support, contributions, and helpful discussions from other Sheffield lab members. We also appreciate helpful insights from David Auble.
This work was supported by the National Institute of General Medical Sciences grant GM128636 (NCS). KK was supported by UVA Wagner fellowship. JTL was partially supported by the UVA Cancer Center. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Ethics approval and consent to participate
Consent for publication
NCS is a consultant for InVitroCell Research, LLC.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Kupkova, K., Mosquera, J.V., Smith, J.P. et al. GenomicDistributions: fast analysis of genomic intervals with Bioconductor. BMC Genomics 23, 299 (2022). https://doi.org/10.1186/s12864-022-08467-y
- Genomic regions
- Region set summary
- Data visualization
- R package