To show the general applicability of DEBrowser on “count data” from different data types; we used a large data set that we recently generated to study gene regulation in innate immune cells (human monocyte derived dendritic cells, hMDDCs) in response to Toll-like receptor signaling [41]. This study generated RNA-Seq, ChIP-Seq and ATAC-Seq [42] to track changes in transcription and regulatory element activity in the course of Toll-like receptor (TLR) signaling. We reprocessed all raw sequence reads either as described in the original publication or using more recent approaches (Additional file 3).
These data are ideal to showcase the main features of DEBrowser and how DEBrowser can be used throughout the analysis cycle. Indeed, we show how DEBrowser was used for data assessment to identify batch effects, and data preparation by filtering low count features and removing batch effects and then performing differential analysis.
Data assessment
Quality control (QC): Quality control of the count data is a fundamental step in analysis, yet it is not well supported in current applications. With DEBrowser users can easily establish whether normalization, batch correction or sample removal are necessary, or if the data is suitable for differential analysis. To this end, DEBrowser implements PCA, all2all scatter, heatmaps, interquartile range (IQR) and density plots of each sample. These plots can be drawn using a user defined subset of genes, for example by choosing the top N variable genes as defined by coefficient of variance [40, 41]. The subset of genes can be defined graphically by either an expression cutoff or by directly selecting them from another plot.
For example, the data for hMDDCs show clear donor dependent differences, which are visible in all2all plots and are captured by the second principal component (Fig. 2a). These differences may be the result of genetic heterogeneity or simply due to technical variability in library construction. Regardless of the source, the variability introduced by inter-donor differences has a direct impact on the power to detect TLR responsive genes.
Plots available for data assessment
Principal component analysis (PCA)
PCA finds an ordered coordinate transformation whose basis capture, in decreasing order, the most variance in the data. DEBrowser allows users to plot any pair of principal components in a scatter plot. Once users specify sample information (e.g. condition), DEBrowser uses colors or shapes to group samples. These plots are ideal to detect outliers or batch effects (Fig. 2).
All2all scatter plots
Gives an overview of sample similarities and variance by plotting all pairwise scatter plots (Additional file 2: Figure S2). Low correlation or high variance across replicates will negatively impact the power to detect DE.
Heatmaps
DEBrowser allows users select genes based on variance, minimum expression level, DE p-value, or after manually selecting a set of genes from any gene centric plots (e.g. scatter, volcano, and other heatmaps). Heatmaps are also useful to assess replicate variability, low quality samples, or batch effects (Fig. 3). Similar to other plots, heatmaps can be used to visualize any type of count data and are ideal to identify global patterns in the data such as dynamic changes in chromatin accessibility following TLR signaling [41] (Additional file 2: Figure S3).
IQR and density plots
Interquartile range and density plots display a sample’s quantification distribution in different ways. Using these plots, users can detect any global discrepancy across samples and evaluate the impact of normalization on the distribution of counts. DEBrowser simplifies comparisons by providing plots for both normalized and unnormalized data. Plots are re-drawn as soon as users change the normalization method.
Data preparation
Removal of low coverage features
Removing features (genes, or genomic regions) that have low coverage due to their low expression, increases the speed and accuracy of DE algorithms. It also helps to perform more accurate dispersion calculations and multiple hypothesis correction [43]. DEBrowser provides three common ways to filter these features: by specifying a minimum signal in at least one sample, by a minimum average signal across all samples or by requiring a minimum signal in at least n samples (n defined by the users). Once filtering a criterion is specified, DEBrowser reports read counts for each sample (Fig. 4a, b) and plots the feature count distributions before and after applying the filtering (Fig. 4c, d).
Normalization
The count data originating from a sequencing experiment is affected by sequencing depth as well as from differences in the composition of the detected features [44, 45]. DEBrowser supports normalization methods specifically designed for count data: median ratio normalization (MRN) implemented in DESeq2 package [43, 46], Trimmed Mean of M-values (TMM), Relative Log Expression (RLE), and upper quartile methods implemented in the EdgeR package [47]. To evaluate the effect of normalization, DEBrowser immediately displays PCA, IQR and density plots after normalization.
Batch effect correction
When quality control shows a clear batch effect that can be traced back to a technical artifact (e.g. different sequencing devices, different personnel, library kits, reagent batch), DEBrowser allows the users, if the experimental design allows batch correction, to minimize the batch effect [48]. The users can specify a batch for every sample via a simple tab separated file that can be created using a text editor or spreadsheet software. Given a batch specification, DEBrowser supports two different methods: ComBat [26, 49] and Harman [50]. The resulting, batch corrected dataset can be evaluated using the same tools available for initial quality control (Figs. 2b, 3b). DEBrowser provides a platform to detect, correct and evaluate the result of batch correction.
Differential expression (DE) analysis
To demonstrate a typical usage of DEBrowser, we applied DEBrowser on a data set from a previously published study on the role of Jun terminal kinases (JNK1 and JNK2) in the liver and their role in insulin resistance [2]. For this purpose, the authors relied on four different mouse genotypes: wild type (WT), and hepatocyte specific knockouts of Jnk1 and Jnk2 independently (LΔ1, LΔ2), and a double knockout (LΔ1,2). Each genotype was fed either a regular or a high fat diet (HFD). Thereafter, hepatic expression was assayed for each genotype fed with corresponding diet in triplicate using RNA-Seq, resulting in a total of 24 libraries. This study is ideal for DE analysis as it included three replicates per condition. Therefore, we used RSEM [15] for library quantification and DEBrowser to analyze the resulting read count table.
DEBrowser supports differential analysis using DESeq2 [43], EdgeR [47], and Limma [51]. We used all three methods and present the analysis done with DESeq2 in the main figures and the comparisons between all three methods in the supplementary figures (Additional file 2: Figures S6 and S7). Users can perform differential analysis after defining the groups of samples to compare. DE results can be visualized through the same scatter, volcano and MA plots used for data assessment (Fig. 5a-c). Users can highlight results by specifying desired significance and fold change cut-offs. All plots allow interactive access: Users may select a point within the plot to zoom-in and re-display only selected data. Plots are redrawn as soon as the users change any parameter or select points to zoom-in on any data point or set of points that can be investigated by graphically selecting them.
As reported by the authors, high fat diet has a stronger effect on LΔ2 animals compared LΔ1 animals (Additional file 2: Figure S8). To examine genes that are dysregulated in the liver under high fat diet fed mice, we performed DE analysis between WT mice fed with a normal or high fat diet. In all, 493 genes are significantly down regulated and 350 are up regulated in livers of mice fed with a high-fat diet (p < 0.01, |log2foldChange| > 1). Disease ontology analysis of up regulated genes shows, not surprisingly, a clear enrichment of diseases resulting from poor diet (urinary, kidney and other obesity related ailments). Here to show an example, we overlaid enriched insulin signaling pathway genes on a scatter plot (Fig. 6a) and easily created a heatmap (Fig. 6b) by using selected genes on this scatter plot similar to that in Additional file 2: Figure S3-B in the original report [2].
We then compared the effect of both normal and HFD on all four genotypes. To do so we performed pairwise comparisons between all conditions and selected all genes with a |log2foldChange| > 1 and padj< 0.01 in at least one comparison to display in a heatmap similar to Additional file 2: Figure S3-A in the original report (Fig. 6c). Finally, we used DEBrowser to reproduce Additional file 2: Figure S3-B in original report. To do so we performed gene ontology analysis of the DE genes between WT and LΔ1,2 on HFD, selected genes that are annotated as part of the PPARα pathway and visualized them as a heatmap (Fig. 6d).
It is important to note that the original publication used much older DE methods [52]. When we applied DESeq2 to replicate the analysis we found that it had much greater power to detect differentially expressed genes and indeed at a similar threshold many more genes are called DE. Nevertheless, there is a very good agreement (73%) in the calls made by both methods (Additional file 2: Figure S4), and most importantly, there are no differences with the Gene Ontology enrichments reported in the original publication (Additional file 2: Figure S5).
Further, users can explore individual regions by hovering over points. The gene or region id is shown and a bar plot displaying the values of the gene or region across all samples is drawn. This is especially useful to investigate, for example, why certain genes may have large differences in between samples but fail to achieve statistical significance. In Fig. 7a, the, Fabp3 gene fails to achieve a significance of 0.05 despite of its mean expression between the conditions exceeding a 50-fold change. Hovering over Fabp3 shows the high variance of this gene across samples, which explains why statistical tests that account for both inter and intra condition variability fail to achieve significance in cases of high intra condition variability (Fig. 7b).
Volcano, Scatter and MA plots work on summary statistics: significance averages or fold change of averages. To explore the underlying data for any set of regions in a plot, DEBrowser can draw heatmaps for any selected region from any main plot. Selection can be made in a rectangular form or as a free-form using plotly’s lasso select (Fig. 8a), which then dynamically generates a heatmap of the selection. (Fig. 8b). Conversely, in any heatmap the users can select a subset of regions (such as based on similar expression pattern) for downstream analysis such as gene ontology, disease and pathway analysis.
Gene ontology, disease and pathway discovery
For gene expression analysis in particular, DEBrowser supports Gene Ontology (GO) [53], KEGG pathway [53] and disease ontology analysis [54]. Users can perform GO or Pathway analysis directly on the results of differential expression analysis or on a subset of selected genes from any of the plots described above. For KEGG pathway analysis, in particular, DEBrowser provides pathway diagram for each enriched category (Fig. 9).
To further assist users in differential analysis, DEBrowser provides k-means clustering of differential regions, and when these regions are associated with genes, a gene ontology enrichment analysis is performed using enrichGO function in ClusterProfiler package [53] (Fig. 10).
Comparison to related applications
There are several applications, with varying functionalities, available for the exploration and analysis of DE. Most notable ones are, OASIS [27], VisRseq [28], DEGUST [29], DEIVA [30], WebMeV [31], Chipster [32], and DEapp [33]. A comparison of DEBrowser features to those applications is shown in Additional file 1: Table S1.
DEBrowser modular design
To reduce the code complexity and manage the program easier the components were designed in a modular fashion, so that while DEBrowser grows larger, it is easy to build on top of the simple modules. To this end, bar, box and scatter plots, heatmaps modules could be reused multiple times in DEBrowser. We also shared example shiny applications that use individual modules. This modularity increased our development, test speed, and code reusability. For example, the size and the margins of the plots are controlled within the same module in all the plots in DEBrowser. This modular design allows other users to repurpose any of the tools built into DEBrowser for their own packages.