To demonstrate the utility of WHAM!, we used two independent, publicly available test datasets. The first was derived from 47 human microbiome samples from four body sites made available by the Human Microbiome Project (HMP) [27]. Shotgun metagenomic sequencing data were processed with the Huttenhower bioBakery pipeline [18], including FastQC, KneadData [19], MetaPhlAn [1] and HUMAnN2 [2, 3], to obtain an annotated gene abundance matrix. After host decontamination and quality filtering, the estimated counts in each sample were calculated by multiplying the relative abundance of each feature by the total sum of profiled counts. Following count estimation, the gene family identifiers were further collapsed by GO term mapping via the “humann2_regroup_table” function provided within HUMAnN2. This dataset is available as a test case in our web app through the ‘Try a Sample Dataset’ mode on the application homepage. Although this dataset is already well studied, our analysis of these HMP sequencing data highlights the utility and exploratory capabilities provided by our visualization suite. As expected, body sites vary widely in the species present and in the abundance of these taxa (Fig. 2a). Arm samples were dominated by the genus Cutibacterium (previously classified as Propionibacterium), a pattern also observed in the original HMP analysis (Fig. 2b, c) [27]. Furthermore, stool and saliva samples exhibited much greater microbial diversity than arm and vaginal samples, at the depth of resolution provided in the original data (Fig. 2a). As demonstrated, WHAM! readily identifies and visualizes taxonomic differences based on group classifications, which could include varied diets, drug treatment groups, disease states, or any other user-defined classification.
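The count estimation step described above can be illustrated with a minimal sketch. It is not the code used in our pipeline; the function name, feature names and numbers are hypothetical placeholders, and the relative abundances are assumed to be fractions that sum to 1 within a sample.

```python
def estimate_counts(relative_abundances, total_profiled_counts):
    """Scale per-feature relative abundances to estimated counts.

    relative_abundances: dict mapping feature name -> fraction of the sample
    total_profiled_counts: total sum of profiled counts for that sample
    """
    return {
        feature: fraction * total_profiled_counts
        for feature, fraction in relative_abundances.items()
    }


# Hypothetical sample with three placeholder features.
sample = {"feature_a": 0.25, "feature_b": 0.5, "feature_c": 0.25}
counts = estimate_counts(sample, 2000)
```

Because every feature in a sample is multiplied by the same total, the estimated counts preserve the within-sample proportions while restoring differences in profiling depth between samples.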
We can similarly explore the GO term abundance across samples using the ‘Explore Features’ tab, automatically identifying differentially abundant GO terms across samples based on user-controlled p-value and effect size cutoffs (Fig. 2d). Of those found to be significantly different, several antibiotic resistance-related GO terms were represented, including drug transmembrane transport, differing between stool and all other body sites tested (Fig. 2e). The taxa contributing to the abundance of this pathway also differed between sites, with high diversity, including E. coli and Bacteroides, found in stool samples (Fig. 2f).
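The cutoff-based selection described above can be sketched as follows. This is an illustration only and not the WHAM! implementation: the function name, default thresholds, term names and values are hypothetical, and the statistical test that would produce the p-values is not shown.

```python
def select_features(results, max_p=0.05, min_effect=1.0):
    """Return the names of features passing both user-chosen cutoffs.

    results: dict mapping feature name -> (p_value, effect_size)
    A feature is kept when its p-value is at most max_p and the
    magnitude of its effect size is at least min_effect.
    """
    return [
        name
        for name, (p_value, effect) in results.items()
        if p_value <= max_p and abs(effect) >= min_effect
    ]


# Hypothetical test results for four placeholder GO terms.
results = {
    "term_a": (0.001, 2.3),   # passes both cutoffs
    "term_b": (0.2, 3.0),     # large effect but not significant
    "term_c": (0.01, 0.1),    # significant but negligible effect
    "term_d": (0.04, -1.5),   # passes both (negative direction)
}
selected = select_features(results)
```

Requiring both cutoffs excludes features that are statistically significant but differ only negligibly between groups.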
Because of our interest in the emergence of antibiotic resistance, we chose to explore our test dataset for patterns in pathway abundances for antibiotic resistance mechanisms based on GO-term categories. By searching for these keywords in the ‘Feature Search’ tab, we detected several antibiotic resistance-related GO-term categories across the four body sites (Fig. 3a). Clicking on the features in the heatmap revealed significant differences in relative abundance levels of a subset of GO terms across the four body sites. These included the ‘response to antibiotic’ GO-term, which differed significantly in abundance in comparisons between stool and vagina, stool and saliva, vagina and saliva, and arm and saliva (Fig. 3b). Our analysis also showed relatively high abundance levels of antibiotic resistance gene families in saliva and a wide dispersion of these gene families in stool samples (Fig. 3a).
Further investigation via the ‘Feature Search’ tab also provided taxonomic identification corresponding to the differences in ‘response to antibiotic’ GO-term abundance across the four body sites. In arm samples, the ‘response to antibiotic’ GO-term was almost exclusively present in C. acnes, while in saliva and stool samples the contributing taxa were more diverse, with the highest prevalence occurring in Streptococcus oralis in saliva and Prevotella copri in stool (Fig. 3c). Such observations in other datasets can address a number of biologically relevant questions, including how commensal bacteria contribute to the spread of antibiotic resistance, how particular bacterial species are able to inhabit multiple body sites, and whether their attributes differ across those sites.
Correlation analyses of functional features can give users information about shared selection, or interactions between gene families, based on abundance patterns across the classification groups in the studied datasets. In this dataset, the high correlation among the antibiotic transporter activity (GO term 9), kanamycin kinase activity (GO term 11), and response to antibiotic (GO term 7) pathways suggests shared selection. These three pathways were also found to be anti-correlated with antibiotic metabolism (GO term 3) and beta-lactam antibiotic catabolism (GO term 5) (Fig. 3d). Establishing and evaluating these relationships in real time gives biomedical experts the opportunity to generate and test hypotheses on the fly.
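As a minimal sketch of how such correlated and anti-correlated feature pairs can be detected, the example below computes a Pearson correlation coefficient between abundance profiles. The choice of Pearson correlation is an assumption made for illustration and is not a statement about the method implemented in WHAM!; the function name and abundance values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / sqrt(var_x * var_y)


# Hypothetical abundances of three features across four samples.
transporter = [1.0, 2.0, 3.0, 4.0]
response = [2.0, 4.0, 6.0, 8.0]      # rises together with transporter
catabolism = [8.0, 6.0, 4.0, 2.0]    # falls as transporter rises

r_positive = pearson(transporter, response)
r_negative = pearson(transporter, catabolism)
```

Coefficients near 1 indicate features whose abundances rise and fall together across samples, and coefficients near -1 indicate anti-correlated features.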
Based on our findings at the GO-term level, we then investigated these samples at the gene family level, further demonstrating the utility of our tool for analyzing specific gene features in addition to broad-level feature analysis. Analysis of 114 UniRef90 gene families that mapped to the ‘response to antibiotic’ GO-term based on the HUMAnN2 mapping files showed relatively high levels of antibiotic resistance gene families in saliva and stool, with scattered extreme values also found in arm samples (Fig. 4a). Targeting a specific gene, the tetracycline resistance protein TetQ, we found that the contributions in saliva came primarily from Prevotella pallens, with more diverse contributions found in stool samples (Fig. 4b). Abundance levels differed significantly in all pairwise body site comparisons except that between arm and vagina (Fig. 4c). Focusing further on the tetracycline resistance genes, there was shared expression in stool and saliva samples, with non-zero abundance of tetracycline resistance protein class B found in saliva only (Fig. 4d). Cross-comparison of the tetracycline gene families identified high correlation for a subset of genes (TetQ, TetW, TetO) (Fig. 4e), all of which were abundant across stool samples.
Lastly, we demonstrate the use of WHAM! for exploration and visualization of a second test dataset, derived from the EBI metagenomic service, describing the metagenomic profiling of 15 preterm infants [28]. We used the ‘Explore Your Data’ module to visualize the taxa present and their relative abundance in babies born via vaginal or cesarean delivery (Fig. 5a). This analysis identified 17 taxa that differed significantly between experimental groups, including clinically important strains of Staphylococcus, such as S. aureus, which was significantly more abundant with cesarean delivery (adjusted p = 0.005) (Fig. 5b). Further analysis using the ‘Explore Features’ tab identified several Staphylococcus-associated virulence proteins, including a Staphylococcal hemolytic protein family and Staphylococcal AgrD, which is involved in quorum-sensing signaling to release exoproteins involved in virulence [29] (Fig. 5c). Both features were identified as differentially abundant between the conditions (adjusted p = 0.0014 and p = 0.0042, respectively). We provide this information to illustrate how WHAM! can facilitate the discovery of taxa and their genes that could be of clinical significance. Although S. aureus can be an important pathogen in infants [30], the available metadata do not permit assessment of its clinical significance in this study.
These implementation examples demonstrate how WHAM! can be applied to metagenomics data to easily identify and visualize biologically relevant relationships and to generate novel hypotheses. Recently developed tools, Metaviz [31], BURRITO [32] and MetaComp [33], address similar challenges; however, WHAM! has several important differences. Although visually striking and useful, Metaviz focuses on taxonomic analysis without factoring in biological processes, gene features or pathways [31]. Like WHAM!, BURRITO enables users to interactively explore their metagenomics data, but it lacks feature searching and hypothesis testing and provides fewer statistical tests for relative abundance across groups than WHAM! [32]. MetaComp has robust statistics and accepts a range of inputs, but it requires an external download and installation, which can lead to unexpected issues depending on the user’s compute platform [33]. WHAM! allows for web-based hypothesis generation based on both taxa and functional features, permitting on-the-fly confirmation and figure generation, and thereby adds substantially to the current suite of tools available for metagenomic analysis.