COMAN accepts as input the Illumina paired-end sequencing reads in FASTQ format and a metadata file specifying the sample conditions for comparative statistical analysis. A sample input dataset is offered to guide users on data formatting requirements (by clicking the “SampleData” button in the homepage). It is optional to upload a file containing the metagenomic taxonomic abundance profiles, which will only be used for “transcription activation analysis” (as described below). All uploaded data and results generated by COMAN are restricted to the user who initiated the job. A comprehensive guidance on COMAN usage can be found in the “Instructions” section of the server homepage.
The whole COMAN metatranscriptome data analysis pipeline (Fig. 1) mainly involves quality control, removal of reads derived from non-coding RNAs, functional annotation, comparative statistical analysis, taxonomy-associated functional analysis and co-expression network analysis. Details on key steps are shown below:
Quality control and removal of non-coding RNA
The uploaded NGS reads are subject to an initial quality control step to remove the adapter regions and low quality reads following a previously described approach [15]. Afterwards, all the QC-passed reads are mapped, using BLASTN, to an in-house non-coding RNA database (see Results for details and evaluation) to filter out the reads derived from non-coding RNAs, including ribosomal RNA and tRNA. The reads with best BLAST hits at e-value < 10−5 are excluded from downstream analysis.
Mapping to reference genomes
In this step, all the high quality reads after depletion of non-coding RNA are further mapped to a reference genome database at 1e-5 cutoff, which includes more than 2700 NCBI complete microbial genomes (accessed at ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/all.faa.tar.gz). We used a much faster and highly sensitive tool named DIAMOND [16] within the COMAN pipeline, since using BLASTX to perform this task is too time-consuming and not practically feasible, especially for a web-based application.
Functional annotation of genes and reads
Functional annotations of those reference genomes have been pre-prepared at the COMAN backend. This includes commonly used annotation systems: Clusters of Orthologous Groups (COG) [17] and KEGG Ortholog groups (KO) [18]. The annotation to COG was conducted using RPS-BLAST against the CDD database (v.3.10) at 1e-5 cutoff, whereas the KO annotation was through the combinatorial use of DIAMOND and KOBAS 2.0 annotate program [19]. In addition, we used PRIAM (with default parameters) [20] to annotate the genes to enzymes (Enzyme Commission numbers) (ECs) that are further used to achieve the profiling of MetaCyc pathways [21].
Comparative statistical analysis
Based on the mapping of reads to reference genomes and the functional annotation results, COMAN performs functional profiling and calculates the relative abundance for each functional group and enzyme, as well as for a higher hierarchy level in the annotation system. This high-level profiling includes COG categories, KEGG pathways, KEGG pathway classes, KEGG modules, and various levels of MetaCyc pathways.
The clustering of all samples using multidimensional scaling (MDS) is incorporated within the pipeline. Once the functional profiling based on COG and KO is completed, the metadata file will be used to conduct differential expression (DE) analysis (Wilcoxon rank sum test, with the default cutoff of FDR-corrected p-value being 0.10), in order to characterise the potential functional changes between two different conditions.
Researchers are often highly interested in the associations between pathway variations and biological phenotypes observed. To facilitate such process, we integrate pathway inference and analysis based on KEGG and MetaCyc into the COMAN pipeline. Despite certain overlap and differences [22], they are both commonly used pathway systems during genomic analysis and metabolic reconstruction, and users of COMAN will have access to results derived from both databases. Most importantly, COMAN uses MinPath [23] to infer the pathways represented in the submitted microbial communities, based on functional annotations of genes and the relationship between pathways and functional groups (KO for KEGG; EC for MetaCyc). Compared with the naive one-hit-match mapping approach that may inflate the number of inferred pathways leading to overestimation of functional diversity, MinPath eliminates some spurious pathways and achieves a more trustworthy inference of biological pathways present in the samples [23].
Afterwards, COMAN incorporates GAGE [24] to infer the significantly enriched or depleted pathways when comparing two conditions. Users may apply a suitable threshold of FDR-corrected p-values to identify such pathways that display coordinated differential expression over the whole pathway and that can be associated with the biological phenotypes. This pathway enrichment analysis is applied to KEGG pathways, KEGG modules and MetaCyc pathways.
Functional profiling, differential expression and pathway enrichment analysis elucidate the functional states and dynamic changes or responses of the involved microbiota community. It is also useful to relate the observed functional variations with particular taxa [25]. For this purpose, COMAN performs a taxonomic contribution analysis to identify which microbial phylum is mostly responsible for community-level expression variation, for each of the most varied functional groups (Fig. 1). These “most varied functional groups” refer to either significantly differentially expressed ones based on user-specified FDR-corrected p-value, or the ones with highest fold-change (up- 50 and down-regulated 50 groups) when the former is not available. Last but not least, COMAN incorporates a taxonomic distribution analysis, where the expression distributions of each functional group among different phyla at both conditions are calculated.
Transcription activation analysis
If the user has the taxonomic abundance profiles across all samples generated by metagenomic sequencing and analysis, this file (phylum-level) can be uploaded to COMAN at the data uploading process. Afterwards, COMAN normalises the gene expressions using the taxonomic abundances, followed by comparison and visualisation of the normalised expression levels in two conditions. This further elucidates whether the observed expression variations of certain genes are derived from varied transcription levels (transcriptional activation or repression) or simply caused by bacterial taxonomic composition differences.
Co-expression network analysis
In co-expression network, genes having similar or related functions tend to possess similar expression profiles and thus tend to be clustered together [26, 27]. COMAN calculates the pairwise correlations (Spear correlation) for the “most varied functional groups” aforementioned, based on their expression profiles among different samples in one condition, and generates a co-expression network accordingly. Afterwards, the random walk algorithm is used to find densely connected sub-networks, or communities within the network (Fig. 1). The elements within the same community represent concordant behaviours, such as similar responses to a particular stimulus, and thus the similar or closely related functions.
The results from COMAN co-expression analysis include TAB-delimited data files presenting the involved functional groups and their topological properties and belonged communities, as well as high-quality figures for network visualisation with deduced communities (Fig. 1). Moreover, since such network typically contains extensive information, a web-based interactive network is provided in the result page for deeper inspection. For clarity purposes, all detected communities with fewer than 3 elements are merged into one single residual module. Users may also investigate the “hub nodes” in the resulting network, which represent genes or functional groups with extremely high connectivity.
Despite that only the “most varied functional groups” are involved in COMAN, users may use the abundance profiles for all functional groups or any subset of interest (COGs, KOs, or ECs) to construct a global co-expression network and perform network topological analysis. Note that to perform a meaningful co-expression network analysis using COMAN, the minimal number of samples within one condition has to be greater than four. However, we recommend the sample size to be larger than eight in order to generate findings of more biological relevance with lower false positives.
Profiling of Biosynthetic Gene Clusters for analysis of microbial secondary metabolites
Biosynthetic gene clusters (BGCs) are physically clustered gene sets responsible for the synthesis of microbial secondary metabolites, whose importance and widespread distribution in the human microbiome have been demonstrated before [28]. Despite the greater understanding of different microbiota in varying environments, there is still paucity of characterised metabolites that are synthesized by the microbial community and that may contribute to the differential phenotypes [15]. Here we plugin antiSMASH 3.0 [29] for BGC annotation into the COMAN pipeline, to facilitate the characterisation and comparison of secondary metabolite biosynthesis among different conditions. Using our pre-prepared identification of BGCs from the NCBI reference genomes, the BGCs and their abundances in each submitted sample can be calculated following similar rules as Donia et al [28]: 50 % of genes (after excluding non-biosynthetic ones) in each BGC need to be covered by reads; the abundance score of this identified BGC is defined by taking the average abundance of these genes. Further, the abundances of BGCs producing different types of secondary metabolites (BGC types, as defined by antiSMASH) are calculated, followed by a comparative statistical analysis (Wilcoxon rank sum test) to identify differentially abundant BGC types. This reveals the active production capability of different types of secondary metabolites by the involved microbiota. A bar chart illustrating the relative proportion of different types is provided as well. Note that this BGC-related analysis module in COMAN reflects the transcriptionally active BGCs in the microbial community.
Results visualisation and access
COMAN produces a collection of easy-to-interpret figures upon job termination. The example output of a complete dataset derived from a human gut microbiome project is available on the server (the “Example Data” section of the server homepage). The essential data in tabular format (“TAB” delimited file) are provided, therefore the user may use them to perform additional analyses, which include the integration with other tools as well as more sophisticated analyses for advanced users. For relatively basic figures such as bar charts, they can make custom modifications of figures using their preferred programs. For co-expression network visualisation, users may choose CytoScape with a graphical user interface or use the R script provided (Additional file 1) to apply different filtering and layout parameters to the network. Moreover, COMAN keeps the raw sequence mapping files in BLAST m8 format that includes the details of sequence alignment (e.g. identity, alignment position, e-value). A detailed documentation called “README.txt” describing each output file is included for every completed job to help users interpret the results.
All results generated by COMAN are compressed to simplify data download. Since the sequence mapping results are high data volume and take considerably more time to download, COMAN separates such files with those generated by functional analysis for the convenience of users.