Input files
Three tab-delimited text/csv input files (gene annotation, sample-group definition, and gene expression matrix) are required before running iGEAK (Fig. 1a-c); preparing them is straightforward (Fig. 1d). iGEAK does not directly use or process raw CEL (microarray) or FASTQ/BAM (RNA-seq) files. Users who have only raw files (CEL, FASTQ, BAM) and no bioinformatics skills can obtain a gene expression matrix with online tools such as ArrayAnalysis.org [8] and the Galaxy platform (https://usegalaxy.org). The Galaxy platform provides a variety of sequencing read aligners (e.g. TopHat (https://ccb.jhu.edu/software/tophat/index.shtml) or HISAT2 (https://ccb.jhu.edu/software/hisat2/index.shtml)) and read counting tools (e.g. featureCounts (http://bioinf.wehi.edu.au/featureCounts) or htseq-count (https://htseq.readthedocs.io/en/release_0.11.1)). iGEAK’s three input files can be prepared directly from this gene expression matrix and the sample-group annotations (Fig. 1d).
A "gene annotation" file (Fig. 1a) is a tab-delimited two-column text/csv file. The first column contains unique gene identifiers, such as Affymetrix probeset IDs (microarray) or Ensembl gene IDs (RNA-seq). The second column should always contain gene symbols (or be left blank when no stable gene symbol is matched to a given unique gene ID). Various gene ID types can be converted into gene symbols using DAVID (https://david.ncifcrf.gov/conversion.jsp) or BioDBnet (https://biodbnet-abcc.ncifcrf.gov/db/db2db.php). If raw gene counts (RNA-seq) were summarized by gene symbol, both columns should contain gene symbols.
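As an illustration of the two-column layout described above, the following minimal Python sketch writes a gene annotation file from a list of gene IDs and an ID-to-symbol mapping. The function name `write_gene_annotation` is hypothetical, the third Ensembl ID is a made-up placeholder, and the sketch assumes that no header row is needed; the exact format should be checked against the iGEAK project site.

```python
import csv
import io


def write_gene_annotation(gene_ids, symbol_by_id, handle):
    """Write one tab-delimited row per gene: unique gene ID, gene symbol."""
    seen = set()
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene_id in gene_ids:
        # The first column must hold unique identifiers.
        if gene_id in seen:
            raise ValueError(f"duplicate gene ID: {gene_id}")
        seen.add(gene_id)
        # Leave the symbol blank when no stable symbol is matched to this ID.
        writer.writerow([gene_id, symbol_by_id.get(gene_id, "")])


buffer = io.StringIO()
write_gene_annotation(
    ["ENSG00000141510", "ENSG00000012048", "ENSG00000999999"],  # last ID is a placeholder
    {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"},
    buffer,
)
annotation_text = buffer.getvalue()
print(annotation_text)
```

Replacing `io.StringIO()` with an opened file handle writes the same content to disk.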
A "sample-group definition" file (or "metadata", Fig. 1c) is a tab-delimited multi-column text file. The first column contains sample IDs and all other columns contain sample group definitions. iGEAK can handle up to 10 different group definitions (columns). Users must choose one group definition column when this file is uploaded to iGEAK.
A "gene expression matrix" file (Fig. 1b) is a tab-delimited data matrix of log2-transformed normalized gene expression values (microarray) or raw gene counts (RNA-seq). Detailed formats for the 3 input files are described at the iGEAK project site. Sample input files for microarray and RNA-seq studies are also available at the site.
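Before uploading, it can help to confirm that the sample IDs agree between the expression matrix and the sample-group definition file. The sketch below is one way to do this in Python. It assumes that the first column of the matrix header labels the gene IDs and that the metadata rows are passed without a header row; `check_inputs` and its arguments are hypothetical names and not part of iGEAK.

```python
def check_inputs(matrix_header, metadata_rows, max_group_columns=10):
    """Return sample IDs that appear in only one of the two files."""
    matrix_samples = set(matrix_header[1:])  # assumption: column 1 labels the gene IDs
    metadata_samples = set()
    for row in metadata_rows:
        # The text states that up to 10 group-definition columns are handled.
        if len(row) - 1 > max_group_columns:
            raise ValueError(f"too many group columns for sample {row[0]}")
        metadata_samples.add(row[0])
    only_in_matrix = sorted(matrix_samples - metadata_samples)
    only_in_metadata = sorted(metadata_samples - matrix_samples)
    return only_in_matrix, only_in_metadata


matrix_header = ["GeneID", "S1", "S2", "S3"]
metadata_rows = [
    ["S1", "control", "female"],
    ["S2", "treated", "male"],
    ["S4", "treated", "female"],
]
mismatches = check_inputs(matrix_header, metadata_rows)
print(mismatches)
```

Two empty lists indicate that every sample in the matrix has a group definition and vice versa.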
Tabs
Each analysis step (Fig. 2a and b) corresponds to a “tab” in iGEAK’s GUI. When a user changes input data or parameters and clicks a tab, the tab (function) instantly updates its outputs, so that the user can quickly explore the data using different combinations of parameters. Some tabs (“Introduction”, “Venn Diagram”, and “Orthologs”) are independent, but the others are interconnected. Currently 14 tabs are implemented in iGEAK. A detailed step-by-step tutorial (slides and video clip) using a sample RNA-seq dataset is available at the iGEAK project homepage (see the “Crash Course” section).
“Introduction” tab: This tab provides brief descriptions about iGEAK, the copyright disclaimer, and action buttons displaying pre-installed R/Bioconductor packages and iGEAK session information.
“Data Upload” tab: The three input files are uploaded to iGEAK using this interface. Users also need to choose the species (human or mouse), the sample groups, and the parameters for filtering sub-optimal probesets (microarray) or low-count genes (RNA-seq). For RNA-seq data, a normalized gene count matrix is also generated at this step using the trimmed mean of M-values (TMM) normalization method [9].
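To illustrate what filtering low-count genes can look like, the sketch below keeps a gene only if its counts-per-million (CPM) value reaches a threshold in a minimum number of samples. This is a common convention and an assumption here, not a description of iGEAK's own filter; the function name and the `min_cpm` and `min_samples` parameters are hypothetical. TMM normalization itself is not reproduced.

```python
def filter_low_count_genes(counts, min_cpm=1.0, min_samples=2):
    """counts maps each gene to a list of raw counts, one per sample."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total raw count of each sample (column sum).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for gene, row in counts.items():
        cpm = [1e6 * count / lib for count, lib in zip(row, lib_sizes)]
        # Keep the gene if enough samples reach the CPM threshold.
        if sum(value >= min_cpm for value in cpm) >= min_samples:
            kept[gene] = row
    return kept


raw_counts = {
    "GeneA": [500, 600, 550],
    "GeneB": [0, 1, 0],
    "GeneC": [499500, 399399, 449450],
}
filtered = filter_low_count_genes(raw_counts)
print(sorted(filtered))
```

In this toy matrix, GeneB reaches the threshold in only one sample and is dropped.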
“PCA” tab: iGEAK creates an interactive principal component analysis (PCA) plot and a sample (Pearson) correlation plot based on transcriptomes (Fig. 3a and b). These two plots help users to quickly and visually identify outlier samples. When outlier samples are detected, users can re-group them by editing the group-definition file in a spreadsheet program (e.g. Excel) or a text editor, and then re-uploading the updated file to iGEAK.
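The idea behind the sample correlation plot can be sketched in a few lines of Python: compute the Pearson correlation between every pair of samples and look for a sample that correlates poorly with the rest. The expression values below are invented for illustration, and ranking samples by their mean correlation is an assumed heuristic, not iGEAK's procedure (in iGEAK, outliers are judged visually).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_correlation(samples):
    """Average correlation of each sample with all other samples."""
    scores = {}
    for name, values in samples.items():
        others = [pearson(values, v) for other, v in samples.items() if other != name]
        scores[name] = sum(others) / len(others)
    return scores


# Toy log2 expression values for four genes in three samples.
samples = {
    "S1": [5.1, 7.2, 3.3, 9.0],
    "S2": [5.0, 7.5, 3.1, 9.2],
    "S3": [8.9, 3.0, 7.4, 5.2],
}
scores = mean_correlation(samples)
candidate_outlier = min(scores, key=scores.get)
print(candidate_outlier, scores)
```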
“Multi-group” tab: This tool is designed for users who want to quickly check the expression patterns (heatmap (Fig. 3c), boxplot (Fig. 3d), test statistics) of a given gene set across multiple samples and groups. Users can perform parametric tests (analysis of variance (ANOVA) and post-hoc Tukey’s test) or non-parametric tests (Kruskal-Wallis and Mann-Whitney U-test), depending on the samples’ characteristics and the research design/goals. To help users choose between parametric and non-parametric tests, iGEAK provides (1) Shapiro-Wilk normality test statistics and (2) group dispersion (= standard deviation for each group). If the Shapiro-Wilk test p-value is > 0.05, the parametric tests (ANOVA and Tukey’s test) are preferred, as the expression data do not appear to violate the normality assumption. However, if each group’s sample size is > 15 and there are 2–9 groups in total, the parametric tests can perform well even with continuous data that are slightly non-normal. Users are advised to choose the non-parametric tests (Kruskal-Wallis and post-hoc pairwise Mann-Whitney U-test) if the expression data violate the normality assumption and/or the total sample size is very small, provided that the data for all groups have the same dispersion. If sample groups have different dispersions, the non-parametric tests might not provide valid results.
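The guidance above can be summarized as a small decision function. The following Python sketch is only a reading of that text: `suggest_test` is a hypothetical helper, and `sd_ratio_limit` (how far group standard deviations may differ while still being treated as "the same dispersion") is an assumed cutoff that the text does not specify. The "very small total sample size" condition is not quantified in the text and is therefore not encoded.

```python
def suggest_test(shapiro_p, group_sizes, group_sds, sd_ratio_limit=2.0):
    """Suggest a test family from normality, group sizes and dispersions."""
    looks_normal = shapiro_p > 0.05
    # Parametric tests tolerate slight non-normality with large, not-too-many groups.
    large_groups = all(n > 15 for n in group_sizes) and 2 <= len(group_sizes) <= 9
    if looks_normal or large_groups:
        return "parametric (ANOVA + Tukey)"
    # Assumed rule of thumb for "same dispersion": largest SD within a fixed
    # multiple of the smallest SD.
    similar_dispersion = max(group_sds) <= sd_ratio_limit * min(group_sds)
    if similar_dispersion:
        return "non-parametric (Kruskal-Wallis + Mann-Whitney U)"
    return "non-parametric, interpret with caution (unequal dispersion)"


print(suggest_test(0.30, [5, 5, 5], [1.0, 1.1, 0.9]))
print(suggest_test(0.01, [20, 18, 16], [1.0, 1.4, 1.2]))
print(suggest_test(0.01, [5, 5], [1.0, 1.2]))
print(suggest_test(0.01, [5, 5], [1.0, 3.5]))
```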
“Two-group” tab: The most common microarray or RNA-seq data analysis is a two-group comparison. Users choose two sample groups here and this decision affects the following 5 tabs (“DEG”, “Heatmap”, “Volcano Plot”, “PPI”, and “ORA”) that use two sample groups. The two sample groups can be any subset of selected multi-groups from the “Data Upload” tab. During RNA-seq data analysis, raw gene counts from samples in two groups are re-processed for read count normalization.
“DEG” tab: To predict differentially expressed genes (DEGs) at the gene level, iGEAK uses the R/Bioconductor limma [10] package for microarray data and the edgeR [9] or voom-limma [11] packages for RNA-seq data. For RNA-seq data analysis, users need to choose between edgeR and voom-limma. Once users set the 3 filtering parameters (minimum fold change, p-value, multiple testing), this tab reports a filtered DEG list and statistics such as the log2 fold change (logFC), the p-value, and the p-value adjusted by the chosen multiple-testing correction method. If users do not choose a multiple-testing method (“none”), the adjusted p-values and the original p-values are the same. The report table provides URL links to the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene) in the last column.
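The effect of the three filtering parameters can be illustrated with a short Python sketch. It assumes Benjamini-Hochberg as the example correction method (the text does not list which methods are offered besides “none”), uses invented logFC and p-values, and the function names are hypothetical.

```python
import math


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_degs(results, min_fold=2.0, p_cutoff=0.05, method="BH"):
    """results: list of (gene, logFC, p-value); returns the genes passing the filter."""
    pvalues = [p for _, _, p in results]
    # With "none", adjusted and original p-values are identical.
    adjusted = bh_adjust(pvalues) if method == "BH" else pvalues
    min_logfc = math.log2(min_fold)
    return [
        gene
        for (gene, logfc, _), adj_p in zip(results, adjusted)
        if abs(logfc) >= min_logfc and adj_p <= p_cutoff
    ]


results = [("G1", 2.0, 0.01), ("G2", -1.5, 0.04), ("G3", 0.3, 0.03), ("G4", 1.2, 0.20)]
print(filter_degs(results, method="none"))
print(filter_degs(results, method="BH"))
```

In this toy example, G2 passes without correction but is removed once its p-value is adjusted, which shows why relaxing the multiple-testing setting yields more DEGs.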
Three parameters used in the “DEG” tab affect “DEG”, “Volcano Plot”, “Heatmap”, “PPI”, and “ORA” tabs. To change output results from these tabs, users need to revisit the “DEG” tab to adjust filtering parameters. The “Broad-GSEA” and “GSEA” tabs are not affected by these parameters because the GSEA method uses all genes in the original expression data matrix.
“Volcano Plot” tab: Differentially expressed genes are visualized using this interactive volcano plot, which shows the distribution of DEGs in terms of log10 (adjusted) p-value and log2 fold change. Users can quickly identify the most strongly changed genes and retrieve their differential expression information by drawing a selection window around them with the mouse. This plot can be downloaded as an image file.
“Heatmap” tab: iGEAK provides a highly reconfigurable heatmap and boxplot generation tool. Users can adjust the width, height, font size, tree height, scaling method, clustering type, and color. A heatmap is generated by clicking the “Create/Reset” button. If no gene (symbol) of interest is submitted, a heatmap including all DEGs is generated; in this case no boxplot is created because the total number of DEGs could be too large. If a subset of DEGs is submitted, both a heatmap and boxplots are created. These plots can be downloaded as image files.
“PPI” tab: This tab is a protein-protein interaction (PPI) and transcription-control network visualization tool using the visNetwork (https://github.com/datastorm-open/visNetwork) package (Fig. 3e). The network nodes (proteins) and edges (interactions) are color-coded based on their fold change levels and interaction type (PPI, TF, PPI + TF), where TF represents a “transcription factor (TF)-target interaction”.
Users can change network layouts, search for genes, edit the network directly, and download it as an image file. The PPI network may be sparse if only a small number of DEGs are predicted. In this case, users may revisit the “DEG” tab and relax the filtering parameters (i.e. lower minimum fold change, higher p-value cutoff, no multiple testing) to obtain more DEGs.
The physical PPI information was extracted from BioGrid (https://thebiogrid.org, v3.4). Transcription factors (TFs) and target genes with conserved (human, mouse, rat, and dog) TF binding sites are also visualized. The backbone PPI network is extended by adding TFs (“star” shaped node) and genes having transcription factor binding sites (TFBS) within promoters and/or 3′-UTRs. This information is extracted from MSigDB’s C3 dataset (http://software.broadinstitute.org/gsea/msigdb). Currently, iGEAK provides human and mouse data only.
“ORA” tab: iGEAK provides over-representation analysis (ORA) based on the Reactome database (http://www.reactome.org) and the ReactomePA package [12]. A summary dot plot (top 30 most highly enriched gene sets, Fig. 3f) and a detailed enrichment result table are reported in this tab. This analysis uses the DEGs predicted in the “DEG” tab. If no result is reported, users may need to adjust the filtering parameters in the “DEG” tab to increase the number of DEGs.
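For readers unfamiliar with ORA, the statistic behind it is commonly formulated as a hypergeometric tail probability: given a universe of N genes, a gene set of size K and n DEGs, how likely is an overlap of at least k genes by chance? The Python sketch below implements this textbook formulation with invented numbers; it is not a reproduction of the ReactomePA implementation, and `ora_pvalue` is a hypothetical name.

```python
from math import comb


def ora_pvalue(n_universe, n_in_set, n_degs, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution."""
    total = comb(n_universe, n_degs)
    upper = min(n_in_set, n_degs)
    tail = sum(
        # Ways to pick k DEGs from the gene set and the rest from outside it.
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in the universe, a gene set of 4, 3 DEGs, 2 of them in the set.
p_value = ora_pvalue(n_universe=10, n_in_set=4, n_degs=3, n_overlap=2)
print(round(p_value, 4))
```

With very few DEGs the overlap with any gene set is necessarily small, which is one reason an ORA run can return no significant result.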
“GSEA” tab: iGEAK provides a light version of Gene Set Enrichment Analysis (GSEA) [13]. The current version is based on the Reactome database and the ReactomePA package. Currently, iGEAK supports human and mouse data only. This tab is not affected by parameter settings in the “DEG” tab.
“Broad-GSEA” tab: For users who prefer a stand-alone Java-based GSEA program from Broad Institute (http://software.broadinstitute.org/gsea), iGEAK provides three GSEA input files (expression (txt), phenotype (cls), annotation (chip)). Users can choose many different reference gene set databases and refined functions / metrics in the Broad’s GSEA program. This tab is not affected by the “DEG” tab because all genes (not DEGs) are included in the files.
“Venn Diagram” tab: This Venn diagram tool creates a highly reconfigurable 1- to 5-way Venn diagram (Fig. 3g) and a summary table showing the genes in each section. Users can change the group titles, font size, plot size, color, and transparency, and download the final Venn diagram as an image file. We re-wrote several functions in R’s original vennDiagram package for better plotting and data extraction.
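The summary table of genes per section corresponds to the exclusive regions of the Venn diagram. As a rough illustration of that data extraction step (not the re-written R functions themselves), the Python sketch below computes, for each combination of input lists, the genes found in exactly those lists. The function name and the gene lists are made up for the example.

```python
from itertools import combinations


def venn_regions(named_sets):
    """Map each combination of set names to the genes found only in that combination."""
    names = list(named_sets)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(named_sets[n] for n in combo))
            # Genes present in any set outside this combination are excluded.
            outside = set().union(*(named_sets[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                regions[combo] = sorted(exclusive)
    return regions


gene_lists = {
    "A": {"TP53", "BRCA1", "MYC"},
    "B": {"MYC", "EGFR"},
}
regions = venn_regions(gene_lists)
print(regions)
```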
“Orthologs” tab: Gene symbols are intuitive, but not ideal unique identifiers because they can change. The HUGO Gene Nomenclature Committee (HGNC, https://www.genenames.org) and The Mouse Genome Informatics Database (MGI, http://www.informatics.jax.org) are the authoritative resources of the official human and mouse genes and their updated gene symbols.
Converting gene symbols between human and mouse is sometimes tricky. When a human gene symbol and its mouse counterpart use the same letters, the only difference is that human gene symbols are all uppercase, whereas mouse gene symbols are lowercase except for the first letter. The first tool in this tab provides this simple case conversion and is also useful for converting gene symbols into protein symbols (all uppercase, for human and mouse). However, in many cases, a human gene and its orthologous/homologous mouse genes have different gene symbols. For these cases, iGEAK provides a parsed Ensembl-v92 dataset (https://useast.ensembl.org) to retrieve inter-species (i.e. between human and mouse) orthologs/homologs.
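The simple case conversion described above amounts to two string operations, sketched here in Python with hypothetical function names. The last call shows why a case change alone is not always enough: it turns human TP53 into "Tp53", whereas the official mouse symbol is Trp53, so such genes need an ortholog table lookup instead.

```python
def to_mouse_style(symbol):
    """First letter uppercase, remaining characters lowercase (mouse convention)."""
    return symbol[:1].upper() + symbol[1:].lower()


def to_human_style(symbol):
    """All uppercase (human gene symbols; also protein symbols)."""
    return symbol.upper()


print(to_mouse_style("BRCA1"))  # matches the mouse symbol Brca1
print(to_human_style("Brca1"))
print(to_mouse_style("TP53"))   # case change only; the mouse ortholog is Trp53
```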