KBCommons accounts, groups and data sharing
Account registration
KBCommons allows users to create a personal account on the sign-up page by providing the required information. Once registered, users can modify their personal profile, upload a profile picture, and list all groups in KBCommons. With an account, users can bring in private datasets for any organism and visualize any public or shared dataset through the KBCommons interface.
Creation of groups
All users can create collaborative groups. A group’s creator has full privileges to approve or reject requests to join the group, and all such requests are sent through the KBCommons notification system. Group creators can also manage datasets, share datasets with group members, and delete datasets. Each user’s profile page lists all groups along with the group details and the status of any join request.
Sharing data with group members
All uploaded datasets are private by default, and their ownership and access permissions can be modified by the owner. Using these dataset privileges, the owner of a dataset can share it with any group or group member. All group members with access permission can retrieve and visualize the shared data.
KBCommons key features
Creating a new Knowledge Base
KBCommons can import data for new organisms and create an entirely new KB for organisms not yet in KBCommons. It provides an easy-to-use automated procedure to import the six essential files (genome, CDS, protein and cDNA sequences, gene annotation, and GFF files) into our database, from Ensembl for animals and from Phytozome for plants. After the six essential files have been uploaded, the genome version is verified by comparing the MD5 checksums of the uploaded files with those of the original Ensembl or Phytozome files. The workflows for KB creation and data contribution are shown in Fig. 2.
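The checksum comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the KBCommons implementation: the function names `md5_of_stream` and `matches_reference` are hypothetical, and the reference checksum is assumed to have been obtained separately from the original Ensembl or Phytozome file.

```python
import hashlib


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in fixed-size chunks."""
    digest = hashlib.md5()
    # Reading in chunks keeps memory use flat even for large genome files.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_reference(uploaded_path, reference_md5):
    """Hypothetical helper: compare an uploaded file with a known checksum."""
    with open(uploaded_path, "rb") as handle:
        # Normalize the reference so case and trailing whitespace do not matter.
        return md5_of_stream(handle) == reference_md5.strip().lower()
```

A file whose digest differs from the reference would be treated as a different genome version or a corrupted upload.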
Contribution to KBCommons
KBCommons supports uploading users’ new multi-omics data, including SNPs, indels, methylation, metabolomics expression, proteomics, RNA-Seq and microarray data. Users can use this feature on any existing KB or after creating a new KB for an organism. Through its data processing module, KBCommons processes the uploaded data and imports them into the appropriate database according to genome version, dataset type and other customized options. To ensure that users do not upload incorrect or false-positive data, KBCommons accepts only standard file formats, including FASTA for sequence data, FPKM or read counts for gene expression, and VCF for single nucleotide polymorphism (SNP) data. It also applies validation rules that screen submissions for junk data or characters and for incorrect information, so that invalid data are not inserted.
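As an illustration of the kind of format screening described above, the following sketch checks FASTA text for missing headers, empty records and unexpected characters. The actual KBCommons validation rules are not specified in the text; the function name `validate_fasta` and the default nucleotide alphabet are assumptions made for this example.

```python
def validate_fasta(text, alphabet="ACGTN"):
    """Return a list of problems found in FASTA text (empty list if none)."""
    problems = []
    allowed = set(alphabet.upper())
    seen_header = False
    has_sequence = True  # nothing to complain about before the first record
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if seen_header and not has_sequence:
                problems.append(f"line {number}: previous record has no sequence")
            if len(line) == 1:
                problems.append(f"line {number}: empty header")
            seen_header, has_sequence = True, False
        elif not seen_header:
            problems.append(f"line {number}: sequence before first header")
        else:
            has_sequence = True
            bad = sorted(set(line.upper()) - allowed)
            if bad:
                problems.append(f"line {number}: unexpected characters {''.join(bad)}")
    if not seen_header:
        problems.append("no FASTA header found")
    elif not has_sequence:
        problems.append("last record has no sequence")
    return problems
```

For protein sequences, a caller would pass an amino acid alphabet instead of the nucleotide default.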
Adding version to KBCommons
KBCommons allows users to add new genome versions to existing organism KBs, and to update current organism KBs, by uploading the six essential files and filling out organism details such as organism type, name, model version and genome version. KBCommons then uses the data processing module to prepare the database required for searches and for tools such as multiple sequence similarity analysis. Once a user has added a new genome version to an existing KB, they can start bringing in multi-omics datasets corresponding to that genome version.
KBCommons browsing
The Browse KBCommons tab displays all existing organism KBs with their versions. Organisms are grouped into four main categories: Animals and Pets; Plants and Crops; Microbes and Viruses; and Humans and Diseases. In addition to this classification, we provide a model organism section, which displays model organisms from all categories. All available genome versions are shown as a list in the drop-down menu of the corresponding organism KB.
Data sources
The data in KBCommons come from multiple sources. Much of the data incorporated in KBCommons is public and accessible to all users without login. KBCommons also incorporates and integrates private data collected from our collaborators, which are available only to group members. Information on all data is shown on the Data Source page, accessible from the top menu bar of the KBCommons home page. Currently, KBCommons incorporates genome data for Zea mays, Arabidopsis thaliana, Mus musculus, Homo sapiens, Rattus norvegicus, Canis familiaris and Caenorhabditis elegans. KBCommons also has information about traits, SNPs, annotated metabolites, miRNAs and gene entities. The gene models, genomic sequences and functional annotation information were acquired from Ensembl and Phytozome. KBCommons has experimental data from Illumina RNA-Seq experiments covering various tissue types. It also hosts miRNAs and their expression abundances from the Cancer Cell Line Encyclopedia (CCLE) [30], The Cancer Genome Atlas (TCGA) [31] and the microRNA database (miRBase) [32], as well as TCGA gene expression data for 9264 tumor samples across 24 cancer types. The pathway information was acquired from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [33].
KBCommons search options
The KBCommons home page (Fig. 3a) provides users with entry points to all features provided by our Knowledge Base. All Knowledge Base web pages (Fig. 3b) share a similar layout, with a navigation bar at the top for easy access. The navigation bar has links to different sections including Search, Browse, Tools and General Information.
Gene card
The Gene Card page (Fig. 4a) provides information about gene name, gene version, gene family and alias names; gene models with their introns, exons and UTRs; chromosomal information including gene coordinates and strand; cDNA, CDS and protein sequences; functional annotations including Pfam [34] and Panther [35]; and links to the pathway viewer. It also provides visualization tools that show copy number variation data (Fig. 4b), transcriptomics data from microarray (Fig. 4c) or RNA-Seq (Fig. 4d) experiments, and other omics data types as graphical charts.
miRNA card
The miRNA Card (Fig. 5a) contains information about experimentally validated or predicted miRNAs, including the mature miRNA sequence, accession ID, and predicted target genes with their gene coordinates, conservation value, alignment score, binding energy and mirSVR score. miRNA expression data from TCGA and miRBase have been incorporated for browsing on the miRNA Card pages.
Metabolite card
The Metabolite Card (Fig. 5b) stores information about metabolites, including alias names, pathway, molecular weight, chemical structure, chemical formula, mass-to-charge ratios and SMILES [36] formula. Metabolomics expression data are plotted as a bar chart for easy interpretation.
Trait card
The Trait Card pages (Fig. 5c) contain information about the trait name, the QTL regions identified on each chromosome, and the genes overlapping each QTL region. Information about SNPs, insertions and deletions is also shown in tables.
SNP card
The SNP Card (Fig. 5d) shows the predicted SNPs with their reference bases, chromosomal positions and consensus bases in a table. It also lists the QTL traits in which each SNP falls and the genes whose gene model coordinates the SNP overlaps.
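The overlap between a SNP and gene model coordinates can be expressed as a simple interval test. The sketch below is illustrative only: the `GeneModel` record, its field names and the 1-based inclusive coordinate convention are assumptions, not details taken from KBCommons.

```python
from typing import NamedTuple


class GeneModel(NamedTuple):
    """Hypothetical gene model record with 1-based, inclusive coordinates."""
    name: str
    chromosome: str
    start: int
    end: int


def genes_containing_snp(chromosome, position, gene_models):
    """Return the names of gene models whose span includes the SNP position."""
    return [
        gene.name
        for gene in gene_models
        if gene.chromosome == chromosome and gene.start <= position <= gene.end
    ]
```

The same test, applied to QTL regions instead of gene models, would give the traits in which a SNP falls.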
KBCommons browse options
Differential expression
The Differential Expression section provides a set of visualization tools for comparing transcriptomics results from Cuffdiff [26], VOOM [27] and edgeR [28]. Results can be filtered by p-value, q-value, fold change and gene regulation type (down-regulated, up-regulated or both). The section has six tabs: Gene Lists, Venn Diagram, Volcano Plot, Function Analysis, Pathway Analysis and Gene Modules. The Gene Lists tab (Fig. 6a) shows a table of genes with their p-values, fold changes and links to the Gene Card page. The Venn Diagram tab (Fig. 6b) visualizes the overlap of differentially expressed genes across experimental conditions and allows users to list and download the names of all genes in an overlapping set. The Volcano Plot tab (Fig. 6c) shows down-regulated or up-regulated genes in a scatter chart of log fold change against q-value. The Function Analysis tab (Fig. 6d) shows the distributions of transcription factor gene families and of protein families as pie charts and lists all gene families with their percentages. The Pathway Analysis tab (Fig. 6e) categorizes KEGG pathways and lists genes under their corresponding pathways. The Gene Modules tab (Fig. 6f) shows correlation patterns in gene expression data identified by weighted correlation network analysis (WGCNA) [37]. All gene names under these six tabs link to the corresponding Gene Card pages, so that gene information can be retrieved easily.
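The filtering criteria named above (p-value, q-value, fold change and regulation type) can be sketched as a single function. This is a hedged example: the record keys `p_value`, `q_value` and `log2_fold_change`, the default thresholds, and the function name are assumptions chosen for illustration and do not reflect the KBCommons schema.

```python
def filter_differential_expression(
    records, max_p=0.05, max_q=0.05, min_abs_log2_fc=1.0, regulation="both"
):
    """Keep records passing significance, effect-size and direction filters."""
    if regulation not in {"up", "down", "both"}:
        raise ValueError("regulation must be 'up', 'down' or 'both'")
    selected = []
    for record in records:
        log2_fc = record["log2_fold_change"]
        if record["p_value"] > max_p or record["q_value"] > max_q:
            continue  # not statistically significant
        if abs(log2_fc) < min_abs_log2_fc:
            continue  # change too small to report
        if regulation == "up" and log2_fc <= 0:
            continue
        if regulation == "down" and log2_fc >= 0:
            continue
        selected.append(record)
    return selected
```

The gene names of the filtered records for two conditions could then be intersected as Python sets to obtain the overlapping set shown in a Venn diagram.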
KBCommons tools options
Pathway viewer
The Pathway Viewer (Fig. 7a) shows KEGG pathways for a list of genes or a list of metabolites. It offers two ways to display a pathway: viewing pathways that contain specific compounds or genes, and viewing an existing pathway. Users can download both the pathways mapped to genes and the genes mapped to pathways.
Motif prediction and web logo
The Motif Sampler [38, 39] tool (Fig. 7b) predicts motifs, which indicate domains or conserved consensus sequences, across multiple protein or gene sequences, and makes it easy to generate a web logo for the sequences. The predicted motifs and their ranking scores are shown in tables. The motifs are visualized as a web logo, a graphical representation of a multiple sequence alignment of nucleic acids or amino acids.
Sequence similarity and phylogeny
The BLAST [40] tool and ClustalW2 [41] are included in KBCommons for pairwise sequence search and multiple sequence alignment, respectively. Both tools accept customized parameters and sequence information such as genome, CDS, cDNA and protein sequences as input. The BLAST Result page lists the hits, starting with the best match, together with the expected number of chance alignments. The Phylogeny Tool (Fig. 7c) generates a tree diagram that represents the evolutionary relationships among multiple sequences, using either the neighbor-joining (NJ) method [42] or the unweighted pair group method with arithmetic mean (UPGMA) [43].
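To make the UPGMA option concrete, the following sketch builds a tree topology from a pairwise distance table by repeatedly merging the two closest clusters and averaging distances weighted by cluster size. It is a minimal illustration, not the Phylogeny Tool's code: the function name `upgma`, the input layout and the nested-tuple output are assumptions, and branch lengths are omitted for brevity.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster labels by UPGMA and return the topology as nested tuples.

    names: list of sequence labels.
    distances: dict mapping (label_a, label_b) pairs to a distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    dist = {frozenset(pair): value for pair, value in distances.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted average distance from the merged cluster to 'other'.
            dist[frozenset((merged, other))] = (
                sizes[a] * dist[frozenset((a, other))]
                + sizes[b] * dist[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))
```

In practice the distances would be derived from a multiple sequence alignment; here they are supplied directly by the caller.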
Scatter plot
The Scatter Plot tool (Fig. 7d) retrieves all available expression datasets and their corresponding experimental conditions or replicates, and then visualizes the correlation of gene expression between two chosen experimental conditions on a scatter chart. Points that deviate from the diagonal represent genes whose expression differs between the two conditions, which makes such genes easy to detect. Moving the cursor over a data point displays its expression value.
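The two quantities behind such a plot, the overall correlation between conditions and each gene's deviation from the diagonal, can be computed as in the sketch below. The function names and the threshold parameter are assumptions for illustration; the text does not state how the tool quantifies deviation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def off_diagonal_genes(expression, threshold):
    """Return (gene, signed distance) for points far from the line y = x.

    expression: dict mapping gene name to (value_condition_1, value_condition_2).
    The signed perpendicular distance to the diagonal is (y - x) / sqrt(2).
    """
    hits = []
    for gene, (x, y) in expression.items():
        distance = (y - x) / math.sqrt(2)
        if abs(distance) >= threshold:
            hits.append((gene, distance))
    # Largest deviations first.
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)
```

A positive distance means higher expression in the second condition, and a negative distance means higher expression in the first.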
Heatmap and hierarchical clustering
The Heatmap and Hierarchical Clustering tool (Fig. 7e) displays a heat map representing the expression levels of genes across multiple experimental conditions. Users enter a list of gene names and experimental conditions to create the heat map, and the genes are clustered according to their expression values across those conditions. The heat map can be saved as an image. Users can zoom in and out either by clicking the zoom in/out buttons or by selecting a region of interest in the heat map.
Principal component analysis (PCA)
The PCA tool (Fig. 7f) clusters and visualizes samples grouped by cancer cell line type by reducing the multi-dimensional gene expression data to three dimensions. It projects the whole set, or a user-chosen subset, of the gene expression data onto three principal components, each of which can be viewed as a gene-like pattern of expression across the samples. The PCA plots are implemented with Plotly [44], which generates a 3D point clustering chart. The coordinates represent the first three principal components, which have the largest possible variance, and highlight the most similar and most different cancer cell lines by their closeness and distance.
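The projection onto three principal components can be illustrated with a dependency-free sketch. The text does not describe how KBCommons computes the components; the example below uses power iteration with deflation on the covariance matrix purely for readability, and the function name `principal_components` is hypothetical. A production analysis would normally rely on a numerical library.

```python
import math


def principal_components(samples, n_components=3, iterations=500):
    """Return (variances, scores) for the leading principal components.

    samples: list of rows, one per sample, each a list of expression values.
    """
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    variances, axes = [], []
    for _component in range(n_components):
        # Arbitrary non-zero start vector, normalized to unit length.
        vec = [float(k + 1) for k in range(d)]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        for _step in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break  # no variance left in the deflated matrix
            vec = [x / norm for x in nxt]
        value = sum(
            vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
        )
        variances.append(value)
        axes.append(vec)
        # Deflation: remove the component just found from the covariance.
        cov = [
            [cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
            for i in range(d)
        ]
    scores = [
        [sum(c * a for c, a in zip(row, axis)) for axis in axes]
        for row in centered
    ]
    return variances, scores
```

Each row of `scores` gives the three coordinates of one sample, which is the form a 3D scatter chart expects.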
Data analytics
We have implemented six high-throughput cloud-based bioinformatics data analysis workflows in KBCommons: the RNA-Seq analysis workflow (Fig. 8a), the PGen [10] workflow (Fig. 8b), the FastQC Quality Check workflow (Fig. 8c), the Alignment workflow (Fig. 8d), the Copy Number Variation (CNV) workflow (Fig. 8e) and the Methylation workflow (Fig. 8f). All of these workflows are managed by the Pegasus Workflow Management System (WMS) [45] and run on XSEDE [13] HPC resources using the SoyKB and KBCommons Gateway Analytics allocations.
The RNA-Seq analysis workflow quantifies gene expression from RNA-Seq transcriptomics data and performs statistical analysis to discover differentially expressed genes or isoforms between experimental groups or conditions.
The PGen workflow allows users to identify SNPs and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way.
The FastQC workflow runs quality control checks on raw NGS data from high-throughput sequencing projects, to ensure that the data contain no problems or biases that may affect downstream analysis and use.
The Alignment workflow aligns NGS data or RNA-Seq reads to a reference genome. The outputs are files in ‘BAM’ format.
The Copy Number Variation workflow performs efficient detection of CNVs, in the form of gains and losses, from NGS reads. This workflow requires the user to input a reference sequence and one or more sample or condition sequences, which should be in ‘BAM’ format.
The Methylation workflow analyzes high-throughput NGS bisulfite sequencing reads to estimate the methylation level of every cytosine site. Other methylation analyses, such as identifying hypo-methylated regions (HMRs), hyper-methylated regions (HyperMRs) and differentially methylated regions (DMRs) between two methylomes, can also be performed with this workflow.
Data download
The Data Download feature (Fig. 9) gives users an easy way to download data for a gene list of interest. Users choose the genome version and the type of data for their gene list. The data currently available for bulk download are chromosome coordinates for genes, exons and UTRs; CDS, cDNA and protein sequences; and Pfam, Panther, gene family and function descriptions.