Soybean Knowledge Base (SoyKB): a web resource for soybean translational genomics
- Trupti Joshi1, 2, 3, 4,
- Kapil Patil1, 2,
- Michael R Fitzpatrick1, 2,
- Levi D Franklin1, 2,
- Qiuming Yao1, 2,
- Jeffrey R Cook1, 2,
- Zheng Wang1,
- Marc Libault2, 3, 5,
- Laurent Brechenmacher2, 3, 5,
- Babu Valliyodan3, 5,
- Xiaolei Wu3, 5,
- Jianlin Cheng1, 2, 3, 4,
- Gary Stacey2, 3, 5,
- Henry T Nguyen2, 3, 5 and
- Dong Xu1, 2, 3, 4
© Joshi et al.; licensee BioMed Central Ltd. 2012
Published: 17 January 2012
Soybean Knowledge Base (SoyKB) is a comprehensive web resource for soybean translational genomics. SoyKB is designed to handle the management and integration of soybean genomics, transcriptomics, proteomics and metabolomics data along with annotation of gene function and biological pathways. It contains information on four entities, namely genes, microRNAs, metabolites and single nucleotide polymorphisms (SNPs).
SoyKB has many useful tools such as Affymetrix probe ID search, gene family search, multiple gene/metabolite search supporting co-expression analysis, and protein 3D structure viewer as well as download and upload capacity for experimental data and annotations. It has four tiers of registration, which control different levels of access to public and private data. It allows users of certain levels to share their expertise by adding comments to the data. It has a user-friendly web interface together with genome browser and pathway viewer, which display data in an intuitive manner to the soybean researchers, producers and consumers.
SoyKB addresses the increasing need of the soybean research community to have a one-stop-shop functional and translational omics web resource for information retrieval and analysis in a user-friendly way. SoyKB can be publicly accessed at http://soykb.org/.
A hallmark of modern biology is the tremendous amount of complex omics data, which requires large-scale data management, comprehensive computational analyses, fast retrieval and efficient integration for better understanding of the data and more effective hypothesis generation. Such infrastructure has already been developed for several model organisms, such as TAIR for A. thaliana, WormBase for C. elegans, MGD for M. musculus, SGD for S. cerevisiae, FlyBase for D. melanogaster, Oryzabase for O. sativa and Gramene for grasses. There are a number of soybean-specific databases such as Phytozome, SoyBase, Soybean Genome Database and Soybean Genomics and Microarray Database. However, these databases do not contain comprehensive large-scale data for soybean. Since the newly sequenced G. max genome became available in 2010, the focus of soybean research has shifted towards performing genome-scale experiments, leading to a deluge of biological data. Labs working on soybean have generated an overwhelming amount of high-throughput data, including transcriptomics, proteomics and metabolomics data. These data can benefit the entire research community as well as soybean producers, consumers and breeders if compiled, integrated and utilized in a novel and comprehensive way.
Motivated by this emerging need that cannot be addressed by existing soybean databases, we conceptualized and developed Soybean Knowledge Base (SoyKB). SoyKB was designed in a modular fashion with an easily expandable architecture, making it feasible to accommodate new requirements for years to come. It seamlessly integrates the biological data for genes/proteins, microRNAs (miRNAs), metabolites and SNPs using a unified framework of genome visualization, biological function annotation, and pathway information. It provides users with a web portal to bring their data into SoyKB and compare them with the large inventory of public data in SoyKB. Many SoyKB entries are also linked to other soybean databases such as SoyBase to allow easy and seamless navigation between the two.
Database structure, design and implementation
1. MySQL database module
The site uses a MySQL database to store large amounts of experimental data and their annotations or analysis results. Through various fast search capacities and tools in SoyKB, the search result is presented in an organized manner that allows for further analysis. This database module incorporates and integrates all the soybean genomics and experimental omics data from various experiments. It is designed to contain information on the four entities namely genes/proteins, miRNAs, metabolites and SNPs.
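As a rough illustration of how the four entities could be laid out relationally, here is a minimal sketch. The table and column names are hypothetical and are not taken from SoyKB's actual MySQL schema, and Python's built-in `sqlite3` module stands in for MySQL so that the example runs without a database server.

```python
import sqlite3

# Hypothetical, simplified schema: the table and column names are illustrative
# assumptions, not SoyKB's actual MySQL layout. sqlite3 stands in for MySQL
# so the sketch runs without a database server.
SCHEMA = """
CREATE TABLE gene (
    gene_id    TEXT PRIMARY KEY,
    chromosome TEXT,
    start_pos  INTEGER,
    end_pos    INTEGER
);
CREATE TABLE mirna (
    mirna_id        TEXT PRIMARY KEY,
    mature_sequence TEXT
);
CREATE TABLE metabolite (
    metabolite_id TEXT PRIMARY KEY,
    formula       TEXT
);
CREATE TABLE snp (
    snp_id         TEXT PRIMARY KEY,
    chromosome     TEXT,
    position       INTEGER,
    ref_base       TEXT,
    consensus_base TEXT
);
"""


def build_database():
    """Create an in-memory database holding one table per entity type."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def entity_tables(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = build_database()
# Placeholder identifier and coordinates, used only to show an insert and a lookup.
conn.execute("INSERT INTO gene VALUES (?, ?, ?, ?)", ("gene_A", "chr1", 100, 500))
print(entity_tables(conn))
print(conn.execute("SELECT * FROM gene WHERE gene_id = ?", ("gene_A",)).fetchone())
```

Keeping one table per entity, keyed by its identifier, is one way such a design can stay modular: experimental datasets can then reference an entity by ID without duplicating its annotation.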
2. Web interface module
3. Genome browser module
All the genomic data in SoyKB have been deposited into a genome browser, which is set up locally for soybean utilizing the architecture provided by UCSC. The module allows users to visualize the gene models and their supporting evidence, SNPs and other experimental data such as gene expression profiles of RNA-Seq, microarray and small RNA. Users can visualize the entire chromosome in a single view to help understand the overall picture. The browser also allows users to zoom in and out to focus on regions of their interest and gives the users the flexibility to load or hide any experimental tracks.
4. Pathway integration module
This module integrates data from various experimental conditions and portrays the information on pathways to highlight expressed genes/proteins and metabolites based on the selected microarray, RNA-Seq, proteomics, and metabolomics data.
The data in SoyKB come from multiple sources. Many of the data incorporated in SoyKB are public and accessible to all users without login. Currently, SoyKB contains information about 75,778 gene entities, 129 miRNAs and 959 annotated metabolites for the Williams 82 cultivar of G. max. It also has information regarding 7947 SNPs between cultivars Williams 82 and Forrest as well as 2631 SNPs between Magellan and PI567516C. The gene models, genomic sequences and functional annotation information were acquired from the Gmax 1.0 release of the soybean genome from Phytozome. The gene models contain sequence-based evidence from EST, 5' RATE (Robust analysis of 5'-transcript ends) and full-length cDNA experiments.
SoyKB has many microarray experimental datasets for public access under 99 stress conditions and 25 tissue types acquired from NCBI GEO and ArrayExpress, in addition to 7 leaf and root tissue types and time-course data generated by our collaborators, currently only available for private access. The data for private access (as requested by our collaborators or the submitters) are password protected until ready for public access. The repository also has experimental data for 28 Illumina RNA-Seq experiments covering various tissue types and time points, all available for public access. Proteomics datasets are publicly available for seeds, roots and root hairs for multiple time points, conditions and replications. The metabolomics datasets came from the SoyMetDB database and have been fully incorporated in SoyKB.
SoyKB also hosts data regarding 129 miRNAs and their expression abundances from five small RNA tissue libraries including root, nodule, flower, seed and stripped root. It also has a set of 7947 SNPs and another set of 2631 SNPs (Nguyen lab, unpublished) available for public and private access, respectively. The pathway information was acquired from KEGG, GeneBins and MapMan.
Access and retrieval
1. Website browsing
Each entity in SoyKB has a dedicated entity card page containing all information associated with that entity in the database.
i. Gene card
ii. miRNA card
The miRNA card stores information about the experimental or predicted miRNAs, mature miRNA sequence, miRNA family, links to corresponding miRBase accession ID and family, expression abundance in small RNA libraries, and predicted target genes.
iii. Metabolite card
The metabolite card page provides users with information about metabolites including alias names, mass-to-charge ratios, retention times, chemical formula, chemical structure, molecular weight, links to the pathway viewer and Simplified Molecular Input Line Entry Specification (SMILES) formula. It also provides expression data from GCMS-polar, GCMS-nonpolar and LCMS datasets plotted as bar graphs for easy visualization.
iv. SNP card
The SNP card includes information about the predicted SNPs, their chromosomal positions, reference bases, consensus bases, read quality, and sequencing depth along with other quality scores. It also lists any genes that the SNP overlaps, that is, genes whose model coordinates contain the SNP position.
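The SNP-to-gene overlap described above amounts to an interval containment check. The following is a minimal sketch under stated assumptions: the `GeneModel` record, the function name and the example identifiers and coordinates are all hypothetical, and coordinates are assumed to be inclusive on both ends. A simple linear scan is used here; nothing is implied about how SoyKB implements the lookup.

```python
from typing import List, NamedTuple


class GeneModel(NamedTuple):
    """Hypothetical gene model record with inclusive start/end coordinates."""
    gene_id: str
    chromosome: str
    start: int
    end: int


def genes_containing_snp(chromosome: str, position: int,
                         gene_models: List[GeneModel]) -> List[str]:
    """Return IDs of gene models on the same chromosome whose span contains the SNP."""
    return [
        gene.gene_id
        for gene in gene_models
        if gene.chromosome == chromosome and gene.start <= position <= gene.end
    ]


# Placeholder gene models, invented purely for illustration.
genes = [
    GeneModel("geneA", "chr1", 100, 500),
    GeneModel("geneB", "chr1", 450, 900),
    GeneModel("geneC", "chr2", 100, 500),
]

print(genes_containing_snp("chr1", 475, genes))
print(genes_containing_snp("chr1", 950, genes))
```

For genome-scale data, a sorted index or an interval tree would scale better than a linear scan; the scan is kept here only to make the containment rule explicit.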
2. Querying the database
3. Bulk downloads
Users can download data for their gene lists of interest by using the download capacity on SoyKB. The chromosome coordinates for genes, exons and UTR; CDS, cDNA and protein sequences; Pfam, Panther and KOG domains; and microarray, transcriptomics, proteomics, EST, 5'RATE and full-length cDNA are some of the data currently available for bulk download.
4. Data submission
To expand the data repository, SoyKB also provides interested users the capacity to contribute their data to SoyKB and choose when to allow that data for public access. This can be done using the "Upload Data File" option under the "Data Files" menu on the top menu bar. Based on the type of data chosen for submission, the selected option will specify the accepted formats for each data type. Accepted file types include .txt, .xls, .xlsx and .csv. Data submissions undergo internal evaluations to look for inconsistency in the data format and any missing or unreliable information, before getting uploaded to the database.
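The kind of pre-upload check described above can be sketched as follows. Only the list of accepted extensions comes from the text; the function names and the specific rules (every row must have as many fields as the header, and no field may be blank) are illustrative assumptions, not SoyKB's actual evaluation procedure.

```python
import csv
import io
from pathlib import Path

# Accepted file types as listed in the text above.
ACCEPTED_EXTENSIONS = {".txt", ".xls", ".xlsx", ".csv"}


def has_accepted_extension(filename: str) -> bool:
    """Check the file extension case-insensitively against the accepted types."""
    return Path(filename).suffix.lower() in ACCEPTED_EXTENSIONS


def find_format_problems(csv_text: str) -> list:
    """Report rows with a wrong field count or blank values (hypothetical rules)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    # Line numbers are 1-based and the header occupies line 1.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
        elif any(cell.strip() == "" for cell in row):
            problems.append(f"line {line_no}: missing value")
    return problems


sample = "gene,value\ng1,1.0\ng2,\ng3\n"
print(has_accepted_extension("expression.CSV"))
print(find_format_problems(sample))
```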
SoyKB contains Java applications for displaying a pathway with genes and metabolites as well as Flash-powered experimental data charts and 3D protein structures. SoyKB also provides users with an array of comprehensive analysis tools including co-expression analysis for multiple genes/metabolites, Affymetrix probe ID mapper, and gene family browser.
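To make the idea of co-expression analysis concrete, the sketch below scores how similarly two expression profiles vary across the same set of conditions. The text does not state which similarity measure SoyKB uses; the Pearson correlation coefficient is assumed here only because it is a common choice, and the function name and example values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / sqrt(var_x * var_y)


# Invented expression values for two entities measured under four conditions.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]
print(pearson(profile_a, profile_b))
```

A coefficient near 1 suggests the two entities rise and fall together across conditions, a value near -1 suggests opposite behaviour, and a value near 0 suggests no linear relationship.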
1. Pathway viewer
The pathway viewer is targeted towards integrating data from various experimental conditions and portraying the data on pathways to highlight expressed genes and metabolites based on the selected data. It runs as a Java application that uses standard Apache web server technologies such as PHP version 5 to manage web-based data input and query.
2. Protein 3D structures
3. Co-expression analysis
4. Affymetrix probe ID mapper
Many microarray experiments provide expression values for a list of probes instead of genes. We have developed the Affymetrix probe ID mapper tool to allow researchers to map probes to genes automatically using the most up-to-date gene models. The gene lists identified are all linked to the respective gene card pages for easy access to other information about the genes.
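The probe-to-gene mapping can be pictured as a lookup against an annotation table. This is a minimal sketch, assuming the annotation is available as a dictionary from probe ID to a set of gene IDs; the function name, the probe IDs and the gene IDs are placeholders, not real Affymetrix or soybean identifiers.

```python
def map_probes_to_genes(probe_ids, probe_to_genes):
    """Split probes into those with annotated genes and those without.

    probe_to_genes is a hypothetical annotation table: probe ID -> set of gene IDs.
    """
    mapped = {}
    unmapped = []
    for probe_id in probe_ids:
        gene_ids = probe_to_genes.get(probe_id)
        if gene_ids:
            # Sort so that the output order is deterministic.
            mapped[probe_id] = sorted(gene_ids)
        else:
            unmapped.append(probe_id)
    return mapped, unmapped


# Placeholder annotation table; a probe may match more than one gene model.
annotation = {
    "probe_1": {"geneA"},
    "probe_2": {"geneC", "geneB"},
}

mapped, unmapped = map_probes_to_genes(["probe_1", "probe_2", "probe_3"], annotation)
print(mapped)
print(unmapped)
```

Reporting unmapped probes separately, rather than silently dropping them, lets a user see which parts of a probe list have no counterpart in the current gene models.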
5. Gene family browser
SoyKB also allows users to browse entire gene families by using the "Browse" feature. "Gene Families" include the transcription factor families predicted in the SoyDB website, the cytochrome P450 gene families identified by Guttikonda et al., and a few other gene families. Selecting a gene family provides a list of all genes known to belong to this gene family with individual genes linked to their gene card pages.
6. Blast sequence similarity
7. Motif prediction and web logo
An application example
Many other new capacities are also currently under development, including tools to handle epigenomics methylation data, a breeder's toolkit with QTL and trait information, comparison against G. soja, and tools for phenotype prediction using omics data. The SoyKB development team is actively working towards incorporating more datasets and making them available for public access. We will set up an FTP site to give users access to the entire datasets. We will also provide mirror sites for more stable and faster access to SoyKB around the world.
The authors wish to thank all researchers who have contributed data to the database. The development has been supported by Missouri Soybean Merchandising Council (MSMC #306) to DX, JC, HTN, GS; United Soybean Board (project 8236) to HTN, GS and DX; National Science Foundation (#DBI-0421620) to GS, DX, JC; National Institutes of Health (grant #1R01GM093123) to JC for protein structure prediction; Department of Energy (DE-SC0004898) to GS, DX, JC; and the National Center for Soybean Biotechnology.
This article has been published as part of BMC Genomics Volume 13 Supplement 1, 2012: Selected articles from the Tenth Asia Pacific Bioinformatics Conference (APBC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/13?issue=S1.
- Huala E, Dickerman A, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang J, Huang W, Mueller L, Bhattacharyya D, Bhaya D, Sobral B, Beavis B, Somerville C, Rhee SY: The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 2001, 29 (1): 102-5. 10.1093/nar/29.1.102.
- Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen CK, Chen WJ, Cunningham F, Davis P, Kenny E, Kishore R, Lawson D, Lee R, Muller HM, Nakamura C, Pai S, Ozersky P, Petcherski A, Rogers A, Sabo A, Schwarz EM, Auken KV, Wang Q, Durbin R, Spieth J, Sternberg PW, Stein LD: WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res. 2005, 33: D383-D389.
- Blake JA, Bult CJ, Kadin JA, Richardson JE, Eppig JT, Mouse Genome Database Group: The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res. 2011, 39 (Suppl 1): D842-D848.
- Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26 (1): 73-80. 10.1093/nar/26.1.73.
- Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, Zhang H, FlyBase Consortium: FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. 2009, 37: D555-D559. 10.1093/nar/gkn788.
- Kurata N, Yamazaki Y: Oryzabase. An integrated biological and genome information database for rice. Plant Physiol. 2006, 140: 12-17.
- Ware DH, Jaiswal P, Ni J, Yap IV, Pan X, Clark KY, Teytelman L, Schmidt SC, Zhao W, Chang K, Cartinhour S, Stein LD, McCouch SR: Gramene, a tool for grass genomics. Plant Physiol. 2002, 130: 1606-1613. 10.1104/pp.015248.
- Phytozome. [http://www.phytozome.net/soybean]
- Grant D, Nelson RT, Cannon SB, Shoemaker RC: SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Res. 2010, 38: D843-D846. 10.1093/nar/gkp798.
- Shultz JL, Kurunam D, Shopinski K, Iqbal MJ, Kazi S, Zobrist K, Bashir R, Yaegashi S, Lavu N, Afzal AJ, Yesudas CR, Kassem MA, Wu C, Zhang HB, Town CD, Meksem K, Lightfoot DA: The Soybean Genome Database (SoyGD): a browser for display of duplicated, polyploid, regions and sequence tagged sites on the integrated physical and genetic maps of Glycine max. Nucleic Acids Res. 2006, 34 (Database issue): D758-D765.
- Alkharouf NW, Matthews BF: SGMD: the Soybean Genomics and Microarray Database. Nucleic Acids Res. 2004, 32 (Suppl 1): D398-D400.
- Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du J, Tian Z, Zhu L, Gill N, Joshi T, Libault M, Sethuraman A, Zhang XC, Shinozaki K, Nguyen HT, Wing RA, Cregan P, Specht J, Grimwood J, Rokhsar D, Stacey G, Shoemaker RC, Jackson SA: Genome sequence of the palaeopolyploid soybean. Nature. 2010, 463: 178-83. 10.1038/nature08670.
- Apache. [http://httpd.apache.org]
- PHP. [http://www.php.net]
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30 (1): 207-210. 10.1093/nar/30.1.207.
- Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, Kurbatova N, Lukk M, Malone J, Mani R, Pilicheva E, Rustici G, Sharma A, Williams E, Adamusiak T, Brandizi M, Sklyar N, Brazma A: ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 2011, 39 (Database issue): D1002-D1004.
- Joshi T, Yao Q, Franklin LD, Brechenmacher L, Valliyodan B, Stacey G, Nguyen H, Xu D: SoyMetDB: The Soybean Metabolome Database. Proceedings of IEEE International Conference on Bioinformatics & Biomedicine (BIBM 2010). 2010, Hong Kong, 203-208.
- Joshi T, Yan Z, Libault M, Jeong DH, Park S, Green PJ, Sherrier JD, Farmer A, May G, Meyers BC, Stacey G: Prediction of novel miRNAs and associated target genes in Glycine max. BMC Bioinformatics. 2010, 11 (Suppl 1): S14-10.1186/1471-2105-11-S1-S14.
- Wu X, Ren C, Joshi T, Vuong T, Xu D, Nguyen HT: SNP discovery by high-throughput sequencing in soybean. BMC Genomics. 2010, 11: 469-10.1186/1471-2164-11-469.
- Kanehisa M, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.
- Goffard N, Weiller G: GeneBins: a database for classifying gene expression data, with application to plant genome arrays. BMC Bioinformatics. 2007, 8: 87-10.1186/1471-2105-8-87.
- Thimm O, Bläsing O, Gibon Y, Nagel A, Meyer S, Krüger P, Selbig J, Müller LA, Rhee SY, Stitt M: MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J. 2004, 37 (6): 914-39. 10.1111/j.1365-313X.2004.02016.x.
- Wang Z, Eickholt J, Cheng J: MULTICOM: a multi-level combination approach to protein structure prediction and its assessment in CASP8. Bioinformatics. 2010, 26 (7): 882-888. 10.1093/bioinformatics/btq058.
- Jmol. [http://jmol.sourceforge.net]
- Wang Z, Libault M, Joshi T, Valliyodan B, Nguyen H, Xu D, Stacey G, Cheng J: SoyDB: a knowledge database of soybean transcription factors. BMC Plant Biol. 2010, 10: 14-10.1186/1471-2229-10-14.
- Guttikonda SK, Joshi T, Bisht NC, Chen H, An YQ, Pandey S, Xu D, Yu O: Whole genome co-expression analysis of soybean cytochrome P450 genes identifies nodulation-specific P450 monooxygenases. BMC Plant Biol. 2010, 10: 243-10.1186/1471-2229-10-243.
- Thijs G, Moreau Y, De Smet F, Mathys J, Lescot M, Rombauts S, Rouzé P, De Moor B, Marchal K: INCLUSive: INtegrated Clustering, Upstream sequence retrieval and motif Sampling. Bioinformatics. 2002, 18 (2): 331-332. 10.1093/bioinformatics/18.2.331.
- Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouzé P, Moreau Y: A higher order background model improves the detection of regulatory elements by Gibbs sampling. Bioinformatics. 2001, 17 (12): 1113-1122. 10.1093/bioinformatics/17.12.1113.
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.