SoyXpress: A database for exploring the soybean transcriptome
© Cheng and Strömvik; licensee BioMed Central Ltd. 2008
Received: 1 April 2008
Accepted: 1 August 2008
Published: 1 August 2008
Experiments using whole transcriptome microarrays produce massive amounts of data. To gain a comprehensive understanding of this gene expression data, it needs to be integrated with other available information such as gene function and metabolic pathways. Bioinformatics tools are essential to handle, organize and interpret the results. To date, no database provides whole transcriptome analysis capabilities integrated with terms describing biological functions for soybean (Glycine max (L.) Merr.). To this end we have developed SoyXpress, a relational database with a suite of web interfaces that allow users to easily retrieve data and results of the microarray experiment, with cross-referenced annotations of expressed sequence tags (EST) and hyperlinks to external public databases. This environment makes it possible to explore differences in gene expression, if any, between, for instance, transgenic and non-transgenic soybean cultivars, and to interpret the results based on gene functional annotations to determine any changes that could potentially alter biological processes.
SoyXpress is a database designed for exploring the soybean transcriptome. Currently SoyXpress houses 380,095 soybean Expressed Sequence Tags (EST), linked with metabolic pathways, Gene Ontology terms, SwissProt identifiers and Affymetrix gene expression data. Array data is presently available from an experiment profiling global gene expression of three conventional and two genetically engineered soybean cultivars. The microarray data is linked with the sequence data, for maximum knowledge extraction. SoyXpress is implemented in MySQL and uses a Perl CGI interface.
SoyXpress is designed for the purpose of exploring potential transcriptome differences in different plant genotypes, including genetically modified crops. Soybean EST sequences, microarray and pathway data as well as searchable and browsable gene ontology are integrated and presented. SoyXpress is publicly accessible at http://soyxpress.agrenv.mcgill.ca.
Microarrays are most often developed from transcript information in the form of EST (Expressed Sequence Tag) sequences. The annotation of those sequences with information on genetics, homology, functions, metabolic regulation and toxicology is key to unlocking the biological meaning of the microarray results. Since every microarray hybridization experiment produces massive amounts of data, handling, processing and analyzing the data become challenging tasks, and the application of bioinformatics is essential. It is desirable to store and organize the results in a database, which needs to be extensible and flexible so that data from different microarray experiments can be compared. The Stanford Microarray Database (SMD) is an example of such a resource, developed primarily with Stanford researchers and their collaborators in mind. Many other communities also develop resource databases, building the tools and functions to suit their organism, for example BarleyBase/PLEXdb for barley genomics, MELOGEN for melon genomics and the Tomato Expression Database (TED).
A soybean gene expression database has been published: SGMD (the Soybean Genomics and Microarray Database; please see Availability and requirements for more information). SGMD stores EST and microarray data to explore the interaction of soybean with a major pest, the soybean cyst nematode (SCN). The SGMD web interface provides on-the-fly statistical analysis to compare cDNA microarray data, which consists of around 4,000 spots and 20,000 EST sequences from soybean root libraries.
We developed a new database, SoyXpress, with web tools to retrieve and explore the results of Affymetrix microarray experiments and to link them to other soybean genomic information, in order to help researchers identify changes in gene expression and determine whether these changes alter biological processes in soybean. We designed SoyXpress for exploring the entire soybean transcriptome, integrating Affymetrix gene expression data (37,583 soybean probe sets) with 380,095 ESTs from G. max and G. soja, annotated with metabolic pathways, Gene Ontology terms and SwissProt identifiers for maximum knowledge extraction. Currently, SoyXpress houses array data from 25 chips, comprising a leaf gene expression profiling experiment that includes two transgenic and three conventional (non-transgenic) soybean genotypes. SoyXpress is extensible, and future gene expression experiments will be integrated.
Construction and Content
Schema and implementation: Sequence core tables
Schema and implementation: Sequence annotation tables
Figure 2B shows the annotation section of the database. The SEQ_ACCESSION table links to the BLAST table by using the sequence ID as the query ID for linking to the BLASTX search results. The GenBank accession number from the SEQ_ACCESSION table links to the DFCI (Dana Farber Cancer Institute) Gene Index (formerly TIGR Gene Index) contig information to obtain the corresponding contig ID for the EST sequences (from the table TIGR_GB) and the GO terms associated with each contig (from the table TIGR_GO) (Figure 2A). Other information for the additional 8,936 EST sequences downloaded from NCBI websites is stored in the table GB_ACCESSION, which also links to the BLAST table using the GenBank accession number as the query ID. The Gene Ontology databases include the MySQL tables TERM, TERM_DEFINITION, TERM2TERM and GRAPH_PATH (please see Availability and requirements for more information), which were downloaded and directly reproduced in our database. The BLASTX analysis against SwissProt allowed us to assign protein annotations to 175,910 ESTs (over half of the 318,422 EST sequences). The BLAST table contains the BLASTX search results and links our EST data to their corresponding protein information. Of the 37,637 soybean probe sequences on the Affymetrix GeneChip, we assigned protein annotations to 8,667 sequences. These BLASTX search results are also incorporated into the BLAST table and link to other protein and function annotations. The SwissProt protein names are stored as the hit IDs. Other information about the proteins, such as the protein descriptions, hit scores and e-values (negative exponents), is also stored in the BLAST table. The SwissProt protein IDs link to other functional annotations such as gene ontology (GO terms) and KEGG molecular pathways through the GENE_ANNOTATION and EC_SWISS tables.
The protein descriptions that describe enzymes with appropriate EC (Enzyme Commission) numbers are linked to the KEGG pathways (stored as the tables EC_DEF and EC_MAP) through the EC_SWISS table. EC numbers were assigned to 73,996 EST sequences, so around 23% of the EST sequences correspond to enzymes. By linking the transcript sequence data to SwissProt annotations through the BLASTX search results in the BLAST table, we can map the transcript sequences to their corresponding functional annotations, such as GO terms and KEGG molecular pathways, providing a more comprehensive description of the soybean data.
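The chain of links described above (EST query ID to SwissProt hit in BLAST, SwissProt ID to EC number in EC_SWISS, EC number to KEGG map in EC_MAP) can be illustrated with a minimal sketch. The table names come from the text, but every column name, the record values and the helper functions are assumptions made for illustration, and an in-memory SQLite database stands in for the MySQL server. It is not the SoyXpress schema itself.

```python
import sqlite3


def build_annotation_demo():
    """Create a toy database with hypothetical columns and made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE BLAST (query_id TEXT, hit_id TEXT, evalue_exp INTEGER);
        CREATE TABLE EC_SWISS (swiss_id TEXT, ec_number TEXT);
        CREATE TABLE EC_MAP (ec_number TEXT, map_id TEXT, pathway_name TEXT);
    """)
    # evalue_exp holds the negative exponent of the e-value (120 means 1e-120).
    conn.executemany("INSERT INTO BLAST VALUES (?, ?, ?)", [
        ("EST_0001", "CHS1_DEMO", 120),
        ("EST_0002", "CHS1_DEMO", 45),
        ("EST_0003", "RBCL_DEMO", 200),
    ])
    conn.executemany("INSERT INTO EC_SWISS VALUES (?, ?)", [
        ("CHS1_DEMO", "2.3.1.74"),
        ("RBCL_DEMO", "4.1.1.39"),
    ])
    conn.executemany("INSERT INTO EC_MAP VALUES (?, ?, ?)", [
        ("2.3.1.74", "map00941", "Flavonoid biosynthesis"),
        ("4.1.1.39", "map00710", "Carbon fixation"),
    ])
    return conn


def ests_for_pathway(conn, map_id, min_evalue_exp=0):
    """Return (EST, SwissProt hit, EC number) rows mapped to one pathway."""
    query = """
        SELECT DISTINCT b.query_id, b.hit_id, s.ec_number
        FROM BLAST AS b
        JOIN EC_SWISS AS s ON s.swiss_id = b.hit_id
        JOIN EC_MAP AS m ON m.ec_number = s.ec_number
        WHERE m.map_id = ? AND b.evalue_exp >= ?
        ORDER BY b.query_id
    """
    return conn.execute(query, (map_id, min_evalue_exp)).fetchall()


if __name__ == "__main__":
    demo = build_annotation_demo()
    for row in ests_for_pathway(demo, "map00941"):
        print(row)
```

Storing the e-value as a negative exponent, as the BLAST table is described as doing, lets a stringency cut-off be expressed as a simple integer comparison (`min_evalue_exp`).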
Schema and implementation: soybean microarray experiment data tables
The section of the database that organizes the microarray data is shown in Figure 2C. Data for the Affymetrix Soybean GeneChip, for example the probe IDs, the sequences of the probes, and the locations of the probes on the chip, are stored in the table CDF_FILE. The whole transcript sequences representing the genes, with the corresponding probe IDs and GenBank accession numbers, are stored in the table PROBE_SEQ. The PROBE_SET table contains the probe IDs, GenBank accession numbers, and the corresponding sequence and clone IDs to map to our soybean EST data, and hence associates the microarray data with the corresponding transcript, protein and functional annotations. The microarray data can also link directly to the BLAST table by using the probe ID as the query ID, to provide biological context for our microarray experiment. The raw data for our microarray experiment are stored in the table CEL_DATA, which contains the information for every chip, such as the chip IDs, probe IDs, and probe intensities. The data processed with three normalization methods (RMA, MAS and dCHIP) are stored in the tables RMA_RESULT, MAS_RESULT and LIWONG_RESULT, respectively. All the raw and processed microarray data is linked to the PROBE_SET table by the probe IDs. For the analyzed results, the EXPERIMENT table describes which chips are used for each pair-wise comparison. The NORMALIZE table describes which normalization method is used in each pair-wise comparison. The microarray results for each pair-wise comparison analyzed by the LIMMA package are stored in the LIMMA_RESULT table, which includes the scores and p-values from the statistical test for each probe in all pair-wise comparisons. The fold change and average intensity for each probe in all pair-wise comparisons are stored in the table FOLD_CHANGE.
All the analyzed microarray results are linked to the PROBE_SET table and hence integrated with the soybean transcript, protein and functional annotations that can provide insight into biological and functional differences between samples.
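As a hedged illustration of how the analyzed results could be retrieved for one pair-wise comparison, the sketch below joins LIMMA_RESULT and FOLD_CHANGE to PROBE_SET on the probe ID. Only the table names are taken from the text. The column names, the use of a log2 scale for the fold change, the thresholds and all records are assumptions, and SQLite is used here in place of MySQL.

```python
import sqlite3


def build_array_demo():
    """Create a toy database with hypothetical columns and made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PROBE_SET (probe_id TEXT PRIMARY KEY, genbank_acc TEXT);
        CREATE TABLE LIMMA_RESULT (comparison_id INTEGER, probe_id TEXT,
                                   p_value REAL);
        CREATE TABLE FOLD_CHANGE (comparison_id INTEGER, probe_id TEXT,
                                  log2_fc REAL, avg_intensity REAL);
    """)
    conn.executemany("INSERT INTO PROBE_SET VALUES (?, ?)", [
        ("P1", "ACC_A1"), ("P2", "ACC_A2"), ("P3", "ACC_A3"),
    ])
    conn.executemany("INSERT INTO LIMMA_RESULT VALUES (?, ?, ?)", [
        (1, "P1", 0.001), (1, "P2", 0.20), (1, "P3", 0.004), (2, "P1", 0.5),
    ])
    conn.executemany("INSERT INTO FOLD_CHANGE VALUES (?, ?, ?, ?)", [
        (1, "P1", 1.8, 9.1), (1, "P2", 2.5, 7.4),
        (1, "P3", -0.3, 10.2), (2, "P1", 0.1, 9.0),
    ])
    return conn


def changed_probes(conn, comparison_id, max_p=0.05, min_abs_log2_fc=1.0):
    """Return probes passing both the p-value and the fold-change cut-off."""
    query = """
        SELECT p.probe_id, p.genbank_acc, f.log2_fc, l.p_value
        FROM PROBE_SET AS p
        JOIN LIMMA_RESULT AS l ON l.probe_id = p.probe_id
        JOIN FOLD_CHANGE AS f ON f.probe_id = p.probe_id
                             AND f.comparison_id = l.comparison_id
        WHERE l.comparison_id = ?
          AND l.p_value < ?
          AND ABS(f.log2_fc) >= ?
        ORDER BY l.p_value
    """
    params = (comparison_id, max_p, min_abs_log2_fc)
    return conn.execute(query, params).fetchall()


if __name__ == "__main__":
    demo = build_array_demo()
    for row in changed_probes(demo, comparison_id=1):
        print(row)
```

Because every result table is keyed by probe ID, the GenBank accession returned here could in turn be used to reach the transcript, protein and functional annotations described in the previous section.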
Sequence and microarray data sources
The sequence data annotated and stored in SoyXpress comprises a total of 380,095 public ESTs from G. max and G. soja. Information on 31,928 tentative consensus (TC) sequences was downloaded from the TIGR (The Institute for Genomic Research) Glycine max Gene Index Project (Release 12.0), now hosted at the Dana Farber Cancer Institute/Computational Biology and Functional Genomics Laboratory at Harvard University (please see Availability and requirements for more information). The microarray data currently available consists of twenty-five raw data files (CEL files) from an experiment using the Affymetrix Soybean GeneChip. These were pre-processed and analyzed by standard methods as previously described. The data consists of five biological replicates of leaf gene expression measurements for each of three conventional and two genetically engineered soybean lines. The microarray data is accessible at the NCBI Gene Expression Omnibus (GEO) under the accession numbers: GSE9374: GSM238030, GSM238031, GSM238032, GSM238033, GSM238034, GSM238036, GSM238038, GSM238039, GSM238041, GSM238043, GSM238047, GSM238048, GSM238049, GSM238050, GSM238051, GSM238052, GSM238053, GSM238054, GSM238055, GSM238056, GSM238057, GSM238058, GSM238059, GSM238060, GSM238061. Microarray chip information (from Affymetrix), raw data and results are stored in SoyXpress, and each probe is linked to the sequence information and meta-data.
Informatics of data generation and quality control
The EST sequences were annotated by command line BLASTX searches against 168,297 SwissProt protein sequences (please see Availability and requirements for more information) to obtain corresponding protein annotations. The SwissProt protein IDs were used to associate the sequences with GO terms, using the file "UniProt GO Annotations" (please see Availability and requirements for more information). Recommended enzyme names and EC numbers were obtained from the Enzyme Nomenclature site (please see Availability and requirements for more information) and also extracted from MeSH (Medical Subject Headings, National Library of Medicine; please see Availability and requirements for more information). Enzyme EC number to SwissProt ID associations were obtained from the ExPASy Enzyme nomenclature database (version 36; please see Availability and requirements for more information). Metabolic and regulatory pathways were downloaded from KEGG (Kyoto Encyclopedia of Genes and Genomes; please see Availability and requirements for more information). Enzyme identities within each pathway were obtained by extracting EC numbers from each of the pathways (downloadable XML files from the ftp KGML/map folders, version 0.6 Mar 2005). EC numbers, pathway names and map numbers were extracted and integrated into the database.
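The extraction of EC numbers, pathway names and map numbers from the KGML files can be sketched as follows. This is not the script used for SoyXpress. It assumes a KGML-like layout in which a `pathway` element carries `name`, `number` and `title` attributes and each enzyme is an `entry` element of type `enzyme` whose `name` attribute lists one or more space-separated `ec:` identifiers. The sample document is constructed for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in a KGML-like layout (an assumption, not a KEGG file).
SAMPLE_KGML = """<pathway name="path:map00941" number="00941"
         title="Flavonoid biosynthesis">
  <entry id="1" name="ec:2.3.1.74" type="enzyme"/>
  <entry id="2" name="ec:5.5.1.6 ec:1.14.11.9" type="enzyme"/>
  <entry id="3" name="cpd:C00509" type="compound"/>
  <entry id="4" name="ec:2.3.1.74" type="enzyme"/>
</pathway>"""


def extract_ec_numbers(kgml_text):
    """Return the map ID, pathway title and unique EC numbers of a pathway."""
    root = ET.fromstring(kgml_text)
    ec_numbers = set()
    for entry in root.iter("entry"):
        if entry.get("type") != "enzyme":
            continue  # skip compounds, linked maps and other entry types
        for token in entry.get("name", "").split():
            if token.startswith("ec:"):
                ec_numbers.add(token[len("ec:"):])
    return {
        "map_id": root.get("name", "").split(":", 1)[-1],
        "title": root.get("title", ""),
        "ec_numbers": sorted(ec_numbers),
    }


if __name__ == "__main__":
    print(extract_ec_numbers(SAMPLE_KGML))
```

Collecting the EC numbers in a set removes enzymes that appear more than once on the same map, so each (EC number, map) pair would be loaded into the database only once.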
Utility and Discussion
SoyXpress was developed with two main types of users in mind: researchers, and regulators with a scientific background. Researchers can explore the annotated sequences in a way that relates to the metabolic pathway or process under study. For instance, a researcher interested in the flavonoid pathway could instantly retrieve all known soybean ESTs that match the enzymes in this pathway. Regulators are asking for tools to help in the assessment of novel crops, be they transgenic or obtained by conventional breeding. With SoyXpress, we have provided such a tool, where differences in global gene expression can be compared between any cultivars or groups of cultivars and where the gene expression is linked to metabolic pathways and to literature resources such as PubMed and TOXLINE. This can help regulators decide whether a novel cultivar is, at the gene expression level, substantially equivalent to conventional cultivars that are generally recognized as safe (GRAS).
There is no other web-based database that makes the soybean transcriptome available. The SGMD is limited to 4,000 genes expressed in root and aims to explore the interactions between the soybean cyst nematode and soybean; it provides only GenBank IDs and BLASTX reports to show the homology of genes and proteins, and no annotations on biological function or metabolic pathways. The draft of the soybean genome sequence was announced in January 2008 (please see Availability and requirements for more information), and it is envisioned that SoyXpress can be a helpful tool for the soybean genome annotation phase. Predicted genes can be compared with the information in SoyXpress, and more reliable annotation can follow.
Planned future developments of SoyXpress include addition of promoter motif information, UTR features and links to the soybean genome sequence. It is also our hope that other groups will want to house their Affymetrix soybean data in SoyXpress, and an online submission protocol is planned.
Our aim was to develop a database with a suite of web interfaces that allow users to easily retrieve data and results of microarray experiments, with cross-referenced annotations of the expressed sequence tags (EST) and hyperlinks to external public databases. The SoyXpress environment is the most comprehensive bioinformatics tool to date for soybean gene expression analysis. It makes it possible to explore differences in gene expression and to interpret the results based on gene functional annotations to determine any changes that could alter biological processes.
Availability and requirements
Project name: SoyXpress: a database for the soybean transcriptome
Project home page: http://soyxpress.agrenv.mcgill.ca/.
Operating system: Platform independent
Programming language: Perl
Other requirements: None
Licence: None required
Any restrictions to use by non-academics: None
SGMD (the Soybean Genomics and Microarray Database): http://psi081.ba.ars.usda.gov/SGMD/Default.htm
MySQL (version 5.0.18): http://www.mysql.com
Perl (version 5.8.6): http://www.perl.com
Perl CGI: http://search.cpan.org/dist/CGI.pm/
DBI and DBD::mysql: http://dev.mysql.com/downloads/dbi.html
Gene Ontology databases: http://www.geneontology.org/GO.downloads.database.shtml
Glycine max Gene Index Project: http://compbio.dfci.harvard.edu/tgi/
UniProt GO Annotations: http://www.geneontology.org/GO.current.annotations.shtml
Enzyme Nomenclature site: http://www.chem.qmul.ac.uk/iubmb/enzyme/
MeSH (Medical Subject Headings, National Library of Medicine): http://www.nlm.nih.gov/mesh/filelist.html
ExPASy Enzyme nomenclature database (version 36): http://ca.expasy.org/enzyme/
KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg/download/ftp.html
SwissProt protein database: http://ca.expasy.org
Gene Ontology: http://amigo.geneontology.org
KEGG database: http://www.genome.jp
DFCI/TIGR Gene Index: http://compbio.dfci.harvard.edu
Draft of the soybean genome sequence: http://www.phytozome.org/soybean
The authors wish to thank Lee Zamparo, Julie Livingstone, Ernest Retzel and Frederic Latour for bioinformatics assistance. We are also grateful to Lee Zamparo for critical reading of the manuscript. This project was funded by the Advanced Foods and Materials Network (AFMNet). We also acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), le Fonds de recherche sur la nature et les technologies (FQRNT) and the Centre Sève for financial support.
- Vodkin LO, Khanna A, Shealy R, Clough S, Gonzalez O, Philip R, Zabala G, Thibaud-Nissen F, Sidarous M, Strömvik M, Shoop E, Schmidt C, Retzel E, Erpelding J, Shoemaker R, Rodriguez-Huete A, Polacco J, Coryell V, Keim P, Gong G, Liu L, Pardinas J, Schweitzer P: Microarrays for global expression constructed with a low redundancy set of 27,500 sequenced cDNAs representing an array of developmental stages and physiological conditions of the soybean plant. BMC Genomics. 2004, 5: 73.
- Demeter J, Beauheim C, Gollub J, Hernandez-Boussard T, Jin H, Maier D, Matese JC, Nitzberg M, Wymore F, Zachariah ZK, Brown PO, Sherlock G, Ball CA: The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Res. 2007, 35 (Database issue): D766-770.
- Wise RP, Caldo RA, Hong L, Shen L, Cannon E, Dickerson JA: BarleyBase/PLEXdb: A Unified Expression Profiling Database for Plants and Plant Pathogens. Methods Mol Biol. 2007, 406: 347-364.
- Gonzalez-Ibeas D, Blanca J, Roig C, González-To M, Picó B, Truniger V, Gómez P, Deleu W, Caño-Delgado A, Arús P, Nuez F, Garcia-Mas J, Puigdomènech P, Aranda MA: MELOGEN: an EST database for melon functional genomics. BMC Genomics. 2007, 8: 306.
- Fei Z, Tang X, Alba R, Giovanni J: Tomato Expression Database (TED): a suite of data presentation and analysis tools. Nucleic Acids Res. 2006, 34: D766-770.
- Alkharouf NW, Matthews BF: SGMD: The soybean genomics and microarray database. Nucleic Acids Research. 2004, 32 (Database issue): D398-D400.
- Firth D: CGIwithR: Facilities for processing web forms using R. Journal of Statistical Software. 2003, 8: 1-8.
- Kumar CG, LeDuc R, Gong G, Roinishivili L, Lewin HA, Liu L: ESTIMA, a tool for EST management in a multi-project environment. BMC Bioinformatics. 2004, 5: 1-10. article 176.
- Quackenbush J, Lian F, Holt I, Pertea G, Upton J: The TIGR Gene Indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Research. 2000, 28: 141-145.
- The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nature Genetics. 2000, 25: 25-29.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology. 1990, 215: 403-410.
- Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A: ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Research. 2003, 31: 3784-3788.
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research. 2006, 34 (Database issue): D354-D357.
- Affymetrix: Data Sheet: GeneChip Soybean Genome Array. Documentation from Affymetrix website. 2005, 1-2. Accessed on August 7, 2007, [http://www.affymetrix.com/support/technical/datasheets/soybean_datasheet.pdf]
- Cheng KC, Beaulieu J, Iquira E, Belzile FJ, Fortin MG, Strömvik MV: Effect of transgenes on global gene expression in soybean is within the natural range of variation of conventional cultivars. Journal of Agricultural and Food Chemistry. 2008, 56: 3057-3067.