EasyGO: Gene Ontology-based annotation and functional enrichment analysis tool for agronomical species
© Zhou and Su; licensee BioMed Central Ltd. 2007
Received: 04 April 2007
Accepted: 24 July 2007
Published: 24 July 2007
Interpreting microarray results is difficult. A handful of tools have recently been developed to meet this need, but almost none of them were designed to support agronomical species.
This paper presents EasyGO, a web server to perform Gene Ontology based functional interpretation on groups of genes or GeneChip probe sets. EasyGO makes a special contribution to the agronomical research community by supporting Affymetrix GeneChips of both crops and farm animals and by providing stronger capabilities for results visualization and user interaction. Currently it supports 11 agronomical plants, 3 farm animals, and the model plant Arabidopsis. The authors demonstrated EasyGO's ability to uncover hidden knowledge by analyzing a group of probe sets with similar expression profiles.
EasyGO is a good tool for helping biologists and agricultural scientists to discover enriched biological knowledge that can provide solutions or suggestions for original problems. It is freely available to all users at http://bioinformatics.cau.edu.cn/easygo/.
High-throughput technologies such as microarrays can measure thousands of biological entities simultaneously. Extracting the important biological facts from the results of such experiments is crucial, but has proven difficult for experimental biologists. Solving this problem requires a systematic annotation vocabulary that describes biological knowledge, together with tools that use such a vocabulary to uncover hidden knowledge automatically. The Gene Ontology (GO) annotation system [1] meets this requirement by providing a set of expert-curated terms that describe biological entities in three aspects (biological process, molecular function, and cellular component) and are organized into a hierarchical structure. Genes and microarray probe sets can be associated with GO terms according to the biological functions they perform or represent, and terms enriched in a GO-annotated list of genes or probe sets can be used to characterize the biological "theme" of the list. Many software packages and web servers have been developed for this purpose and are summarized in a recent paper [2]. However, from the vantage point of the agronomical research community, almost no tools have been designed to support agronomic species such as crops and farm animals; support is limited to a few model organisms [3–10]. In addition, many current tools display analysis results as tables or ranked lists [5–9, 11], which is less informative to users because the Gene Ontology is hierarchical in nature.
This paper presents EasyGO, a web-based tool that performs GO-based functional enrichment analysis for crop and farm animal species. It supports Affymetrix GeneChips for 12 plants and 3 farm animals, together with Arabidopsis and rice (indica and japonica) gene names. The annotation data for all GeneChip probe sets were regenerated by the best BLAST hit method to obtain better annotation coverage than the manufacturer-provided data offer, which makes EasyGO's service more informative. Analysis results, in the form of statistically enriched terms, are visualized within the rich structure of the GO hierarchical tree and thus become much more comprehensible. By focusing on these points, EasyGO is expected to be more suitable than other currently available tools for the needs of the agronomical research community.
Construction and content
EasyGO is a web-based tool, so no software installation is required. It is composed of two parts: a MySQL database containing GO annotation data for the supported data types, and server-side Perl scripts for functional enrichment analysis and results display. The R software [12] is used to perform statistical tests, and the dot program of the Graphviz software [13] is used to generate directed acyclic graphs.
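As a rough illustration of the graph-drawing step, the sketch below builds the kind of DOT text that the Graphviz dot program accepts for a small GO subgraph. It is written in Python purely for illustration (EasyGO itself uses server-side Perl scripts), and the function name `to_dot`, the placeholder parent term `ROOT`, the edges, and the colour scheme are all assumptions rather than details of EasyGO.

```python
from typing import Dict, List, Set


def to_dot(parents: Dict[str, List[str]], labels: Dict[str, str],
           enriched: Set[str]) -> str:
    """Build DOT source for a GO subgraph, filling enriched terms in orange.

    parents maps each child term to its parent terms (hypothetical layout).
    """
    lines = ["digraph go_subgraph {", "  node [shape=box, style=filled];"]
    for term in sorted(labels):
        color = "orange" if term in enriched else "white"
        # "\\n" emits a literal backslash-n, which dot renders as a line break.
        lines.append(
            f'  "{term}" [label="{term}\\n{labels[term]}", fillcolor={color}];'
        )
    for child in sorted(parents):
        for parent in sorted(parents[child]):
            # Assumption: edges are drawn from parent to child.
            lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Illustrative input only: ROOT and both edges are placeholders,
# not relationships taken from the ontology.
labels = {
    "ROOT": "placeholder parent term",
    "GO:0009266": "response to temperature stimulus",
    "GO:0006970": "response to osmotic stress",
}
parents = {"GO:0009266": ["ROOT"], "GO:0006970": ["ROOT"]}
dot_text = to_dot(parents, labels, enriched={"GO:0009266", "GO:0006970"})
print(dot_text)
```

The resulting text could be saved to a file and rendered with a command such as `dot -Tpng graph.dot -o graph.png`.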
Generation of GO annotation data
Currently, EasyGO supports Affymetrix GeneChips for 12 plant species (Arabidopsis, rice, wheat, maize, barley, sugar cane, soybean, poplar, medicago, citrus, cotton, and tomato) and 3 animal species (chicken, bovine, and porcine). We regenerated the GO annotation for the GeneChip probe sets to obtain better annotation coverage than could be achieved using manufacturer-provided data (a comparison of annotation coverage between the two sources of data is available online as Additional file 1). For this purpose, the best BLAST hit method [14] was used: GO annotation is transferred from an annotated sequence to an unannotated sequence if the annotated sequence is the top BLAST hit of the unannotated sequence under a certain E-value cutoff. Gene product GO annotations are available on the Gene Ontology Consortium website for some of the above species (Arabidopsis, rice, chicken, bovine, and UniProt multi-species GO annotations) and were downloaded in November 2006. Gene product sequences were retrieved from public sequence databases (TAIR, UniProt, Ensembl, and GenBank). These data were used to construct BLAST databases for annotating GeneChip probe sets. Consensus or exemplar sequences of GeneChip probe sets were searched with BLAST against the corresponding sequence databases, and top hits were selected using an E-value cutoff of 10⁻³⁰. Probe sets that failed to obtain top hits were searched again against a sequence database with a wider scope, and the same E-value cutoff was used to select top hits for them, so that more probe sets could be annotated. BLAST database selection and annotation status for all GeneChips are available online as Additional file 1.
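The two-pass annotation transfer described above can be sketched as follows. This is a minimal Python illustration, not the authors' pipeline: the function names, the tuple layout of a BLAST hit, and the toy identifiers are assumptions, and hits are assumed to be kept when their E-value is at or below the cutoff.

```python
from typing import Dict, Iterable, List, Set, Tuple

E_VALUE_CUTOFF = 1e-30  # cutoff stated in the text

# Assumed hit layout: (query_id, subject_id, e_value), e.g. parsed from BLAST output.
BlastHit = Tuple[str, str, float]


def best_hits(hits: Iterable[BlastHit],
              cutoff: float = E_VALUE_CUTOFF) -> Dict[str, str]:
    """Return the lowest-E-value subject for each query that passes the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, e_value in hits:
        if e_value > cutoff:
            continue  # hit is too weak to justify annotation transfer
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return {query: subject for query, (subject, _) in best.items()}


def transfer_annotation(probe_sets: List[str],
                        primary_hits: Iterable[BlastHit],
                        fallback_hits: Iterable[BlastHit],
                        go_annotation: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Copy GO terms from each probe set's top hit.

    The wider-scope (fallback) search is consulted only for probe sets
    that obtained no top hit in the primary search.
    """
    primary = best_hits(primary_hits)
    fallback = best_hits(fallback_hits)
    annotated: Dict[str, Set[str]] = {}
    for probe_set in probe_sets:
        subject = primary.get(probe_set) or fallback.get(probe_set)
        terms = go_annotation.get(subject, set()) if subject else set()
        if terms:
            annotated[probe_set] = set(terms)
    return annotated


# Toy data with hypothetical identifiers.
primary_hits = [("ps1", "geneA", 1e-50), ("ps1", "geneB", 1e-80), ("ps2", "geneC", 1e-10)]
fallback_hits = [("ps2", "protX", 1e-40), ("ps3", "protY", 1e-5)]
go_annotation = {"geneA": {"GO:0000001"}, "geneB": {"GO:0009266"}, "protX": {"GO:0006970"}}
result = transfer_annotation(["ps1", "ps2", "ps3"], primary_hits, fallback_hits, go_annotation)
print(result)
```

In the toy data, `ps2` has no primary hit passing the cutoff and is therefore annotated from the wider-scope search, while `ps3` stays unannotated because none of its hits passes the cutoff.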
Functional enrichment analysis
In EasyGO, functional enrichment analysis is done by finding GO terms with an unbalanced distribution between two groups of genes or probe sets. By default, EasyGO compares a query list with a previously computed background composed of all known genes for a species or all probe sets on a GeneChip. Users can instead submit a customized reference list when the default background is inappropriate (e.g., when only a portion of the probe sets is expressed in a typical expression profiling experiment and the expressed probe sets should serve as the background). The mapping count, which is the number of list entries annotated by a term, is calculated for each term in both lists. The "true path rule" is applied, so that each list entry contributes not only to the mapping counts of the terms assigned to it, but also to those of all parental terms on paths to the root term. The mapping counts are used to calculate each term's enrichment level in the query list, for which three statistical tests (binomial, χ², and hypergeometric) are available in EasyGO. In the binomial test, the annotation status of each query entry (whether or not the entry is annotated by a certain term) is regarded as a Bernoulli trial whose probability of success equals the frequency of annotated entries in the background (reference) list, so the P-value for the observed number of annotated query entries can be calculated from the resulting binomial distribution. In the χ² test, term mapping counts in the query and background (reference) lists are used to form a 2 × 2 contingency table, from which the difference between observation and expectation for each category is measured to derive a P-value from a χ² distribution with one degree of freedom. The hypergeometric test uses the hypergeometric distribution to calculate the probability of obtaining the contingency table created above by chance.
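The counting and testing steps above can be illustrated with a short Python sketch. It is not EasyGO's implementation (which relies on R for the tests): the function names and the toy ontology are hypothetical, the binomial and hypergeometric P-values are assumed to be one-sided (over-representation only), and the χ² test is assumed to be Pearson's test without continuity correction. Here k and n are the mapping count and size of the query list, and K and N those of the background list.

```python
import math
from typing import Dict, Iterable, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus every term on any path to the root."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def mapping_counts(entries: Iterable[str], annotation: Dict[str, Set[str]],
                   parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Count entries per term under the true path rule (each entry counted once per term)."""
    counts: Dict[str, int] = {}
    for entry in entries:
        propagated: Set[str] = set()
        for term in annotation.get(entry, ()):
            propagated |= ancestors(term, parents)
        for term in propagated:
            counts[term] = counts.get(term, 0) + 1
    return counts


def hypergeometric_p(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n entries are drawn without replacement from N, K of them annotated."""
    total = math.comb(N, n)
    upper = min(n, K)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, upper + 1)) / total


def binomial_p(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for n Bernoulli trials with success probability K / N."""
    p = K / N
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def chi_square_p(k: int, n: int, K: int, N: int) -> float:
    """Pearson chi-square P-value (1 d.f.) for the table [[k, n - k], [K, N - K]]."""
    denominator = n * N * (k + K) * (n + N - k - K)
    if denominator == 0:
        return 1.0  # a margin is empty, so the table carries no evidence
    statistic = (n + N) * (k * (N - K) - (n - k) * K) ** 2 / denominator
    # Survival function of the chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


# Toy ontology (hypothetical): B is a child of A; A and C are children of root.
parents = {"B": ["A"], "A": ["root"], "C": ["root"]}
annotation = {"g1": {"B"}, "g2": {"B", "C"}, "g3": {"C"}, "g4": {"A"}}
query_counts = mapping_counts(["g1", "g2"], annotation, parents)
background_counts = mapping_counts(["g1", "g2", "g3", "g4"], annotation, parents)
k, n = query_counts["B"], 2
K, N = background_counts["B"], 4
print(hypergeometric_p(k, n, K, N), binomial_p(k, n, K, N), chi_square_p(k, n, K, N))
```

Collecting the propagated terms in a set before counting ensures that an entry annotated with two terms sharing an ancestor contributes only once to that ancestor's mapping count.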
When the input list is compared with the previously computed background, or when it comprises a subset of the reference list, the enrichment problem is best modeled by the hypergeometric distribution. When the input list has little or no overlap with the reference list, the binomial and χ² tests are more appropriate. To account for multiple testing, a false discovery rate (FDR) correction [15] is performed on the P-values to control the proportion of falsely rejected hypotheses.
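As an illustration of the FDR step, the sketch below computes adjusted P-values with the Benjamini-Yekutieli procedure. The choice of this particular procedure is an assumption based on the Benjamini and Yekutieli paper in the reference list; the text does not spell out the exact computation, and the function name is hypothetical.

```python
from typing import List


def benjamini_yekutieli(p_values: List[float]) -> List[float]:
    """Return Benjamini-Yekutieli adjusted P-values in the original order."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic-sum factor that makes the procedure valid under dependency.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        value = min(1.0, p_values[index] * m * c_m / rank)
        running_min = min(running_min, value)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_yekutieli([0.01, 0.04, 0.03])
print(adjusted)
```

A term would then be reported as enriched when its adjusted P-value falls below the chosen FDR threshold.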
Utility and discussion
Web interface and results display
To demonstrate the use of EasyGO, a case study was performed using 168 Arabidopsis ATH1 GeneChip probe sets. The probe sets show coordinated expression levels in shoot tissue under cold treatment, and a plot of their expression levels can be seen in Figure 1a. By default, all non-control probe sets were used as the background, and the analysis was performed on the "biological process" aspect using the hypergeometric test. In the results display shown in Figures 1b and 1c, terms associated with stress response are enriched in the query list, for example GO:0009266 (response to temperature stimulus) and GO:0006970 (response to osmotic stress). This indicates that the query list contains many cold- and water stress-responsive genes, which agrees with current findings that cold- and water stress-response mechanisms have cross-talk in Arabidopsis [16].
Currently, the GO annotation data stored in EasyGO are derived from sequence similarity search. With this approach, the actual similarity between query sequence and top hit sequence may be of concern when judging the reliability of transferred GO annotation. To address this concern, the authors are planning to weight transferred GO annotation using the percentage of the sequence region conserved between query and top hit. Also, a suitable test method that operates with a weighted mapping count needs to be developed. In addition, to make EasyGO more widely applicable, the authors intend to include expression microarrays for agronomical species from other companies such as Agilent and Operon.
The research reported here has given researchers studying the molecular biology of crops and farm animals a new tool to interpret high-throughput experimental results, such as a list of probe sets from expression microarrays. As described above, minimum user effort is required to use EasyGO, and its analysis results are displayed in an easy-to-read style. The authors believe that in practice, EasyGO will meet the general requirements of users in this field and facilitate their research work.
Availability and requirements
EasyGO is freely available for all users at http://bioinformatics.cau.edu.cn/easygo/.
The authors would like to thank Haiyan Wu for helpful discussions and Yijie Ma for proofreading of the manuscript. This work was supported by the National Basic Research Program of China (grant no. 2006CB100100).
1. Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucl Acids Res. 2004, 32 (suppl_1): D258-261. 10.1093/nar/gkh036.
2. Dopazo J: Functional Interpretation of Microarray Experiments. OMICS: A Journal of Integrative Biology. 2006, 10 (3): 398-410. 10.1089/omi.2006.10.398.
3. Al-Shahrour F, Minguez P, Vaquerizas JM, Conde L, Dopazo J: BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucl Acids Res. 2005, 33 (suppl_2): W460-464. 10.1093/nar/gki456.
4. Berriz GF, King OD, Bryant B, Sander C, Roth FP: Characterizing gene sets with FuncAssociate. Bioinformatics. 2003, 19 (18): 2502-2504. 10.1093/bioinformatics/btg363.
5. Shah NH, Fedoroff NV: CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology. Bioinformatics. 2004, 20 (7): 1196-1197. 10.1093/bioinformatics/bth056.
6. Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics. 2004, 20 (4): 578-580. 10.1093/bioinformatics/btg455.
7. Carmona-Saez P, Chagoyen M, Tirado F, Carazo J, Pascual-Montano A: GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biology. 2007, 8 (1): R3. 10.1186/gb-2007-8-1-r3.
8. Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004, 20 (9): 1464-1465. 10.1093/bioinformatics/bth088.
9. Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biology. 2004, 5 (12): R101. 10.1186/gb-2004-5-12-r101.
10. Zhong S, Storch KF, Lipan O, Kao MC, Weitz CJ, Wong WH: GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Applied Bioinformatics. 2004, 3 (4): 261-264. 10.2165/00822942-200403040-00009.
11. Dennis G, Sherman B, Hosack D, Yang J, Gao W, Lane H, Lempicki R: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology. 2003, 4 (9): R60. 10.1186/gb-2003-4-9-r60.
12. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2006.
13. Graphviz software: [http://www.graphviz.org]
14. Jones CE, Baumann U, Brown AL: Automated methods of predicting the function of biological sequences using GO and BLAST. BMC Bioinformatics. 2005, 6: 272. 10.1186/1471-2105-6-272.
15. Benjamini Y, Yekutieli D: The Control of the False Discovery Rate in Multiple Testing Under Dependency. The Annals of Statistics. 2001, 29 (4): 1165-1188. 10.1214/aos/1013699998.
16. Yamaguchi-Shinozaki K, Shinozaki K: Organization of cis-acting regulatory elements in osmotic- and cold-stress-responsive promoters. Trends in Plant Science. 2005, 10 (2): 88-94. 10.1016/j.tplants.2004.12.012.