Genome sequencing projects have made available whole genome sequences of hundreds of different organisms. These valuable resources have reshaped the landscape of biology, and of genetics in particular. Using these genome sequences, researchers have predicted thousands to tens of thousands of genes in a typical eukaryotic genome. How these genes function in an organism, however, is not immediately clear from the sequence alone. Developing better testable hypotheses requires the functional characterization of the predicted genes. This is a well-recognized bottleneck for geneticists, even those working with the most established genetic model organisms such as the nematode Caenorhabditis elegans. A particular challenge is that any given genome contains a large number of genes, and a large number of genes cannot be characterized in detail quickly. Consequently, the careful selection of genes for functional characterization is of particular importance in reverse genetic approaches.
C. elegans is one of the favorite organisms for large-scale reverse genetic screens. This is mainly due to the ability to perform RNAi experiments by feeding and the availability of an almost genome-wide RNAi library for such experiments. Consequently, genome-wide RNAi screens have been done for a number of phenotypes including survival, growth, cell division, longevity, fat storage and others [3–13]. Even though RNAi experiments are straightforward in C. elegans, genome-wide screens remain a challenge because of the large number of genes, and they are effectively limited to phenotypes that can be scored quickly. Genome-wide screens also completely ignore information about gene function that is already available. Selecting candidate genes using this additional information can reduce the number of genes significantly and allows screens for more sophisticated phenotypes, which tend to be more labour intensive and difficult to scale up. One example is screening for axon navigation defects, which has recently been done with RNAi, but not on a genome-wide scale. Our database is designed to assist with the experimental design of large-scale reverse genetic experiments, in C. elegans in particular, since the dataset is currently limited to C. elegans genes.
Several lines of evidence can be used to infer the function of an uncharacterized protein. Most important are sequence similarities to known proteins, either overall similarity or at least the presence of functionally characterized protein domains. For completely uncharacterized proteins, this is typically the only information available. A number of protein domain databases exist. Well-established ones include ProDom, Pfam, SMART and InterPro, which integrate a large number of data sets from various sources. All these databases place their major emphasis on the protein domains, and their search and display interfaces tend to be centered on them. Consequently, it is straightforward to obtain a list of all proteins containing a particular domain, but more difficult or impossible to perform more sophisticated searches.
Additional data sets that help to elucidate gene function are expression data, either from DNA microarray experiments, SAGE experiments or even from large-scale reporter gene expression studies [19, 20]. In C. elegans, SAGE data obtained from cells and tissues purified by FACS have been used to establish transcriptional profiles of the intestine [21, 22], groups of neurons or even individual neurons. In addition, stage-specific SAGE libraries have been generated [25, 26]. Databases and web servers exist to probe and examine the corresponding data sets. The Stanford Microarray Database is probably the most prominent site allowing users to analyse microarray data. Among other things, it has been used to correlate expression patterns across a large number of microarray experiments from different species to identify genes belonging to the same pathway. Gene Recommender is a novel tool that allows researchers to exploit the microarray data set to identify genes that are regulated in a similar fashion to a set of candidate genes given as input. The multiSAGE web site allows access to the C. elegans SAGE data sets mentioned above. Most of these databases hold only one type of data (e.g. microarray data). Essentially, only the organism-specific databases and web sites allow some access to integrated data sets. Every genome-scale experiment, such as a microarray experiment, leaves the experimenter with a list of genes fulfilling particular experimental criteria. This list tends to be quite large (several hundred or even thousands of genes) and has to be narrowed down further, or at least grouped, for further analysis. The Gene Ontology (GO) project has emerged as the quasi-standard for functionally grouping large sets of genes. In the absence of any other information, proteins are tagged with GO terms based on protein domains with recognizable functions, such as kinase domains.
The GO vocabulary is rather extensive, to the point that special viewers exist just to browse it, and this makes it difficult to use the vocabulary directly in simple interactive searches. Furthermore, since many protein domains carry information about biochemical function but not biological function, the current situation with respect to meaningful functional grouping of proteins is somewhat unsatisfactory. Consequently, any further analysis of large sets of genes from genome-scale experiments requires human input and intervention and therefore benefits from a simple, easy-to-use user interface.
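As a minimal illustration of the functional grouping discussed above, the following Python sketch groups a gene list by annotated term. It is an illustration under stated assumptions, not a description of any existing GO tool: the function name group_by_term, the gene identifiers and the term labels are all hypothetical placeholders, and real GO annotations form a hierarchy that this flat grouping ignores.

```python
from collections import defaultdict


def group_by_term(annotations):
    """Group genes by annotated term.

    `annotations` maps a gene identifier to a set of term labels.
    Genes without any term are collected under "unannotated".
    All identifiers and labels used here are hypothetical placeholders.
    """
    groups = defaultdict(list)
    # Iterate in sorted order so that each group lists genes deterministically.
    for gene in sorted(annotations):
        terms = annotations[gene]
        if not terms:
            groups["unannotated"].append(gene)
            continue
        for term in terms:
            groups[term].append(gene)
    return dict(groups)


# Hypothetical gene list, e.g. the output of a genome-scale experiment.
example_annotations = {
    "gene-A": {"kinase activity"},
    "gene-B": {"kinase activity", "axon guidance"},
    "gene-C": set(),
}

example_groups = group_by_term(example_annotations)
for term in sorted(example_groups):
    print(term, example_groups[term])
```

Even in this toy case, a term such as "kinase activity" describes a biochemical rather than a biological function, which is why the resulting groups usually still need to be inspected by the researcher.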
The major integrated database for C. elegans genes is WormBase. Its history lies in the genome sequencing project, and it has sophisticated user interfaces to access and display features at the DNA level. Data above the DNA level are organized around genes, and the major user interface at this level displays all the information and data sets related to a particular gene. Large-scale data mining and searches across different data sets are possible using a special search interface (WormMart), but the response time is slow and only selected data are accessible in this way. For many data sets at the protein level, such as the presence and location of protein domains, WormBase displays the raw data from competing prediction programs, leaving the interpretation and integration to the user. This is in contrast to data at the DNA level, where the output of various gene prediction programs is integrated and only one gene model is presented. In short, even though all kinds of data related to genes and proteins are contained in WormBase, not all data sets are equally accessible and not all are displayed in the most useful way. Missing in particular is a multi-gene interface to display data at the protein level.
A major goal of GExplore is to provide a simple and fast search and display interface that allows a multi-gene display of large data sets. Searches are generally executed within seconds. The result can be surveyed quickly and the search parameters adapted. In fact, the speed and simplicity of the output allow the researcher to quickly probe any of the underlying data sets for usefulness. Researchers with their own data, e.g. a list of genes from their own genome-scale experiments, can simply paste this list of genes (up to several thousand) into the gene search field and start searching. The underlying database is currently limited to selected data sets relevant for predicting gene/protein function. It includes a search and display interface for protein domains, combined with data sets on gene expression (microarray and SAGE) and phenotype information. In addition, GO terms linked to the genes are available for combinatorial searches. Currently the database is limited to C. elegans genes, but the overall structure is flexible enough to allow expansion of the database to incorporate data from other organisms in the future.
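The idea of a combinatorial search over a pasted gene list can be sketched in a few lines of Python. This is only a sketch under stated assumptions and does not reflect the actual GExplore implementation or schema: the record layout (GeneRecord), the function names parse_gene_list and search, and all gene identifiers, domain names, terms and tissues are hypothetical placeholders.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    # Hypothetical record layout; not the actual GExplore schema.
    gene_id: str
    domains: frozenset
    go_terms: frozenset
    expressed_in: frozenset


def parse_gene_list(text):
    """Split a pasted gene list on whitespace, commas or semicolons."""
    return [token for token in re.split(r"[\s,;]+", text.strip()) if token]


def search(records, gene_ids=None, domain=None, go_term=None, tissue=None):
    """Return records matching every criterion that is not None."""
    wanted = set(gene_ids) if gene_ids is not None else None
    hits = []
    for rec in records:
        if wanted is not None and rec.gene_id not in wanted:
            continue
        if domain is not None and domain not in rec.domains:
            continue
        if go_term is not None and go_term not in rec.go_terms:
            continue
        if tissue is not None and tissue not in rec.expressed_in:
            continue
        hits.append(rec)
    return hits


# Toy data set with placeholder identifiers and annotations.
RECORDS = [
    GeneRecord("gene-A", frozenset({"kinase"}),
               frozenset({"phosphorylation"}), frozenset({"neuron"})),
    GeneRecord("gene-B", frozenset({"kinase"}),
               frozenset({"phosphorylation"}), frozenset({"intestine"})),
    GeneRecord("gene-C", frozenset({"homeobox"}),
               frozenset({"transcription"}), frozenset({"neuron"})),
]

pasted = "gene-A, gene-C\ngene-B"
hits = search(RECORDS, gene_ids=parse_gene_list(pasted),
              domain="kinase", tissue="neuron")
print([rec.gene_id for rec in hits])
```

Each criterion is optional and the criteria are combined with a logical AND, so a search can be narrowed or widened by adding or dropping a single parameter and rerunning it.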