GEM-TREND: a web tool for gene expression data mining toward relevant network discovery

Background DNA microarray technology provides us with a first step toward the goal of uncovering gene functions on a genomic scale. In recent years, vast amounts of gene expression data have been collected, much of which are available in public databases, such as the Gene Expression Omnibus (GEO). To date, most researchers have been manually retrieving data from databases through web browsers using accession numbers (IDs) or keywords, but gene-expression patterns are not considered when retrieving such data. The Connectivity Map was recently introduced to compare gene expression data by introducing gene-expression signatures (represented by a set of genes with up- or down-regulated labels according to their biological states) and is available as a web tool for detecting similar gene-expression signatures from a limited data set (approximately 7,000 expression profiles representing 1,309 compounds). In order to support researchers to utilize the public gene expression data more effectively, we developed a web tool for finding similar gene expression data and generating its co-expression networks from a publicly available database. Results GEM-TREND, a web tool for searching gene expression data, allows users to search data from GEO using gene-expression signatures or gene expression ratio data as a query and retrieve gene expression data by comparing gene-expression pattern between the query and GEO gene expression data. The comparison methods are based on the nonparametric, rank-based pattern matching approach of Lamb et al. (Science 2006) with the additional calculation of statistical significance. The web tool was tested using gene expression ratio data randomly extracted from the GEO and with in-house microarray data, respectively. The results validated the ability of GEM-TREND to retrieve gene expression entries biologically related to a query from GEO. 
For further analysis, a network visualization interface is also provided, whereby genes and gene annotations are dynamically linked to external data repositories. Conclusion GEM-TREND was developed to retrieve gene expression data by comparing a query gene-expression pattern with those of GEO gene expression data. It could be a very useful resource for finding similar gene expression profiles and constructing their gene co-expression networks from a publicly available database. GEM-TREND was designed to be user-friendly and is expected to support knowledge discovery. GEM-TREND is freely available at http://cgs.pharm.kyoto-u.ac.jp/services/network.

Background
One of the major challenges in the post-genomic era is to understand how genes and their products interact to form functional networks. DNA microarray technology, which can simultaneously measure the expression of thousands of mRNAs, provides us with the first step toward the goal of uncovering gene functions on a genomic scale [1]. In recent years, vast amounts of gene expression data have been collected, much of which are available in public databases, such as the Gene Expression Omnibus (GEO) [2], ArrayExpress [3] and researchers' websites. These resources serve at least two purposes. One is as an archive of the data, which allows other researchers to confirm results that have already been published. A second use is to permit novel analyses of the data that go beyond what was envisioned or possible at the time of the original study [1,4-6]. However, to date, most researchers manually retrieve data from databases through web browsers using accession numbers (IDs) or keywords and then switch to other tools for further analysis (e.g. network analysis), hence the need to continually import, export and reformat data [7,8]. The data retrieved using keywords or IDs are also usually limited by experimental conditions such as microarray platform, reagent, and cell type. In recent years, gene-expression patterns have been introduced as a new strategy to connect different biological states, and several methods have been proposed to detect similarities among the gene-expression patterns of different biological states. Lamb et al. [9] introduced the Connectivity Map as a web tool to detect similar gene-expression signatures quantitatively within their original microarray dataset, which was generated under unified experimental conditions (cultured human cells treated with bioactive small molecules) by specific laboratory teams.
Other tools, L2L [10] and LOLA [11], have also been provided to compare users' data with published microarray data from different experimental conditions. As similarity metrics, L2L and LOLA use the co-occurrence of genes (the number of overlapping genes) between a query gene list and pre-defined lists of differentially expressed genes compiled from published microarray data, and thus cannot measure gene-expression patterns quantitatively. The existing mining tools therefore allow users to search gene expression data from public databases, but they are restricted by gene annotation, pre-selected gene lists, or experimental conditions. In order to detect similar gene-expression patterns across a public gene expression database, which consists of diverse data generated using different microarray platforms and by individual laboratory groups, we have developed a web tool named GEM-TREND (Gene Expression data Mining Toward Relevant Network Discovery) to automatically retrieve gene expression data across a wide range of microarray experiments in the publicly available GEO database by comparing gene-expression patterns between a query and the database entries. Subsequently, the system generates a gene co-expression network for the retrieved gene expression data, which may provide insights into unknown functional relationships of the genes.

Implementation
GEM-TREND runs on a Linux server (Intel Xeon 2.8 GHz, 4 GB RAM). It combines a MySQL database management system (5.0.22), which stores the pre-processed data, with a dynamic web interface based on PHP (5.2.3). Data processing is performed using PHP and the R statistical package (2.5.1), and graphical representations are generated using a Java Applet graphical user interface. GEM-TREND provides both gene-expression pattern-based and text-based searches to retrieve gene expression data from GEO. For the former, the input data can be gene-expression signatures, represented by a set of genes with up- or down-regulated labels, or gene expression ratio data. For text-based searches, the input data can be keywords and accession IDs. Retrieved gene expression data can then be viewed as a co-expression network with gene ontology (GO) annotation, whereby genes and annotations are dynamically linked to external data repositories.

Construction of reference gene expression profiles
The current system stores a wider spectrum of reference gene expression profiles compared to the Connectivity Map. In this study, the reference gene expression profiles were constructed as described below: (1) Gene expression data annotated as treatment instances (i.e. treatment versus control) were extracted from the GEO database, amounting to 1540 GEO series and 41516 samples; (2) For each sample, genes were ranked in descending order, according to the log ratio of treatment to control; (3) Varying gene names/IDs dependent on microarray platforms were converted to UniGene IDs in accordance with the respective gene annotation files.
These steps are schematically illustrated in Fig. 1. Samples lacking the associated annotation were filtered out, hence resulting in a total of 995 GEO series and 25974 samples. Table 1 summarizes the numbers of platforms, series and samples for each species. These samples were stored in a MySQL database as reference gene expression profiles.
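The ranking and ID conversion in steps (2) and (3) can be sketched as follows. This is a minimal illustration rather than the GEM-TREND implementation: the function name, the dictionary-based inputs, the base-2 logarithm, the rule for probes that map to the same gene, and the IDs in the usage note are all assumptions, and the sketch converts IDs before ranking for brevity.

```python
import math


def build_reference_profile(treatment, control, probe_to_unigene):
    """Return gene IDs ranked in descending order of log2(treatment / control).

    treatment, control: dicts mapping a platform probe ID to an intensity.
    probe_to_unigene: dict mapping a probe ID to a UniGene ID (annotation file).
    """
    log_ratios = {}
    for probe, t_val in treatment.items():
        c_val = control.get(probe)
        gene = probe_to_unigene.get(probe)
        # Skip probes lacking a control value, an annotation, or a positive intensity.
        if c_val is None or gene is None or t_val <= 0 or c_val <= 0:
            continue
        ratio = math.log2(t_val / c_val)
        # Assumption: if several probes map to one gene, keep the largest |log ratio|.
        if gene not in log_ratios or abs(ratio) > abs(log_ratios[gene]):
            log_ratios[gene] = ratio
    # Descending log ratio; ties are broken by gene ID to keep the order deterministic.
    return sorted(log_ratios, key=lambda g: (-log_ratios[g], g))
```

For example, with hypothetical probes `p1` to `p4` where only `p1` to `p3` carry an annotation, the unannotated probe is dropped and the remaining genes are ordered from most up-regulated to most down-regulated.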

Gene expression data search
Gene-expression pattern-based search GEM-TREND provides a gene-expression pattern-based search, by which we can explore reference gene expression data that resemble a given query in terms of pattern. The similarity is measured by the nonparametric, rank-based pattern matching approach of Lamb et al. [9]. In brief, Kolmogorov-Smirnov (KS) scores are calculated for both the up-regulated gene set (KS_up) and the down-regulated gene set (KS_down) of the query, and these scores are integrated into a single score on the basis of the magnitude and signs of KS_up and KS_down (see Fig. 1 and Ref. [9] for the detailed calculation). Note that the gene expression profiles derived from multiple chips in the same experiment are counted as different hits. GEO samples corresponding to the reference profiles are then ranked in descending order of scores. Samples with larger positive scores are considered to be more closely correlated with the query, and vice versa.
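As an illustration of how KS_up and KS_down might be combined, the sketch below follows the commonly described Connectivity Map formulation (a signed KS statistic per gene set, with the combined score set to zero when both statistics share a sign). It is an assumption that GEM-TREND combines the scores in exactly this way, no normalization across profiles is applied, and the function names are hypothetical.

```python
def ks_score(tag_genes, ranked_genes):
    """Signed KS-like statistic of a gene set against a ranked reference profile."""
    n = len(ranked_genes)
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}  # 1-based ranks
    # Ranks of the query genes that are present in the reference, in ascending order.
    ranks = sorted(position[g] for g in tag_genes if g in position)
    t = len(ranks)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(ranks))
    b = max(v / n - j / t for j, v in enumerate(ranks))
    return a if a > b else -b


def similarity_score(up_genes, down_genes, ranked_genes):
    """Combine KS_up and KS_down into one score for a single reference profile."""
    ks_up = ks_score(up_genes, ranked_genes)
    ks_down = ks_score(down_genes, ranked_genes)
    # Statistics with the same sign give no consistent direction, so the score is zero.
    if ks_up * ks_down > 0:
        return 0.0
    return ks_up - ks_down
```

With this sketch, a query whose up-regulated genes sit at the top of a reference profile and whose down-regulated genes sit at the bottom receives a large positive score, while the reversed query receives the corresponding negative score.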
GEM-TREND accepts a maximum of 500 genes as a query gene-expression signature. If the query is given in the form of gene expression ratio data, its signature is generated automatically by GEM-TREND. Specifically, all genes are ranked in descending order according to their absolute ratio value, and genes exhibiting more than a 2-fold change are selected from the 500 top-ranked genes. Finally, the selected genes are divided into up- and down-regulated sets according to the signs of their values.
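The signature generation described here can be sketched as below. The sketch assumes that the ratio data are log2 ratios, so that a 2-fold change corresponds to an absolute value greater than 1; the function and parameter names are hypothetical.

```python
def make_signature(log2_ratios, max_genes=500, min_abs_log2=1.0):
    """Derive up- and down-regulated gene lists from log2 ratio data.

    log2_ratios: dict mapping a gene ID to its log2(treatment / control) value.
    """
    # Rank all genes by absolute ratio, largest first (ties broken by gene ID).
    ranked = sorted(log2_ratios, key=lambda g: (-abs(log2_ratios[g]), g))
    top = ranked[:max_genes]
    # Keep only genes changing by more than 2-fold (|log2 ratio| > 1 by assumption).
    selected = [g for g in top if abs(log2_ratios[g]) > min_abs_log2]
    up = [g for g in selected if log2_ratios[g] > 0]
    down = [g for g in selected if log2_ratios[g] < 0]
    return up, down
```

A gene whose change is exactly 2-fold is excluded under this reading of "more than 2-fold", which is a design choice of the sketch.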
In order to detect and reduce false negatives, we propose calculating the p-value associated with a similarity score using a randomization test. The procedure is as follows (Fig. 2): (1) Given a query signature Q consisting of u up-regulated genes and v down-regulated genes, calculate the numbers of up- and down-regulated genes that overlap between Q and a reference profile R consisting of n genes; let the numbers be u' (≤ u) and v' (≤ v), respectively; (2) Select u' and v' genes sequentially and randomly from the n genes of R without replacement, and construct a random signature; (3) Calculate the similarity score (random score) between R and the random signature; (4) Repeat steps 2 and 3 to generate a total of 10,000 random scores; (5) Estimate the p-value associated with the similarity score between Q and R as the proportion of random scores that are no less than the observed similarity score.
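Steps (1) to (5) can be sketched as a generic randomization test. In this sketch `score_fn` stands for any similarity function that takes an up-regulated list, a down-regulated list and a ranked reference profile, such as the KS-based score described in the text; `mean_rank_score` is a hypothetical stand-in used only to keep the example short, and the function names and the `seed` parameter are assumptions.

```python
import random


def randomization_p_value(up_genes, down_genes, ranked_genes, score_fn,
                          n_iter=10000, seed=None):
    """Return (observed score, p-value) for query Q against reference R."""
    rng = random.Random(seed)
    present = set(ranked_genes)
    # Step 1: numbers of query genes that overlap with the reference (u' and v').
    n_up = sum(1 for g in up_genes if g in present)
    n_down = sum(1 for g in down_genes if g in present)
    observed = score_fn(up_genes, down_genes, ranked_genes)
    hits = 0
    for _ in range(n_iter):
        # Step 2: draw u' + v' distinct genes from R and split them into up/down sets.
        drawn = rng.sample(ranked_genes, n_up + n_down)
        # Step 3: score the random signature against the same reference.
        random_score = score_fn(drawn[:n_up], drawn[n_up:], ranked_genes)
        if random_score >= observed:
            hits += 1
    # Step 5: proportion of random scores that are no less than the observed score.
    return observed, hits / n_iter


def mean_rank_score(up_genes, down_genes, ranked_genes):
    """Hypothetical stand-in score: positive when up genes rank high and down genes low."""
    pos = {g: i for i, g in enumerate(ranked_genes)}
    up_pos = [pos[g] for g in up_genes if g in pos]
    down_pos = [pos[g] for g in down_genes if g in pos]
    if not up_pos or not down_pos:
        return 0.0
    return (sum(down_pos) / len(down_pos) - sum(up_pos) / len(up_pos)) / len(ranked_genes)
```

Because the random signatures contain the same numbers of genes as the overlap between Q and R, the null distribution reflects how many query genes could actually be located in that particular reference profile.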
Text-based search GEM-TREND also allows users to search gene expression data by text (i.e. keywords, platform IDs, and series IDs). For this purpose, an N-gram based search engine is used, and the GEO series title, series summary, platform IDs, and series IDs are used as search fields.

Network generation and cluster analysis
In order to help delineate the relationships between genes, gene expression data retrieved by GEM-TREND are converted into a gene co-expression network that can identify functionally related genes using GEO series data, based on Pearson correlation coefficients and K-means clustering. First, the pairwise correlation coefficients are calculated for genes with more than 2-fold changes in expression levels, and these genes are then clustered into N clusters using K-means clustering. N is determined using the DB index [12]. The cluster number with the largest DB is considered as N. Each gene represents a node and is connected to the other genes in the same cluster based on correlation coefficients; sub-networks corresponding to clusters are thus generated and subsequently inter-connected based on the Euclidean distance between them. In this way, the network is constructed. To reduce false positive links and to keep the graph size reasonable, the threshold of correlation coefficients and Euclidean distance is set to 0.92. Furthermore, each gene that appears on the network is annotated based on the associated GO term [13].

Table 1 The data recently used as reference gene expression profiles in GEM-TREND.
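The within-cluster linking step can be sketched as follows. This is a simplified illustration under stated assumptions: cluster labels are taken as given (the K-means clustering and the DB index selection are omitted), the inter-cluster links based on Euclidean distance are not built, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation; treat as unlinked
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(profiles, clusters, threshold=0.92):
    """Link genes in the same cluster whose correlation reaches the threshold.

    profiles: dict mapping a gene ID to its expression values across samples.
    clusters: dict mapping a gene ID to a cluster label (assumed precomputed).
    """
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if clusters[g1] != clusters[g2]:
            continue  # inter-cluster links are outside the scope of this sketch
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges
```

Each returned tuple holds the two gene IDs and their correlation, which is enough to draw one sub-network per cluster.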

Results and Discussion
Overview of GEM-TREND GEM-TREND is designed to be user-friendly. Only a few simple steps are required to search GEO gene expression data and visualize the network. The main page of the GEO gene expression data search comprises a query input area (Fig. 3a) and a results area (Fig. 3b). For a GEO gene expression data search, both gene-expression pattern-based searches (with either gene-expression signatures or gene expression ratio data as inputs) and text-based searches (accepting keywords, platform IDs, or series IDs as inputs) are available, but similarity scores and p-values are calculated only for gene-expression pattern-based searches. To allow further analysis of the retrieved data (e.g. network analysis), GEM-TREND returns the GEO series, which links together a group of related samples, instead of the individual reference gene expression profiles. The results consist of the GEO series ID (GSE ID), GEO platform ID (GPL ID), series title, similarity score, and p-value, and are displayed in the results area.
Here, the similarity score of a GEO series is the maximum similarity score among the samples in that series. The full series title is displayed as a tool-tip when the mouse is over the title, and each series is linked to GEO through its GSE ID or GPL ID (Fig. 3e). In addition, the series of interest can be selected for further processing. Both search results and selected series can be downloaded in CSV format.

Figure 1 The procedure of reference gene expression profiles construction and similarity score calculation.

Genes in each series can be viewed as a co-expression network by clicking the network icon (Fig. 3b). The network with GO annotation is shown on the network visualization page that comprises three major parts: the network graphical display area (Figs. 3c-1, 3d-1), the cluster information area (Figs. 3c-2, 3d-2), and the gene search window (Figs. 3c-3, 3d-3). The network graphical display area dynamically shows the full or sub-network according to the user's operation. On the network, genes from the query are highlighted, and the gene name is displayed as a tool-tip when the mouse is over the node (gene ID). The neighbor nodes (genes) can be expanded or hidden by right clicking on the node of interest to bring up a pop-up menu. In the cluster information area, clusters including their member genes are shown under the Gene Cluster tab. Users can click the Cluster Name to view the sub-network which includes co-expression genes, and click the UniGene ID to access the UniGene database [14] (Fig. 3f). Under the GO tab, the top three significant shared GO terms of genes in each of the ontologies (cellular component, biological process, molecular function) are shown for each cluster (Fig. 3d-2). The genes will be highlighted on the displayed network once the common function or process they perform is selected, and they also have a link to GO [13] (Fig. 3g). In addition, users can search for a gene of interest in the network using an ID or gene name through the gene search window. GEM-TREND also allows users to retrieve the previous results of both GEO data searches and network visualizations using the JOB ID and the network ID (the IDs are valid for two weeks) (See additional file 1-PDF-User guide for GEM-TREND for example).

Validation of GEM-TREND
To validate whether GEM-TREND could retrieve the gene expression entries biologically related to a query, we evaluated the similarity of biological annotations between the query and the retrieved microarray data by using their MeSH terms. As a biomedical vocabulary thesaurus, MeSH [15] is used by the National Library of Medicine (NLM) to index articles for the MEDLINE/PubMed database [16]. NCBI's Entrez link system [17] connects GEO data with related literature in PubMed. We thereby assigned biological annotations in MeSH terminology to each entry of GEO microarray data via the related literature, so that we can estimate the biological relationship between a query and the data retrieved using their expression patterns. The validation was carried out using a set of 100 human samples (gene expression ratio data) randomly extracted from GEO as queries, as follows. First, for each query, GEM-TREND produced a ranked list of gene expression data (GEO series), with similarity scores and P-values estimated from the gene-expression patterns. Subsequently, we calculated the constituent ratio of the query's MeSH terms to those of the top-ranked expression data under each selection criterion (top 10, 30 or 50 ranks, with or without the P-value filter). Fig. 4 shows the distribution of the ratio of a query's MeSH terms in the retrieved top-ranked entries. The distribution was generated by the retrievals of 100 randomly selected queries. As shown in Fig. 4, the peaks shifted to the right in the order of total, top 50, 30 and 10 entries. Importantly, filtering by P-value increased the ratio of the query's MeSH terms contained in the top-ranked datasets more efficiently, indicating that the implemented P-value promotes more effective exploration. These results demonstrate that GEM-TREND could retrieve biologically relevant microarray data across a wide range of microarray experiments in GEO by detecting the similarity of gene-expression patterns.

Figure 2 The method for P-value calculation. (1) Calculate the numbers of up- and down-regulated genes that overlap between Q and a reference profile R; let the numbers be u' (≤ u) and v' (≤ v), respectively. (2) Select u' and v' genes sequentially and randomly from the n genes of R without replacement, and construct a random signature. (3) Calculate the similarity score between R and the random signature. (4) Generate a total of 10,000 random scores by repeating steps 2 and 3. (5) The P-value associated with the similarity score (query score) between query Q and reference R is the proportion of random scores that are no less than the observed similarity score (query score).

Figure 3 Screenshot of GEM-TREND.

Figure 4 The distribution of the ratio of the query's MeSH terms in the top-ranked entries for 100 randomly selected queries. The groups in different colors are the top 50 entries without a P-value filter, top 50 entries with a P-value <= 0.01, top 30 entries without a P-value filter, top 30 entries with a P-value <= 0.01, top 10 entries without a P-value filter, top 10 entries with a P-value <= 0.01, and total entries, respectively. The total entries represent all human species microarray series (corresponding to 444 series).

Table 2 (note) The results were retrieved from human species reference profiles.
For further validation, we next used three types of in-house microarray data, which we previously reported but did not deposit in GEO, as the query examples: query-1) microarray data of human bladder cancer (Additional file 2-CSV-Gene expression profile of human bladder cancer) [18]; query-2) microarray data of rat chemical hepatocarcinogenesis (Additional file 3-CSV-Gene expression profile of rat chemical hepatocarcinogenesis) [19]; and query-3) microarray data of mouse mast cells pooled from stomach subregions (Additional file 4-CSV-Gene expression profile of mast cells pooled from mouse stomach subregions) [20]. In the score-ordered results of query-1 (P-value < 0.01), GSE1827 (titled "Waldman Bladder tumors") was ranked fourth. Moreover, the top 10 entries showed appropriate annotations related to tumors and to inflammatory and immune responses (Table 2). For query-2 (P-value < 0.01), all of the top 5 entries were related to chemical-treated experiments, and seven of the top 10 entries were obtained using rat liver samples (Table 3). The biological relationships among the top 10 results of query-3 (P-value < 0.02) were not clear, but GSE6192 (titled "Gene expression changes during murine mucosal mast cell in vitro differentiation") was found at the twelfth rank (Table 4). These findings indicate the general applicability of GEM-TREND to external microarray queries independent of GEO. For further analysis, we generated gene co-expression networks from GSE1827 (a series of GEO microarray data), which was one of the query-1 results (Fig. 5). GEM-TREND can thus provide bladder tumor-associated networks starting from query-1, which consists of only two DNA chips. Note that, in general, a number of congeneric microarray datasets are required to construct gene co-expression networks.
Thus, GEM-TREND can help comprehensive re-analysis of the primary data by merging data from multiple studies and provide insights into unknown functional relationships of the genes.

Table 3 (note) The results were retrieved from rat species reference profiles.

Table 4 (note) The results were retrieved from mouse species reference profiles.

GEM-TREND can be considered as an extension of the Connectivity Map and a supplementary tool of GEO.
Figure 5 A co-expression network generated using GSE1827 data. The yellow-colored genes are categorized as GO:0003700 (transcription factor activity). Interestingly, these are hub genes or their neighbor genes in the sub-network, suggesting that the transcription factors might be key molecules for bladder tumors.
Compared to other microarray comparison tools such as L2L and LOLA, GEM-TREND has unique features in its data resources, search method and main focus. The existing web tools use pre-annotated lists of a limited number of genes (only a fraction of all available microarray data) as reference data, whereas GEM-TREND computes directly on the complete raw data from the GEO microarray resource, suggesting that GEM-TREND can access a greater amount of public raw data. As for the search method, the existing tools compare microarray data only by counting the number of overlapping genes, whereas GEM-TREND uses a gene-expression pattern matching algorithm based on nonparametric, rank-based statistics. Moreover, whereas the existing tools interpret new data using biologically significant genes annotated with published information, GEM-TREND focuses on data retrieval from GEO and on gene-network analyses using GO annotation. GEM-TREND would thus be a unique and useful web tool to help researchers utilize the GEO database more effectively and to support knowledge discovery.

Conclusion
GEM-TREND was developed to retrieve gene expression data by comparing the gene-expression pattern of queries with those of gene expression data in a public database based on the nonparametric, rank-based pattern matching approach with the additional calculation of statistical significance and to provide network visualization. It could be a very useful resource for finding similar gene expression profiles in an available public database and generating the associated co-expression networks. GEM-TREND was designed to be user-friendly and is expected to support knowledge discovery by providing a new means of data retrieval.
In future, the reference data will be automatically updated from GEO and other public databases. We also intend to find other appropriate ways to solve the limitations of false negatives caused by missing UniGene IDs and improve search speed.