IntegromeDB web interface
The IntegromeDB report page uses a two-dimensional, newspaper-like layout (Figure 2) rather than the linear list of results returned by standard search engines (e.g., Google). Data are grouped by data source and similarity and, within each group, sorted by relevance to the query gene/protein, as shown in Figures 2, 3, 4 and 5. The report page also provides a list of topics related to the query gene/protein (Figure 4), together with links to the corresponding data sources. The left panel of the report page (Figure 3) displays, for each gene/protein, aliases and external IDs, ontological terms, chromosome localization with links to the NCBI and UCSC genome browsers, orthologs/homologs from multiple organisms, protein and genomic sequences, and related publications. The right panel (Figures 4, 5) shows interaction data and pathways (from KEGG [25], NCI [26], REACTOME [27], BioCARTA [28]), experiments (e.g., expression, metabolomics and proteomics data from GEO [29] and ArrayExpress [30]), miRNA data (from miRBase [31] and microRNA.org [32]), images (e.g., protein structures from PDB [33] and Wikipedia [34]), and relative mRNA expression frequencies derived from various cell and tissue types (from descriptions/metadata of experiments in GEO and ArrayExpress).
For example, for the query 'p53', the report page (Figures 2, 3, 4 and 5) shows that p53 is involved in apoptosis, cancer, the prostaglandin metabolism pathway, and the MAPK, Wnt, cell cycle and other canonical pathways. Detailed information is provided for each pathway, including the genes, proteins, and small molecules involved in it. For the prostaglandin metabolism pathway, for instance, the user can learn that it metabolizes lipids into prostaglandins and plays an important role in pain and inflammation; that the protein encoded by the human PTGS1 gene is involved in the conversion of prostaglandin PGG2 into inflammation-causing prostaglandin PGH2; and that aspirin has been shown to bind to the PTGS1 gene product (prostaglandin-endoperoxide synthase 1), blocking the ability of this enzyme to produce PGH2 and thereby reducing pain and inflammation.
Web data
For each gene/protein, information retrieved from web pages is clustered by data source (Figure 3). The user can scan the content of a web page by rolling over the respective term (e.g., 'Binary Interactions', 'Sequence Annotation', 'Structure') at the bottom of the snippet under 'In this topic'. For example, for the 'p53' query, the most relevant (valuable) resources were OMIM, UniProt, UniGene, GenBank, PubMed, Wikipedia, GeneCards, InterPro, p53.free.fr (mutation database), SYSTERS (protein families database), and tp53.org. They were followed by (accessible by clicking the 'Show More Pages' button) EMBL/EBI, MGD, STRING, WikiGenes, Genetic Atlas, CancerIndex, ProteinAtlas, KEGG, UCSC Genome Browser, HAGR (Human Ageing Genomic Resource), SwissProt, PharmGKB and others.
Table data
In contrast to table data stored in listed and well-maintained databases, the tables published in PubMed articles, as well as separate tables distributed across the web, are barely searchable by any search engine. Here, these data are integrated and made searchable using the same approach that is applied to the tables in the databases. Specifically, this applies to tabular data on gene regulatory regions, gene/protein interactions, and gene expression experiments.
To integrate the tables, IntegromeDB looks for web pages containing the HTML 'Table' tag and distinguishes relational tables (which are parsed further) from non-relational ones (which cannot be integrated), using a combination of in-house and statistical classifiers, e.g., calculating the percentage of ontology terms in each column of the table; the classifiers have been trained on manually selected examples and use the information already integrated in IntegromeDB. If a column containing a significant percentage (an empirically defined value) of object IDs from IntegromeDB is found, the table is processed further; otherwise, it is filtered out. For example, a two-column table containing gene names and promoter sequences will be identified as a relational table because the percentage of ontologically defined objects in the first (gene names) column is high. Data from the other column will be considered gene attributes if the type of the data can be defined or named; that is, if the column contains a data label in the first row. Thus, if in the example table the first cell of the second column contains the word 'promoter' and the other cells contain DNA sequences, the postprocessor will infer that the data in this column are promoter sequences. It will therefore scan the upstream region of the respective gene from the first column and, if it finds the matching sequence, record its chromosomal localization in global genomic coordinates and assign the sequence in the corresponding cell to the gene as an attribute of the type promoter_sequence. If the sequence is not found, the postprocessor will assign the sequence as a new attribute with the name provided in the first cell of the second column, and the type of the attribute will be sequence. Figure 4E shows the gene regulatory data found for the 'p53' query in the databases that provide information on gene regulatory regions and transcription factors.
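The column test and the promoter post-processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the IntegromeDB implementation: the function names, the 0.8 threshold, and the exact-match scan on the forward strand are all hypothetical, and the combination of in-house and statistical classifiers is reduced to a single ID-fraction check.

```python
from typing import Dict, List, Optional, Set


def id_fraction(cells: List[str], known_ids: Set[str]) -> float:
    """Fraction of non-empty cells that match an already integrated object ID."""
    values = [c.strip() for c in cells if c.strip()]
    if not values:
        return 0.0
    return sum(v in known_ids for v in values) / len(values)


def find_key_column(rows: List[List[str]], known_ids: Set[str],
                    threshold: float = 0.8) -> Optional[int]:
    """Index of the best ID column, or None if the table looks non-relational.

    rows[0] is treated as the header row; the threshold is an assumed value.
    """
    body = rows[1:]
    if not body:
        return None
    n_cols = min(len(r) for r in body)
    if n_cols == 0:
        return None
    scores = [id_fraction([r[i] for r in body], known_ids) for i in range(n_cols)]
    best = max(range(n_cols), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None


def annotate_sequence(label: str, sequence: str, upstream: str,
                      upstream_start: int) -> Dict[str, object]:
    """Type a sequence cell by looking for it in the gene's upstream region.

    upstream_start is the genomic coordinate of upstream[0]
    (forward strand and exact matching are assumed).
    """
    offset = upstream.upper().find(sequence.upper())
    if offset >= 0:
        start = upstream_start + offset
        return {"type": "promoter_sequence", "value": sequence,
                "start": start, "end": start + len(sequence) - 1}
    # No match: keep the data under the label from the column's first cell.
    return {"type": "sequence", "name": label, "value": sequence}
```

In this sketch a table is kept only when at least one column passes the ID-fraction test; the remaining labeled columns would then be read as attributes of the objects in that column.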
Experimental tables containing, for example, microarray data are clustered based on the Z-scores obtained by applying the Fisher r-to-Z transformation to the Pearson correlation coefficients that measure co-expression of the query gene with each gene in each experiment (data for each gene are averaged over multiple probes). The resulting co-expression matrix (Figure 5) highlights the experiments and genes co-expressed with the query gene (p53, in this case); the intensity of blue corresponds to the increasing value of the Pearson correlation coefficient, from above 0.75 up to 1.0. The word cloud shows the variety of conditions, tissues, and disease states found in the descriptions of these experiments. Visual inspection of the matrix allows the user to detect patterns of correlation across data sets and to spot significantly strong co-expression profiles. The most frequently correlated genes (those with the largest average Z-score over all experiments) and the most related experiments (those with the largest average Z-score over all genes) are located in the top-left corner of the matrix. For example, p53 was found to be overexpressed in cancers (breast, brain, bone marrow, squamous carcinoma, etc.), fibroblasts, astrocytes, and other cell types.
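The scoring step can be illustrated with a short sketch. The Fisher r-to-Z transformation is z = atanh(r) = 0.5 ln((1 + r) / (1 - r)); everything else here (function names, the input layout of one dictionary per experiment, clipping of r near ±1 to keep z finite) is an assumption made for illustration and not a description of the IntegromeDB code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant profile."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def fisher_z(r: float, eps: float = 1e-6) -> float:
    """Fisher r-to-Z transformation, with r clipped away from +/-1."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)


def rank_genes(experiments: List[Dict[str, Sequence[float]]],
               query: str) -> List[Tuple[str, float]]:
    """Average Z-score of each gene against the query, largest first.

    Each experiment maps a gene name to its (probe-averaged) expression profile.
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for experiment in experiments:
        if query not in experiment:
            continue
        for gene, profile in experiment.items():
            if gene == query:
                continue
            z = fisher_z(pearson(experiment[query], profile))
            totals[gene] = totals.get(gene, 0.0) + z
            counts[gene] = counts.get(gene, 0) + 1
    means = {gene: totals[gene] / counts[gene] for gene in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)
```

Sorting genes (and, symmetrically, experiments) by average Z-score is what would place the strongest co-expression signals in one corner of the matrix.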
Integration with other resources
One of the central aims of IntegromeDB is to maintain cross-connectivity and integration with other public resources in a user-friendly manner. We therefore provide programmatic access to our SQL database via the following routes. First, the integrated content of IntegromeDB is available via the IntegromeDB API, which is implemented in Java. Through an XML-RPC service, the API provides functions for programmatic access to most of the features available in the IntegromeDB web interface, such as retrieving aliases, promoter sequences, or transcriptional regulators for a set of genes. Example code for using the API and access to the XML-RPC service are available at http://integromedb.org/api.jsp ('API XML-RPC' tab). Second, external users or resources that want to visualize retrieved objects (the interaction network of a protein, the promoter region of a gene, microarray experiments for a set of genes, etc.) on their own web resource using the BiologicalNetworks integrated research environment [13] should use the API access described at http://integromedb.org/api.jsp ('API BiologicalNetworks' tab). Thus, IntegromeDB/BiologicalNetworks maintains mutual cross-referencing with other web resources, which is not limited to simple text-based HTML links but also enables partner websites to embed visualizations of BiologicalNetworks objects within their own web pages.
Most of the data (pathways, networks, microarray experiments, sequences, etc.) returned by the IntegromeDB web site as search results can be opened for detailed exploration and analysis in BiologicalNetworks. Figure 6 illustrates several examples of the interactions between IntegromeDB and BiologicalNetworks; in these scenarios, IntegromeDB is used for browsing large volumes of data, while BiologicalNetworks is used for exploring individual datasets in finer detail.
Partner websites or third-party software programs can choose to embed the entire IntegromeDB website into their own software. For example, an IntegromeDB 'plug-in' can be established at the BioGPS [6] portal, which provides 'plug-ins' through which users can connect any number of external websites into freely configurable screen layouts.
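As an illustration of what an XML-RPC request of this kind looks like on the wire, the sketch below builds and decodes a call with Python's standard xmlrpc.client module without contacting any server. The method name getAliases and its single list-of-gene-names argument are hypothetical placeholders; the actual method names, signatures, and service endpoint are those documented on the 'API XML-RPC' tab.

```python
import xmlrpc.client
from typing import List, Tuple

# Hypothetical method name, used only for illustration; consult the
# 'API XML-RPC' tab for the methods the service really exposes.
HYPOTHETICAL_METHOD = "getAliases"


def build_request(method: str, gene_names: List[str]) -> str:
    """Serialize an XML-RPC call that passes one list of gene names."""
    return xmlrpc.client.dumps((list(gene_names),), methodname=method)


def parse_request(payload: str) -> Tuple[str, tuple]:
    """Decode an XML-RPC call back into (method name, parameters)."""
    params, method = xmlrpc.client.loads(payload)
    return method, params


if __name__ == "__main__":
    payload = build_request(HYPOTHETICAL_METHOD, ["TP53", "MDM2"])
    print(payload)
    # A live call would instead go through
    # xmlrpc.client.ServerProxy(<service endpoint>), with the endpoint
    # and method taken from the API documentation.
```

Because XML-RPC is language-neutral, the same request can be issued from the Java API or from any other client library that speaks the protocol.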