Datgan, a reusable software system for facile interrogation and visualization of complex transcription profiling data

Background We introduce Glaucoma Discovery Platform (GDP), an online environment for facile visualization and interrogation of complex transcription profiling datasets for glaucoma. We also report the availability of Datgan, the suite of scripts that was developed to construct GDP. This reusable software system complements existing repositories such as NCBI GEO or EBI ArrayExpress as it allows the construction of searchable databases to maximize understanding of user-selected transcription profiling datasets. Description Datgan scripts were used to construct both the underlying data tables and the web interface that form GDP. GDP is populated using data from a mouse model of glaucoma. The data were generated using the DBA/2J strain, a widely used mouse model of glaucoma. The DBA/2J-Gpnmb+ strain provided a genetically matched control strain that does not develop glaucoma. We separately assessed both the retina and the optic nerve head, important tissues in glaucoma. We used hierarchical clustering to identify early molecular stages of glaucoma that could not be identified using morphological assessment of disease. GDP has two components. First, an interactive search and retrieve component provides the ability to assess gene(s) of interest in all identified stages of disease in both the retina and optic nerve head. The output is returned in graphical and tabular format with statistically significant differences highlighted for easy visual analysis. Second, a bulk download component allows lists of differentially expressed genes to be retrieved as a series of files compatible with Excel. To facilitate access to additional information available for genes of interest, GDP is linked to selected external resources including Mouse Genome Informatics and Online Mendelian Inheritance in Man (OMIM). Conclusion Datgan-constructed databases allow user-friendly access to datasets that involve temporally ordered stages of disease or developmental stages.
Datgan and GDP are available from http://glaucomadb.jax.org/glaucoma.


Background
Transcription profiling is a powerful tool for understanding biological processes and the roles they play in the pathogenesis of disease. For complex diseases, it is necessary to assess many different samples, resulting in very large amounts of data that are cumbersome to analyze and understand. Specific analyses often require significant computing power, time and analytical expertise. These needs hinder detailed interrogation of deposited datasets by many members of the scientific community. Transcription profiling datasets are deposited in central databases such as Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) [1] and ArrayExpress at the European Bioinformatics Institute (EBI) [2]. GEO and ArrayExpress are designed to efficiently store large datasets and to provide mechanisms that allow the scientific community to query, locate, review and download experiments of interest. Although useful, we have found that these platforms are not ideal for interrogating large and complex datasets. We became aware of the need for a more facile environment for analyzing complex datasets when studying glaucoma, a complex and asynchronous neurodegenerative disease. Our datasets consist of multiple, temporally ordered stages of disease, and existing databases do not allow simultaneous searching and retrieving of differentially expressed genes in multiple stages of disease compared to a no-disease control.
To meet this need, we developed Glaucoma Discovery Platform (GDP), an online environment to visualize and interrogate large and complex expression datasets in a user-friendly manner. GDP allows simultaneous querying of multiple genes, viewing results across multiple datasets and assessing expression differences for multiple probe sets for each gene. To our knowledge, no other resource provides this combination of user-friendly functionality. We deployed GDP to derive maximum benefit from an extensive transcription profiling study of glaucoma. The study focused on identifying early molecular stages that occur prior to significant damage (described in more detail below and [3]). More than 100 different samples from at least 50 mice were profiled and 70 pairwise group comparisons were made. GDP utilizes a web-based user interface to access the underlying gene expression profiling data, which is organized on a MySQL database server. GDP has greatly facilitated our understanding of these glaucoma data. Its user-friendly interface provides easy and instant data interrogation without a need for specialized knowledge or training. It thereby allows general access to our glaucoma datasets and is freely available for the benefit of the wider scientific community.
We also report the availability of Datgan (Welsh, meaning 'to express'), the suite of scripts that was used to construct both the underlying data tables and the web interface that form GDP. Datgan was written in the Python language. It is packaged as a reusable software system for constructing and populating discovery platforms to visualize user-determined sets of transcription profiling data. Datgan is organized to allow biologists with some experience running scripts (or a systems administrator) to establish their own personalized discovery platforms, and it can be readily adapted to incorporate transcription profiling data generated using RNA-seq.

Construction and content
Populating GDP with profiling data for glaucoma
Glaucoma is a complex, neurodegenerative disorder affecting 70 million people worldwide and is associated with the death of retinal ganglion cells (RGCs) and the associated degeneration of the optic nerve [4]. DBA/2J is a widely used mouse model of glaucoma that shows hallmarks of human glaucoma including age-related elevation of intraocular pressure (IOP), optic nerve excavation and regional patterns of RGC loss [5][6][7][8][9][10]. DBA/2J mice develop glaucoma as a result of a disease of the iris that leads to an elevation in IOP. The disease of the iris is caused by mutations in two genes, Gpnmb R150X and Tyrp1 b [5,6]. DBA/2J-Gpnmb+ mice have a functioning Gpnmb gene and serve as a genetically matched control strain that does not develop glaucoma [11]. An important insult occurs to RGC axons at the optic nerve head in DBA/2J glaucoma [7]. However, other compartments of the RGC, such as the soma [12] and synapses [13], are also likely to undergo early changes in glaucoma. The mechanisms involved in these early changes are not well understood.
The gene expression profiling study that is included in GDP investigated early changes in both the optic nerve head and retina for individual eyes from DBA/2J mice (described in detail elsewhere, [3]). Briefly, the optic nerve head and retina for each eye were separately profiled using Mouse 430 v2 arrays (Affymetrix). Fifty DBA/2J eyes and 10 DBA/2J-Gpnmb+ control eyes were studied. All data were processed and analyzed using MAANOVA [14]. DBA/2J eyes were initially grouped based on conventional morphological criteria including degree of optic nerve damage (dataset 1: four groups, Figure 1A). However, comparisons of these groups were not sensitive at identifying disease changes that precede morphological damage. Therefore, hierarchical clustering, a method widely used in cancer biology [7,11,15], was used to group eyes undergoing early stages of disease and allowed much more sensitive detection of early disease changes. Eyes were clustered into different stages using both the expression profiles for the optic nerve head (dataset 2: five stages, Figure 1B) and the retina (dataset 3: four stages, Figure 1C). To identify differentially expressed genes for all three datasets, all possible pairwise comparisons were performed. In total, more than 70 different pairwise comparisons were made and many thousands of differentially expressed genes identified. All raw data have been deposited in NCBI GEO (Accession number: GSE26299).
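As a minimal sketch of what "all possible pairwise comparisons" means for one dataset, the snippet below enumerates every unordered pair of groups. The group labels are placeholders, and including the control group in the enumeration is an assumption; the statistical testing itself (performed with MAANOVA in the study) is not reproduced here.

```python
from itertools import combinations

def pairwise_comparisons(groups):
    """Return every unordered pair of group labels."""
    return list(combinations(groups, 2))

# Hypothetical labels: a control group plus five molecularly defined stages.
onh_groups = ["control", "stage1", "stage2", "stage3", "stage4", "stage5"]
onh_pairs = pairwise_comparisons(onh_groups)

# n groups give n * (n - 1) / 2 comparisons.
print(len(onh_pairs))
```

Each pair would then be tested for differentially expressed genes, which is why the number of comparisons grows quickly as stages are added.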

Building GDP using Datgan scripts
GDP is constructed as a series of interconnected data tables (Figure 2). Data was loaded in four phases using the first major component of Datgan: First, raw normalized expression values (generated using R/Maanova) were loaded for each probe set for each sample. Second, analyzed data was loaded, including relative fold change and q value, from all pairwise comparisons for all 3 datasets (Morphological, Molecular ONH and Molecular retina). Third, using a previously constructed design file, relationships between the raw data for each sample and the sample groups were established. Finally, gene annotations to the probe sets were loaded from the Mouse Genome Database [16] using public reports available from the Mouse Genome Informatics (MGI) ftp site http://www.informatics.jax.org. This provides the ability not only to search by gene symbols, but also by their synonyms/aliases.

Figure 1 Datasets available in the first release of GDP. Each of the three datasets follows a progression through glaucoma. Both optic nerve head and retinal expression data are represented within each dataset [3]. (a) Morphological dataset. Glaucoma developing DBA/2J samples (white boxes) and strain-matched D2-Gpnmb+ no glaucoma control samples (grey box). Stages of glaucoma were determined morphologically by assessing optic nerve damage just behind the orbit (see Methods and [11,15,34]). (b) Molecular ONH dataset. Hierarchical clustering using the expression levels of a set of glaucoma specific genes grouped the optic nerve heads into 5 molecularly defined stages. Stages 1, 2 and 3 represent early states of glaucoma, which precede morphologically detectable glaucoma and so were not previously distinguishable using conventional analyses. Stages 4 and 5 contain eyes with moderate and severe optic nerve damage respectively. (c) Molecular retina dataset. A similar hierarchical clustering was performed using the retinal expression data. Four stages of disease were identified with stages R1 and R2 not previously detectable using morphological analysis. Stages R3 and R4 contain eyes with moderate and severe optic nerve damage respectively. Optic nerve heads and retinas were assessed from the same set of eyes.

Figure 2 [...] (table name: raw data), samples and probes are added. Second, the analyzed data (statistics, comparisons) and groupings (describing the datasets such as molecular ONH and tissue ONH) are loaded. Third, a previously generated design file allows the associations between samples and groups to be established (sampletogroups). Finally, MGI annotations (all symbols, synonyms, markers, human_orthologs and probeset_to_mgid) are loaded and the convenience look up tables (probecounts, representativeprobes and ave_qvalue) established. Within each table, required columns are indicated. VARCHAR indicates a string is required, and the number in brackets indicates the number of characters allowed in that string. Lines indicate connections between tables.
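The phased loading described above can be sketched with a few linked tables. This is an illustration only: it uses Python's built-in sqlite3 module in place of the MySQL server that GDP runs on, and the column names, the stage_groups table name and all inserted values are assumptions rather than the actual Datgan schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (id INTEGER PRIMARY KEY, name VARCHAR(50) NOT NULL);
CREATE TABLE probes (id INTEGER PRIMARY KEY, probeset VARCHAR(30) NOT NULL);
CREATE TABLE rawdata (sample_id INTEGER, probe_id INTEGER, value REAL);
CREATE TABLE stage_groups (id INTEGER PRIMARY KEY, name VARCHAR(50) NOT NULL);
CREATE TABLE sampletogroups (sample_id INTEGER, group_id INTEGER);
""")

# Phase 1: samples, probes and normalized expression values (made-up numbers).
conn.executemany("INSERT INTO samples VALUES (?, ?)", [(1, "eye_A"), (2, "eye_B")])
conn.execute("INSERT INTO probes VALUES (1, 'probe_set_A')")
conn.executemany("INSERT INTO rawdata VALUES (?, ?, ?)", [(1, 1, 4.2), (2, 1, 6.8)])

# Phase 2 (groupings only) and phase 3 (design file: which sample is in which group).
conn.executemany("INSERT INTO stage_groups VALUES (?, ?)", [(1, "control"), (2, "stage3")])
conn.executemany("INSERT INTO sampletogroups VALUES (?, ?)", [(1, 1), (2, 2)])

# Join raw values to their groups, as a results page would need to.
rows = conn.execute("""
    SELECT g.name, s.name, r.value
    FROM rawdata r
    JOIN samples s ON s.id = r.sample_id
    JOIN sampletogroups sg ON sg.sample_id = s.id
    JOIN stage_groups g ON g.id = sg.group_id
    ORDER BY g.name
""").fetchall()
print(rows)
```

Keeping the sample-to-group mapping in its own table is what would let the same raw values be regrouped under different datasets (morphological or molecular) without reloading them.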
The web interface is implemented using the Ruby on Rails web application framework http://rubyonrails.org in combination with the second major component of Datgan, a series of custom and public JavaScript libraries. The interface leverages AJAX technology (asynchronous JavaScript and XML) to allow dynamic regeneration of plots in the same page view, while maintaining the main query panel. The database application infrastructure was implemented in a generic manner, allowing for its reuse for other profiling datasets and making it easier to load additional experimental results into GDP. The web-based interactive search and retrieve component provides the ability to assess multiple genes of interest in temporally and/or spatially defined developmental or disease stages. The output is returned in graphical and tabular format with statistically significant differences highlighted for easy visual analysis. Data for all probe sets for a given gene can be accessed as well as all data for individual samples within groups. Additionally, a bulk download component allows lists of differentially expressed genes to be retrieved as a series of tab delimited files. To facilitate access to additional functional information for a given gene, links are provided to external resources.

Visualizing and interrogating profiling data with GDP
The web-based interface is divided into four main sections: (a) homepage/new search, (b) results, (c) probe set details and (d) expression values (Figure 3). A number of links are provided to selected external resources such as Mouse Genome Informatics [17], EntrezGene and the Online Mendelian Inheritance in Man (OMIM) databases [18] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [19] that allow users to access genetic, phenotypic and functional information for genes of interest (see External Resources below). The platform is also well supported by help pages.

Home page/new search (Figure 3a)
The home page provides an overview of the database and permanent headers with convenient links to the bulk download page and the detailed user's quick guide. The bulk downloads page allows all differentially expressed genes and associated information for individual comparisons to be downloaded in Excel-friendly format. The quick search feature allows a single gene to be searched in all datasets. Information on downloading Datgan is also provided.
There are several ways to interrogate the data using the detailed search tool. First, if a user has a set of genes that they are interested in assessing, the official gene symbols (or MGI-recognized aliases) can be entered or pasted into the appropriate text box (Figure 4). Alternatively, there is a wild card (*) capability where, for instance, the results for all members of the tumor necrosis factor (Tnf) superfamily can be retrieved by searching for "Tnf*" (described in detail below and Figure 4). By default, datasets and tissues "Molecular ONH: tissue ONH" and "Molecular retina: tissue retina" are selected as these are the most sensitive at identifying differentially expressed genes in the optic nerve head and retina respectively. However, the user is also able to select the dataset(s) of interest (see Figure 1), the tissue(s) of interest (optic nerve head or retina) and the reference group (e.g. the D2-Gpnmb+ control group). Including additional datasets and tissues does slow the search time. It is possible to restrict the results by fold change and/or significance value (q value). The user can select whether the results are ordered by significance (lowest average q value across all groups) or by gene symbol (alphanumeric). The q value is a measure of the false discovery rate and gives an indication of the significance of the fold change. It is a standard statistic for microarray analyses [20]. The lower the q value, the more significant a fold change is considered to be. In our study, genes are considered differentially expressed, with respect to the reference, if the q value is less than 0.05 (roughly equivalent to a false discovery rate of 5%). Finally, fold changes can be reported as either relative fold change (compared to reference), or as log2 fold change (compared to reference).
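The wild card search and q value restriction described above could look roughly like the following. This is a hedged sketch, not the GDP implementation: the record layout, the function names and the case-insensitive matching are assumptions, and the q values are illustrative only.

```python
from fnmatch import fnmatchcase

def matches(pattern, record):
    """True if the pattern matches the symbol or any alias, ignoring case."""
    names = [record["symbol"]] + record["aliases"]
    return any(fnmatchcase(name.lower(), pattern.lower()) for name in names)

def search(records, pattern, max_q=None):
    """Return matching records, most significant (lowest q) first."""
    hits = [r for r in records if matches(pattern, r)]
    if max_q is not None:
        hits = [r for r in hits if r["q"] < max_q]
    return sorted(hits, key=lambda r: r["q"])

# Hypothetical records; GeneX is found only through its alias.
records = [
    {"symbol": "Tnf", "aliases": [], "q": 0.0413},
    {"symbol": "Tnfrsf1a", "aliases": ["Tnfr1"], "q": 0.002},
    {"symbol": "GeneX", "aliases": ["Tnf-like1"], "q": 0.30},
    {"symbol": "GeneY", "aliases": [], "q": 0.001},
]

all_hits = [r["symbol"] for r in search(records, "TNF*")]
significant = [r["symbol"] for r in search(records, "TNF*", max_q=0.05)]
print(all_hits, significant)
```

Matching against aliases as well as official symbols mirrors the behaviour the text describes for MGI-recognized aliases; a fold change restriction could be added as a second filter in the same way.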
The results summary page (Figure 3b)
Results are returned in both graphical and tabular format. Any gene names that were not recognized from the search page are listed and links to MGI are given (for clarification of official gene symbols). A 'summary of results' table is shown indicating the number of genes found for each dataset/tissue selected on the search page. Below the summary of results are separate tables and associated graphs for each dataset/tissue searched. Each table contains the gene names searched, the official gene symbol (or description), the representative probe set, and the fold change with associated q value for each stage of glaucoma, compared to the chosen reference group. For ease of identification, significantly differentially expressed values are shown in white on a red background. Genes are ranked in the table based on the chosen option in the detailed search tool (q value or gene symbol). The graph provides a visual display of the fold changes across all stages of disease for the genes of interest, up to a maximum of ten. For searches containing more than 10 genes, the first 10 genes in the table are displayed in the graph. Check boxes are available to the left of each gene to allow the user to select up to 10 genes of choice to view in the graph. Results from the table can be exported as comma separated values (csv, Excel-friendly format). Links to useful publicly available databases are also provided (see External Resources below).
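The rule for which genes appear in the graph can be sketched as follows, assuming the default ordering by significance (lowest average q value across stages). The row layout and function names are hypothetical, and the gene names and q values are generated placeholders.

```python
from statistics import mean

def rank_by_average_q(rows):
    """Order table rows by their mean q value across stages (lowest first)."""
    return sorted(rows, key=lambda row: mean(row["q_values"]))

def genes_for_graph(rows, selected=None, limit=10):
    """Symbols to plot: the user's ticked genes if any, else the top of the table."""
    ranked = rank_by_average_q(rows)
    if selected:
        wanted = set(selected)
        ranked = [row for row in ranked if row["symbol"] in wanted]
    return [row["symbol"] for row in ranked[:limit]]

# Twelve placeholder genes whose q values rise with their index.
rows = [{"symbol": f"Gene{i}", "q_values": [0.01 * i, 0.02 * i]} for i in range(1, 13)]

default_plot = genes_for_graph(rows)
custom_plot = genes_for_graph(rows, selected=["Gene12", "Gene3"])
print(default_plot, custom_plot)
```

With twelve rows, the default plot holds only the ten most significant genes, while ticking check boxes replaces that default with the chosen genes.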

Probe set details page (Figure 3c)
Each gene on the Affymetrix 430 v2 array is interrogated by a probe set of eleven 25-mer probes and a gene can have multiple probe sets [21]. Each probe set corresponds to a particular region of a gene, and finding that multiple probe sets for the same gene behave similarly can add confidence to a result. Alternatively, for those probe sets corresponding to alternatively spliced exons/transcripts, multiple probe sets can give insight into the behavior of splice variants [22]. Unfortunately, some probe sets were designed to early versions of gene sequences (prior to accurate genome sequence). As a result, some probes in a probe set are not identical to the gene sequence, and these probe sets will not accurately reflect the level of transcript for these genes. For each probe set, a link is provided to the Ensembl website where mapping information for each probe set is provided. The values for the representative probe set for each gene are displayed in the results summary page. It is possible to view details for all probe sets for a given gene, to view probe sets for selected genes or for all genes in the table. The probe set detail page provides a table/graph for each gene.
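One plausible way to pick the representative probe set for a gene is sketched below. The article describes the representative as the most significant probe set and lists an ave_qvalue lookup table, so the sketch assumes "most significant" means the lowest average q value across stages; that criterion, the probe set names and the q values are all assumptions made for illustration.

```python
from statistics import mean

def representative_probe_set(probe_sets):
    """Return the probe set ID with the lowest mean q value across stages."""
    return min(probe_sets, key=lambda ps: mean(probe_sets[ps]))

# Hypothetical gene with three probe sets and made-up q values per stage.
gene_probe_sets = {
    "probe_set_A": [0.20, 0.03, 0.001],
    "probe_set_B": [0.80, 0.75, 0.60],
    "probe_set_C": [0.90, 0.85, 0.70],
}

representative = representative_probe_set(gene_probe_sets)
print(representative)
```

Whatever criterion is used, the other probe sets remain worth inspecting on the probe set details page, since they may report on splice variants or suffer from design errors.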

Expression values page (Figure 3d)
Values provided in the summary pages are relative fold changes or log2 fold changes compared to a selected reference. Each stage of disease contains biological replicates, and the fold change for a given probe set (for a given gene) is calculated as the average of the normalized expression value of all replicates for a defined stage of disease, relative to the reference group. The normalized expression value reflects the relative abundance of the gene in the tissue interrogated (either retina or optic nerve head) with respect to all other genes. Those transcripts with the highest normalized expression value are most abundant. Conversely, those with the lowest normalized expression value are least abundant. An expression level of approximately 4 or less may be considered to represent a gene that is likely not expressed in the assessed tissue (i.e. is close to background levels).
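The fold change calculation can be illustrated with a short worked example. It assumes the normalized expression values are on a log2 scale, which the text does not state explicitly; under that assumption the log2 fold change is a difference of group means and the relative fold change is 2 raised to that difference. Function names and replicate values are invented for illustration.

```python
from statistics import mean

BACKGROUND_LEVEL = 4.0  # approximate background level quoted in the text

def log2_fold_change(stage_values, reference_values):
    """Difference of group means, assuming log2-scale expression values."""
    return mean(stage_values) - mean(reference_values)

def relative_fold_change(stage_values, reference_values):
    """Fold change on the linear scale, derived from the log2 difference."""
    return 2 ** log2_fold_change(stage_values, reference_values)

def likely_expressed(values):
    """True if the group mean is above the approximate background level."""
    return mean(values) > BACKGROUND_LEVEL

# Made-up replicate values for one probe set.
stage = [6.0, 7.0, 6.5]
reference = [5.0, 5.5, 4.5]

log2_fc = log2_fold_change(stage, reference)
fold = relative_fold_change(stage, reference)
print(log2_fc, round(fold, 2), likely_expressed(reference))
```

Here the stage mean is 6.5 and the reference mean is 5.0, so the log2 fold change is 1.5 and the relative fold change is about 2.83.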
The relative abundance of transcripts corresponding to particular probe sets is important information. For instance, a small fold change in some lowly expressed gene may have greater biological importance than a small fold change in some abundant gene. Knowing the variability in expression levels for different genes in individual eyes within groups also is important. Within any stage of disease (determined either morphologically or molecularly), transcripts within individual biological replicates may behave differently. Genes with low variability within groups may be better targets for intervention strategies than variable genes. The expression values corresponding to each probe set for individual samples (replicates) can be accessed from the probe details page. The expression levels are displayed as a histogram. The average (± 1 standard deviation) for the reference group is indicated.
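The reference band shown on the expression values page (average ± 1 standard deviation) can be computed as in this sketch. The use of the sample standard deviation, the function names and the replicate values are assumptions for illustration.

```python
from statistics import mean, stdev

def reference_band(reference_values):
    """Return (mean - SD, mean, mean + SD) for the reference group."""
    centre = mean(reference_values)
    spread = stdev(reference_values)  # sample standard deviation
    return centre - spread, centre, centre + spread

def outside_band(sample_values, band):
    """Samples whose expression value falls outside mean ± 1 SD."""
    low, _, high = band
    return {name: v for name, v in sample_values.items() if v < low or v > high}

# Made-up values: three reference replicates and three eyes from one stage.
band = reference_band([5.0, 5.5, 4.5])
flagged = outside_band({"eye_1": 5.2, "eye_2": 6.4, "eye_3": 4.1}, band)
print(band, sorted(flagged))
```

Comparing individual replicates against this band is one simple way to see how variable a gene is within a stage, which the text notes is relevant when choosing targets for intervention.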

External Resources
Selected external resources can be accessed directly from the results tables (Table 1). Resources were selected to allow users the maximum access to current information for genes of interest. The links to external resources provided by Datgan can be easily adapted for other user-specific databases. A major resource for mouse-based expression is Mouse Genome Informatics (MGI). This resource provides the research community with information on the genetics, genomics and biology of mice [17,23]. EntrezGene and OMIM are databases that form part of the Entrez system at the National Center for Biotechnology Information (NCBI) [18]. EntrezGene provides gene-relevant information such as transcript/protein sequences and links to relevant publications and genome browsers. OMIM provides a simplified disease-oriented description for a given gene including mutations that have been shown to cause diseases in humans. Finally, there are links to the Gene Ontology (GO) database and the Kyoto Encyclopedia of Genes and Genomes (KEGG), two databases that provide functional descriptions of genes. GO provides a controlled vocabulary of terms for biological process, cellular component and molecular function and is accessed through MGI [17]. KEGG uses known functional information to construct biologically relevant pathways [19]. Given the uniformity of gene symbols between the external resources and GDP, it is possible to identify groups of genes of interest (such as genes in a given KEGG pathway or genes with the same GO term) in the appropriate database and paste these genes into the search tool in GDP.

Utility
In this section, we describe the workflow for extracting the data relevant to members of the TNF superfamily. This serves both to reinforce key features of the database described above and as an example that could be followed to interrogate any gene (or group of genes) of interest.
Assessing members of the tumor necrosis factor (TNF) superfamily
TNF (formerly TNFα) has been suggested to be important in retinal ganglion cell loss during glaucoma [24,25]. Therefore, we assessed all members of the TNF superfamily in both the molecular ONH dataset (Figure 1B, most sensitive dataset of early disease changes in the optic nerve head), and the molecular retina dataset (Figure 1C, most sensitive dataset of early disease changes in the retina).
Step 1: Performing detailed search (Figure 4)
First, we determined the best search option. In this case, a wild card search using 'TNF*' is the best option because it retrieves all TNF superfamily members without requiring a priori knowledge of which genes are present in the superfamily. Second, we selected the datasets and tissues of interest: molecular ONH dataset, tissue 'ONH' and molecular retina dataset, tissue 'retina'. Finally, we selected the reference group; in this case, D2-Gpnmb+ control eyes. In this example, we did not limit our search based on fold change or q value, and chose the default options for the layout of the results page (display expression differences as fold change rather than log2 of fold change, and order genes based on significance of gene expression differences (q value) rather than gene symbol).
Step 2: Visualizing the results (Figure 5)
The results were returned below the original search box, allowing straightforward modification of search options if necessary. First, the summary table indicates that 50 genes met the criteria of the search for both the molecular ONH and molecular retina datasets (Figure 5A). These 50 genes have 'TNF' in either their official gene symbol or in any MGI-approved alias(es). Below the summary of results table, detailed results for each dataset are displayed. In this example, we assess the results for the molecular ONH dataset, tissue ONH. First, the top 10 most significant genes are visualized in the graph (Figure 5B). The results for all 50 genes are shown in the table, the top 10 most significant of which are shown in Figure 5C. Of the 50 genes, 34 are differentially expressed in at least one stage in the molecular ONH dataset. To download all expression data in the table, use the 'Download as comma separated values (CSV)' option below the table. Expression values (and associated q values) for the 50 genes across all stages of disease are exported in Excel-friendly format (Table 2).
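A CSV export of a results table, with one fold change column and one q value column per stage, might be produced along these lines. The column names and row layout are hypothetical and the q value is invented; only the general idea of a comma separated, Excel-friendly export comes from the text.

```python
import csv
import io

def results_to_csv(rows, stages):
    """Serialize result rows to CSV text with fold and q columns per stage."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    header = ["symbol", "probe_set"]
    for stage in stages:
        header += [f"{stage}_fold", f"{stage}_q"]
    writer.writerow(header)
    for row in rows:
        line = [row["symbol"], row["probe_set"]]
        for stage in stages:
            fold, q = row["stages"][stage]
            line += [fold, q]
        writer.writerow(line)
    return buffer.getvalue()

# One illustrative row; the q value is made up.
rows = [{"symbol": "Tnfrsf9", "probe_set": "1460469_at",
         "stages": {"stage3": (2.9, 0.001)}}]
csv_text = results_to_csv(rows, ["stage3"])
print(csv_text)
```

Writing through the csv module rather than joining strings by hand keeps fields correctly quoted if a gene description ever contains a comma.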
Step 3: Interpreting the results (Figure 6)
Tnf shows only a modest increase in expression in stage 4 (1.2 fold, q value = 0.0413) in the molecular ONH dataset. The five most significant genes are Tnf alpha-induced protein 8-like 2 (Tnfaip8l2), Tnf superfamily receptor 1a (Tnfrsf1a, formerly Tnfr1) and 1b (Tnfrsf1b, formerly Tnfr2), Tnf alpha-induced protein 2 (Tnfaip2) and Tnf receptor superfamily member 9 (Tnfrsf9) (Figure 5C, Table 2). Tnfrsf9 has the highest fold change of all TNF-related genes in the molecular ONH dataset, with a 2.9 fold expression difference compared to D2-Gpnmb+ controls in stage 3. Of direct relevance to DBA/2J glaucoma, Tnfrsf9 is involved in the proliferation of monocytes that are precursors of microglia and macrophages [26][27][28]. This information was retrieved from either OMIM or Entrez Gene. Microglia/macrophages have been shown to increase in the optic nerve head and retina early in glaucoma [3,29].
As described above, many genes are represented by multiple probe sets on the Affymetrix 430 v2 array. Tnfrsf9 is represented on the array by three probe sets (1460469_at, 1428034_a_at and 1421481_at) with '1460469_at' being shown in the results table as the most significant probe set of the three. Given the importance of all probe sets, full information can be accessed through the probe set details page (Figures 3C and 6). Only 1460469_at is differentially expressed in the molecular ONH dataset, tissue ONH; the other two probe sets show no significant difference compared to the D2-Gpnmb+ control group (Figure 6A). To view the normalized expression levels of the probe sets in the optic nerve head it is necessary to access the expression values page (Figure 6B). The normalized raw intensity values for individual eyes for the differentially expressed (DE) probe set 1460469_at range between 4 (considered background) and 7. The raw normalized values for the other, non-DE probe sets do not increase significantly above background.
The differing expression levels for the three Tnfrsf9 probe sets may result from detection of alternative splice forms by the different probe sets or from errors in the original probe set designs. In this case, the apparent differences are likely due to probe sequence errors. Detailed data for all probes within each probe set are available in Ensembl and can be accessed directly from GDP. Only for probe set 1460469_at do all eleven probes match identically within the Tnfrsf9 gene (Figure 6C). Ensembl does not report an alignment for any probes in the 1421481_at probe set. For the remaining probe set, 1428034_a_at, three of the eleven probes have mismatches, and another three probes include intronic sequences that would not be included in the messenger RNA. This explains the lack of expression for this probe set, as an average of the expression level of all probes is taken. Given this detailed probe information, it is clear that the expression level of Tnfrsf9 is only truly represented by the 1460469_at probe set.
In summary, using GDP, we have readily performed a detailed search of all TNF superfamily members and readily identified all genes that change significantly compared to controls in two out of three different datasets. In particular, the expression of Tnfrsf9 increases in the optic nerve head at very early stages of glaucoma and may be related to an accumulation of microglia. These results are of great interest to the glaucoma community and are immediately and freely available. They highlight major advantages of accessing these gene expression datasets through GDP, and the utility of Datgan for providing similar environments for many other datasets.

Discussion
Here, we describe GDP, constructed using the reusable software system Datgan, for visualizing and interrogating complex gene expression profiling data. Datgan is freely available for download (via http://www.simonjohnlab.org) and can be used to construct platforms for viewing and interrogating any set of microarray or RNA-seq based transcription profiling datasets. Datgan will be particularly useful for complex diseases that have variable onset and progression. As exemplified by GDP, platforms developed using Datgan allow general access and improved understanding of complex datasets.
Use of GDP is proving very important for understanding our glaucoma datasets. For publication in most journals, all profiling datasets are required to be deposited in either GEO (NCBI) or ArrayExpress (EBI). All data and associated files for our study were deposited in NCBI GEO (Accession number GSE26299). Data deposition in these archives is necessary to ensure access to all the raw data for interested parties. However, these archives were not designed to allow detailed interrogation of multiple genes or biological pathways across multiple datasets. GDP includes this functionality and massively facilitates interrogation of our glaucoma datasets by us, and the scientific community, ultimately allowing the most benefit to be derived from the data. Other gene expression profiling studies relevant to glaucoma, which have been or will be carried out, also could be incorporated into GDP. Many investigators, irrespective of computational expertise, can perform detailed 'on the fly' searches quickly and easily.
Datgan is specifically designed to allow additional datasets to be easily incorporated after the initial online environment has been constructed. For glaucoma, other gene expression profiling studies have been carried out that could be included in GDP. These include studies that have profiled additional animal models of glaucoma [30][31][32]. In addition, comparing cell- and tissue-specific transcription profiling studies (e.g. [33]) in non-diseased settings would enable hypotheses to be made about which cells are changing early in glaucoma. Finally, useful predictions could be made by comparing studies that have profiled other relevant diseases such as other neurodegenerative disorders or diseases affecting retinal ganglion cells.

Conclusion
Datgan is a powerful software package for developing user-friendly platforms to visualize transcription profiling data. Implemented as GDP, it is already proving an essential tool for interrogating the molecular pathogenesis of glaucoma.

Availability and Requirements
The database can be accessed via http://glaucomadb.jax.org/glaucoma.
Web browsers: Tested on the major web browsers including the latest versions of Firefox, Safari and Chrome. It has also been tested on smart phones and tablet computers.
Operating system(s): Tested on SUSE Linux Enterprise Server 11 & OpenSUSE 11.0, and should work on other versions of Linux and Mac OS X.
Tools required to establish Datgan-derived databases include: Ruby 1.8.7, Rails 2.3.5, Python 2.6, Percona-Server-5.5.13 (should work with most MySQL), apache2, Mongrel Web Server 1.1.5.
Source code details can be obtained from the Datgan tab at the top of the home page.

Authors' contributions
GRH, RTL and SWMJ designed the database and contributed to manuscript preparation. BLK contributed to the design and implemented an initial prototype version of the database. DOW wrote the Datgan suites, constructed and populated Glaucoma Discovery Platform. All authors read and approved the final manuscript.