Researchers have long tried to compile and compare microarray data from different studies; ArrayExpress, TranscriptomeBrowser, Genevestigator and COXPRESdb are some examples. Variations in experimental conditions across studies have hindered such efforts. Recommendations for systematic reporting of experimental conditions [24, 25] and new methods for cross-platform comparison of microarray results [9, 26] have improved our ability to use the available microarray data. However, many microarray data-sets remain non-comparable for several reasons. These include non-compliance with MIAME and complications in the statistical methods of data processing, particularly when the studies have used different microarray platforms. There are also a large number of reports in which raw data are not available and only selected genes with specific expression patterns are listed. Such gene-lists lack the basic information needed to compare expression levels.
The method reported here allows comparison of gene-sets irrespective of associated information such as intensity values, statistics, platform and probes. This simplification necessarily means the loss of other important information, such as relative up/down regulation and levels of expression. Many 'meta-analysis' approaches have considered such details of expression data [28, 29], but they are applicable to raw data only. It is also difficult to find consensus by the traditional methods. For example, if study A finds gene xyz to be expressed at a much higher level in condition 1 (C1) than in condition 2 (C2), study B finds xyz to be expressed only marginally higher in C1 than in C2, and study C finds it to be expressed almost equally in C1 and C2, then all three disagree with each other as far as the 'relative levels of expression' are concerned. The novelty of our approach is that, by considering only the 'transcribed' or 'dormant' status, the data can be compiled across different microarray platforms, and we can state that gene xyz is 'present' in C1. Thus, comparing expression status in a binary form allows the use of most of the available microarray data, including simple gene-lists.
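The binary comparison described above can be illustrated with a minimal sketch. It assumes each study is reduced to sets of genes called 'present' or 'absent' in a condition. The record layout, the function names `tally_calls` and `consensus`, and the agreement fraction are hypothetical choices made for illustration; in particular, the agreement fraction is a simple stand-in and not the reliability score used in the database.

```python
from collections import defaultdict

# Hypothetical input: each study reports, for one condition, the genes it
# found 'transcribed' (present) and 'dormant' (absent).
reports = [
    {"study": "A", "condition": "C1", "present": {"xyz", "abc"}, "absent": {"def"}},
    {"study": "B", "condition": "C1", "present": {"xyz"}, "absent": {"abc", "def"}},
    {"study": "C", "condition": "C1", "present": {"xyz", "abc"}, "absent": set()},
]

def tally_calls(reports):
    """Count present/absent calls per (gene, condition) across studies."""
    counts = defaultdict(lambda: {"present": 0, "absent": 0})
    for report in reports:
        for status in ("present", "absent"):
            for gene in report[status]:
                counts[(gene, report["condition"])][status] += 1
    return dict(counts)

def consensus(counts):
    """Return (majority status, agreement fraction) for each (gene, condition)."""
    result = {}
    for key, c in counts.items():
        total = c["present"] + c["absent"]
        if c["present"] == c["absent"]:
            status = "undetermined"
        else:
            status = "present" if c["present"] > c["absent"] else "absent"
        # Illustrative agreement measure only (assumption, not the paper's score).
        agreement = max(c["present"], c["absent"]) / total
        result[key] = (status, agreement)
    return result

calls = consensus(tally_calls(reports))
for (gene, condition), (status, agreement) in sorted(calls.items()):
    print(gene, condition, status, round(agreement, 2))
```

In this toy input all three studies list xyz as present in C1, so the calls agree even though the studies might have disagreed on the relative level of expression.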
In fact, obtaining a list of genes expressed or not expressed in specific conditions, and deriving a consensus across studies, can provide an important alternative route to biomarker identification. Generally, genes that are up-regulated or down-regulated, and those whose function/ontology is well established, are considered as possible markers. With the new approach, one can compare the list of genes that have a high reliability score for the 'expressed' status under a normal condition with a similar set of genes absent in an abnormal condition. The union list derived after such a comparison would have a unique value as a set of potential biomarkers. Similarly, genes that are more likely to be dormant in normal conditions but expressed in abnormal tissues would also be important. We are currently trying to use this approach to identify genes that have a strong correlation with azoospermia.
The current database can also be used as a single source for locating most of the mass-scale gene expression data, as it directs the user to the original data in all cases. Those interested in the original microarray data can retrieve them and perform their own comparison and analysis.
Approaches similar to the one used in this study have been used earlier for other purposes: Smith et al. applied such a method for meta-analysis of breast cancer microarray data, and Harsha et al. used one for identifying potential pancreatic-cancer biomarkers. Very recently, Culhane et al. also reported a very similar approach to create a gene expression database, GeneSigDB, which considers gene lists from tables or figures embedded in publications or included as supplementary material on the journal's or the author's website. However, GeneSigDB does not use raw data, cover testis-related conditions, or derive a consensus across data-sets from different studies. New methods such as Gene Set Enrichment Analysis (GSEA), Parametric Analysis of Gene set Enrichment (PAGE) and Generally Applicable Gene set Enrichment for pathway analysis (GAGE) process data across multiple data-sets in such a way that the specific details of data processing within each study are not required to bring out meaningful information from the microarray experiments. However, the objective of GSEA and PAGE was to gain insights into biological mechanisms by clustering genes across studies, while our focus was on deriving the consensus information along with a reliability score.
Compilation of gene-sets corresponding to comparable conditions and locations, and deriving a reliable ESLC for each gene, can be useful in various ways. One can cluster genes based on their expression pattern in different ESLCs. Such clustering can help to identify genes that have a strong association with specific conditions and/or locations. For example, genes that are consistently expressed in normal testis but absent in infertility conditions might be of significance for researchers. The higher the reliability-score of a gene, the greater its chance of being a biomarker and/or a candidate for research in diagnostics, prognostics and therapeutics. Moreover, tissue-specific databases such as MGEx-Tdb also have the potential to assist in exploring the variation or conservation of gene expression across different species in multiple tissues.
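As a rough illustration of grouping genes by expression pattern, the sketch below assumes that each gene has a binary status and a reliability score for each condition. The record layout, the 0 to 1 score scale, the threshold `min_score`, and the function names `group_by_pattern` and `rank_candidates` are all hypothetical and do not reflect the database's actual schema or scoring.

```python
from collections import defaultdict

# Hypothetical records: (gene, condition) -> (status, reliability score).
# The 0-1 score scale is an assumption made for this example.
records = {
    ("g1", "normal"): ("present", 0.9),
    ("g1", "infertile"): ("absent", 0.8),
    ("g2", "normal"): ("present", 0.7),
    ("g2", "infertile"): ("present", 0.6),
    ("g3", "normal"): ("present", 0.95),
    ("g3", "infertile"): ("absent", 0.4),
}
conditions = ["normal", "infertile"]
MISSING = ("unknown", 0.0)  # used when a gene has no record for a condition

def group_by_pattern(records, conditions):
    """Group genes that share the same status pattern across conditions."""
    groups = defaultdict(list)
    for gene in sorted({gene for gene, _ in records}):
        pattern = tuple(records.get((gene, c), MISSING)[0] for c in conditions)
        groups[pattern].append(gene)
    return dict(groups)

def rank_candidates(records, genes, conditions, min_score=0.5):
    """Keep genes whose weakest score meets min_score, best first."""
    scored = []
    for gene in genes:
        weakest = min(records.get((gene, c), MISSING)[1] for c in conditions)
        if weakest >= min_score:
            scored.append((gene, weakest))
    return sorted(scored, key=lambda item: item[1], reverse=True)

groups = group_by_pattern(records, conditions)
# Genes present in normal testis but absent in the infertility condition.
candidates = rank_candidates(records, groups[("present", "absent")], conditions)
print(groups)
print(candidates)
```

Ranking by the weakest score across conditions is one conservative choice: a gene is kept only if both its 'present' call and its 'absent' call are well supported.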
The need for systematically compiling gene-expression data in one place is obvious from previous efforts. In fact, TisGeD, a new database, was reported during the last stages of the writing of this manuscript. This database is a compilation of data for most tissues and species, mainly from existing databases. However, it does not seem to make the best use of all available information, at least for the testis tissue. On the other hand, an effort like the current one may not always be practical. The biocuration process consumed a significant amount of time (about 3 years) and is ultimately limited to only one tissue. It does, however, provide more reliable information. A compromise approach is perhaps possible. Of about 222 gene-sets in the database that were retrieved from the literature, 156 had fewer than 500 genes per set. By omitting such smaller gene-sets, one might save time, albeit with some loss of information.
Even though this study has compared MGEx-Tdb with a few well-established databases, the purpose is of course not to undermine the value of these pre-existing resources. Such databases have their own specific advantages and, in many cases, a wider variety of applications. The objective of comparing the different databases was to validate the novel approach.
While MGEx-Tdb can facilitate unique applications in gene expression studies in the context of mammalian testis, it has a few limitations, and there is scope for further improvement in several aspects. For example, incorporating the level of expression along with the basic expression status might be possible in many cases. The method of calculating 'reliability-scores' for the expression patterns can be improved by considering details reported along with the gene-sets, such as sample size and validation of the microarray data. Factors such as the unavailability of complete data in many cases, the diversity of analytical methods used, and the lack of experimental details in many of the published gene expression studies have been major hurdles to compiling the parameters mentioned above. Nevertheless, we are already attempting to make the possible improvements. We are also trying to include data from other types of mass-scale studies. In the current database, we have used non-microarray data in some cases only, particularly when a list of genes was reported in the manuscript or in the supplementary notes. The data in the repositories could not be included due to complications in converting the unique identifiers (e.g., SAGE tags) to standard gene names or IDs. We shall complete these tasks in a revised version of the database. Moreover, efforts are under way to include data from more mammalian species for the testis tissue, to further improve the query features of this database, and even to develop a few other tissue-specific databases.
Most of the existing data permit only predictions, rather than establishing a final expression status for different genes. This can be explained as follows: a) Far more data are available for the expression of genes at the RNA level than at the protein level, and transcription does not guarantee continued translation into proteins. Thus, the mRNA data can only be used to suggest or predict the expression of genes into final proteins. b) The expression status of some genes can vary across samples, even within a study. The genes that behave the same way across samples and studies are more likely to have a stronger association with the physiological condition of the tissue/cell type of interest. This means that the data can only be used to predict the expression possibilities. It is therefore useful to 'predict' the expression patterns of genes using a reliability-score such as the one reported here.