To create the co-expression map, normalized microarray data obtained from the GEO database was used . GEO files GSE1 to GSE18120 were downloaded containing 16,916 datasets in total. From these, 3,850 Mus musculus datasets containing 64,849 microarrays and the corresponding annotation files were extracted. As mouse experiments are generally better controlled than human studies and there is less variation caused by genotypic and environmental factors in the mouse, Mus musculus data was chosen over Homo sapiens to reduce noise. Using mouse data also allows more datasets, coming from a more diverse set of experiments . This also potentially allows for the investigation of target genes in the different mouse models of aging and complex diseases.
All datasets containing annotation files that did not include gene symbols for at least 90% of the probes present in the data were removed. All microarray datasets containing values higher than 25 were log transformed, under the assumption this data was non-log-transformed data. To remove poor signal, low quality or nonsense values up to 1099 datasets containing no values above 2log(5,000) or one or more values over 2log(20,000,000) were removed. Datasets with no reference to any annotation file were removed. Even though it is not feasible for us to perform a comprehensive evaluation of the quality of the data in each experiment, a meta-analysis is in its essence a technique to eliminate poor quality data and hence we are confident that there are no systematic errors in our analysis that artificially originate false results. After these steps 1,678 datasets containing 8,417 different conditions and 21,744 individual samples remained. The probe IDs were converted into gene symbols. If multiple probe IDs mapped to the same gene symbol, they were averaged. Within each dataset the experimental conditions were manually determined. Microarrays from individuals under the same conditions were averaged; in other words, duplicates were averaged. Missing values were ignored as long as there were duplicates; if there were no duplicates this gene symbol was removed.
Constructing the co-expression map
To create GeneFriends we first constructed a genome-wide co-expression map, using normalized Mus musculus microarray data from the GEO database. This describes which genes are related based on how often they are co-expressed. In total, 1,678 mouse datasets containing 8,417 different conditions and 21,744 individual samples met our data selection criteria. To construct our expression map the different conditions within each dataset were compared to each other. Since different datasets contain different probes mapping to different gene symbols a selection was made. Only those gene symbols that are present in gene platform file GPL1261 (Affymetrix GeneChip Mouse Genome 430 2.0 Array) were used. This platform contains 20,676 gene symbols and is the most common platform used for microarrays amongst those included in this work. All of these gene symbols were present in over 850 datasets.
In this work we have used a vote counting approach to quantify co-expression for approximately 400 million (20,676*20,676) gene pairs. We used these pairs to establish if genes were co-regulated; co-regulation being defined as both genes increasing or decreasing in expression at least two-fold simultaneously, a standard (even if arbitrary) measure of differential expression. Then based on how often gene pairs were co-regulated compared to how often the single genes showed a two-fold increase or decrease in expression we calculated a co-expression ratio, which quantifies how strongly two genes are co-expressed, for all 20,676*20,676 gene pairs. The number of times two genes were simultaneously differentially expressed in the same direction (i.e. relative up or down regulated) was calculated using the equation:
Where x is the total number of comparisons and N the number of times two genes are differentially expressed (in the same direction) simultaneously.
For simplification the terms “UP” and “DOWN” are used in this formula. The actual directionality of the change is irrelevant as it is dependent on the direction of the comparison (i.e. whether one compares group 1 with group 2 or the other way around). Either way the results will be the same.
To reduce the effects of noise present in microarrays, an arbitrary two fold cut-off was selected, as is used in the majority of microarray analyses to indicate differential expression of genes. The total number of times each gene was relatively up or down regulated (i.e., >2 fold) was calculated using the following equation:
Where x is the total number of comparisons and i describes the current comparison between the different conditions.
From the values N and Q the co-expression ratio was deducted. The genes were then ranked based on their N:Q ratio. A ratio of 0.50 would indicate that if gene 1 is increased or decreased in expression in 50% of the cases gene 2 is also increased or decreased in expression. Each gene pair is present in at least 850 datasets and so the ratio is based on a large number of measurements.
Testing the co-expression map
To investigate the capacity of the co-expression map to provide biologically-significant results nine genes known to play a role in specific biological processes were investigated: three genes that are known to be active in fatty acid metabolism: Ppara, Acaa2 and Acadm; three genes known to be involved in immune response: Cd4, Cd8 and Il10; and three cell cycle genes: Cdc6, Cdc7 and Cdc8. These were selected before the analyses to reduce biases.
It was expected that genes co-expressed with these nine genes would be involved in the same biological processes; to test this assumption we inspected if any specific categories were over-represented by these groups of genes (see below for functional enrichment analysis).
Prediction of novel candidate genes in aging and complex diseases
In order to identify genes co-expressed with known disease genes, three disease related gene sets were included. The first of these was an aging gene set. It consisted of genes over-expressed with age obtained from a meta-analysis of aging microarray studies in mice, rats and humans that revealed several conserved genes increasing or decreasing in expression with age  (Additional file 9). The second gene set included was a set of cancer-related genes  (Additional file 9). This is a manually curated cancer set that includes only heritable cancer genes with strong evidence that mutations in these genes are causative for cancer. The third gene set added included genes known to cause diseases through mitochondrial complex I deficiencies. The genes in this set contain the nuclear mitochondrial complex I deficiency genes in the OMIM database (Additional file 9). Gene symbols that were not present in the co-expression map were not included in the analysis.
Using the above seed lists, a "guilt-by-association" approach was employed to find new potential disease-related gene targets. In this approach the top 5% most co-expressed genes with each gene were considered “friends” of that particular gene. For each of the 20,676 genes we calculated how many times it was “friends” with the disease related genes. Next, the probability that a gene was “friends” with this number of disease genes was calculated, as follows. How often each gene was “friends” with any other gene was counted, from which the chance a gene is "friends" with another gene was calculated:
Where p is in effect the chance that a particular gene occurs in the top 5% of a random gene.
We assume the following null hypothesis: The probability of a gene being a “friend” with one of the n
disease genes equals the probability p
of being a “friend” with a random gene. Then the probability of a gene being a “friend” with k
or more genes from the disease list can be calculated by using the right-tail of the binomial distribution.
Where Pr(K > = k) is the probability that a gene would be “friends” with k or more genes in the disease gene set; k is the number disease gene “friends”; n is the number of genes in the gene set. When calculating p the number of occurrences of a gene in the top 5% of all genes was included. This is necessary since some genes tend to be co-expressed more often in general than other genes.
Experimental validation of cancer-related genes Bc055324 and 4930547N16Rik
To test the predictions from the analyses using GeneFriends, we took un-annotated genes that were the most co-expressed with the cancer disease gene list. Validated siRNAs were available from Qiagen for two the human homologs of the top un-annotated genes: Bc055324 (C1ORF112) and 4930547N16Rik (C12ORF48). The experiment was conducted in human HeLa cells using standard culture conditions. A negative and a positive control were also included (Qiagen). The positive control contained a mix of several apoptosis inducing siRNAs, demonstrating that the transfection was successful through elevated cell death. The negative control consisted of siRNAs targeting non-mammalian genes. The full protocol followed for this experiment is described in Additional file 10.
Gene function enrichment analysis
The Database for Annotation, Visualization and Integrated Discovery (DAVID)  was used to identify enriched functional groups within these genes. The default settings were used in this analysis. The results were ranked based on p-value and genes with a p-value <10-6 were selected. This is a stringent cut-off since using a Bonferroni correction for multiple testing results in: 0.05/20,677 > 10-6
. In addition, a randomization test showed this is a very stringent cut-off in which no false positives are expected by chance. The test entailed the construction of several random sets of genes which were then used to find co-expressed genes. The p-values found for all results were >10-5, indicating that genes with a p-value <10-5 are very unlikely false positives.
Next, to understand the significance of the DAVID enrichment score, 1000 genes were randomly selected and used as an input for DAVID. This resulted in an enrichment score of 2.2 with an FDR score of 0.7 for the most significant category found. The same was done for smaller numbers resulting in similar scores. This indicates that the Enrichment scores of >10 and FDR <10-10 are very significant and cannot be found by random chance.
Benchmarking using our co-expression map and COXPRESSdb revealed similar results (Additional file 2), suggesting our co-expression map is not inferior to those built using correlation measures. Therefore, our work demonstrates that vote counting is a viable method to build co-expression maps.
One concern is that some of the genes in functional categories were assigned based on their expression pattern; if this would be the case it could lead to circular reasoning. However, an analysis conducted by the DAVID team shows that less than 1% (116/20676) of the genes is grouped based solely on their expression pattern (see Additional file 11 for a list of these genes).
A BLAST search was conducted on the Bc055324 gene. The protein sequence for this gene was recovered from GenBank version: Bc055324.1. This sequence was PSI-blasted against the non-redundant protein sequence database. Then the search was iterated twice including all sequences recovered in the initial search.