Microarray data integration for genome-wide analysis of human tissue-selective gene expression
© Wang et al. 2010
Published: 2 November 2010
Skip to main content
© Wang et al. 2010
Published: 2 November 2010
Microarray gene expression data are accumulating in public databases. The expression profiles contain valuable information for understanding human gene expression patterns. However, the effective use of public microarray data requires integrating the expression profiles from heterogeneous sources.
In this study, we have compiled a compendium of microarray expression profiles of various human tissue samples. The microarray raw data generated in different research laboratories have been obtained and combined into a single dataset after data normalization and transformation. To demonstrate the usefulness of the integrated microarray data for studying human gene expression patterns, we have analyzed the dataset to identify potential tissue-selective genes. A new method has been proposed for genome-wide identification of tissue-selective gene targets using both microarray intensity values and detection calls. The candidate genes for brain, liver and testis-selective expression have been examined, and the results suggest that our approach can select some interesting gene targets for further experimental studies.
A computational approach has been developed in this study for combining microarray expression profiles from heterogeneous sources. The integrated microarray data can be used to investigate tissue-selective expression patterns of human genes.
There are many different types of cells in the human body, and similar cells group together to form a tissue with a specialized function. Multiple tissues constitute an organ such as brain, heart or liver. Gene expression variation is the primary determinant of tissue identity and function. Certain genes are expressed specifically or preferentially in a particular tissue. These genes are broadly called tissue-selective genes . Note that tissue specificity is regarded as a special case of tissue selectivity, and tissue-specific genes are expressed only in a particular tissue. It is a fundamental question in biology to understand how selective gene expression underlies tissue development and function. Moreover, tissue-selective genes are implicated in many complex human diseases , and identification of these genes may provide valuable information for developing novel biomarkers and drug targets .
Tissue-selective expression was traditionally studied at the single-gene level with time-consuming techniques such as Northern blot and in situ hybridization. With the recent development of high-throughput technologies, biologists can perform genome-wide gene expression profiling in various tissues. These high-throughput technologies include Expressed Sequence Tag (EST) sequencing, Serial Analysis of Gene Expression (SAGE), and DNA microarrays. Yu et al.  analyzed the NCBI EST database (dbEST) to select a set of human genes that are preferentially expressed in a tissue of interest. The selection was based on the expression enrichment score, which was defined as the ratio between observed and expected number of ESTs for a gene. For the selected tissue-selective genes, regulatory modules were detected by examining the promoter motifs and their relationships with transcription factors. However, EST data are generated mainly for transcript sequence information, and EST counts can only be used as rough estimates of gene expression levels. Siu et al.  investigated gene expression patterns in different regions of the human brain by using SAGE, and identified some brain region-selective genes. Kouadjo et al.  also used the SAGE strategy to identify housekeeping and tissue-selective genes in fifteen mouse tissues. While SAGE tag counts can provide reliable estimation of gene expression, it is rather inefficient and expensive to use SAGE for profiling a large number of tissue samples with biological replicates.
The DNA microarray technology has been widely used to simultaneously profile the levels of thousands of mRNA transcripts in various tissues, and may hold great promise for elucidating the molecular mechanisms of complex human diseases. Many microarray datasets have been generated for identifying disease-associated biomarkers, classifying disease types, and predicting treatment outcomes. However, only a few microarray studies were designed to investigate human tissue-selective gene expression. Su et al.  used custom oligonucleotide arrays to examine the expression patterns of predicted genes across a panel of human and mouse tissues. The NCBI Gene Expression Omnibus (GEO at http://www.ncbi.nlm.nih.gov/geo/) has an Affymetrix microarray dataset for human body index of gene expression (GEO accession: GSE7307). Since each individual dataset does not contain a large number of expression profiles of various tissues, computational methods may be used to integrate the gene expression data from different microarray studies. Greco et al.  investigated tissue-selective expression patterns with an integrated dataset of microarray profiles publicly available at the GEO database. The relatively small dataset contained 195 expression profiles from six different microarray studies. The results suggested that gene expression data from Affymetrix GeneChip experiments could be integrated through pre-processing raw data (CEL files) with commonly used methods.
In this study, we have compiled a compendium of 2,968 microarray expression profiles of various human tissues from the NCBI GEO database. These expression profiles have been selected from 131 microarray datasets generated at different laboratories. Our data integration approach includes microarray data normalization, transformation, and quality control. The integrated data have been used to identify brain, liver and testis-selective genes using a new computational method based on both microarray hybridization intensities and detection calls. The results further suggest that the publicly available microarray expression profiles from heterogeneous sources can be integrated into a single dataset for examining gene expression patterns across various tissues.
Human microarray gene expression data are accumulating in public databases. These expression profiles have been generated for various research objectives, and show significant variations in data quality. To compile a compendium of high-quality microarray profiles for studying gene expression patterns, we manually curated the human microarray data publicly available in the NCBI GEO database (as of November 3, 2009). The following criteria were used to select microarray expression profiles in this study. First, the profiles had to be generated using the Affymetrix HG-U133 Plus 2.0 Array, a platform for complete coverage of the human genome with 54,675 probe sets. This array platform was used by the majority of human gene expression profiles deposited in the GEO database. Second, a detailed description of the microarray profiling study and raw data in CEL file format was available. The description contained important information about a microarray sample (e.g., tissue source, clinical condition, treatment, etc). Third, the expression profiles had to be obtained using normal tissue samples. Microarray profiles of cancer cells or diseased tissues were excluded from selection. Fourth, the tissue sample used for microarray profiling should not be cultured in vitro or treated with any drugs before RNA extraction. No expression profiles of primary or secondary cell cultures were selected for this study.
List of human tissues and microarray expression profiles
# of profiles selected
# of profiles integrated
Brain (various regions)
Blood (various cell types)
Microarray raw data in CEL file format were downloaded from the GEO database, and then normalized by using the dChip software (available at http://www.dchip.org). As a widely used tool for microarray data analysis, dChip can display and normalize CEL files with a model-based approach . For a given group of CEL files, dChip can be used to calculate the model-based expression values and make the qualitative detection calls for each array. The detection call (Present, Marginal or Absent) provides a statistical assessment about whether the perfect matches (PMs) show significantly more hybridization signal than the corresponding mismatches (MMs) in a probe set. Since the detection call and expression level are computed in different ways, a gene transcript with an Absent call may still be given an expression value (although usually low).
Second, each group of microarray profiles was normalized by using the invariant set method . For each normalization group, the expression profile with median overall intensity was chosen as the baseline array, against which the other profiles were normalized at probe intensity level. A subset of PM probes with small rank difference between the profile to be normalized and the baseline array were chosen as the invariant set for fitting a normalization curve. The normalization transformation was then performed for all the probes in the profile based on the curve . While the invariant set normalization method could reduce the variation in microarray profiles from different studies, it might not be applied to an expression dataset with various tissue types. Owing to the biological variation of gene expression across different tissues, a baseline array should be used to normalize the microarray profiles of each tissue type (or similar tissues).
Finally, the normalized microarray profiles were integrated into a single dataset after outlier array exclusion and global median transformation. When fitting the statistical model to a probe set, dChip used an outlier detection algorithm to identify array-outliers whose response pattern for the probe set was significantly different from the consensus probe response pattern in the other arrays . After the model was fitted for all probe sets, the percentage of probe sets detected as array-outliers was calculated for each array. If the percentage exceeded 15%, the array was discarded as an outlier array. In this study, only 62 outlier arrays were detected for all the 3,030 selected expression profiles (Table 1). Global median transformation was then applied to the remaining profiles. Each expression value in a profile was divided by the profile’s median value. The transformation was necessary because the expression profiles from different normalization groups often had different median values. Thus, the integrated dataset had 2,968 expression profiles with the same median value (i.e., 1.00).
In this study, a new computational method has been designed to analyze the integrated microarray data for identifying tissue-selective genes, which refers to the genes specifically or preferentially expressed in a particular tissue. The computational task is not trivial for the following reasons. First, the expression profiles have been compiled from various studies, in which tissues at different ages and in different conditions were used for microarray profiling. Thus, the microarray profiles of the same tissue type should not be considered as biological replicates. Second, some tissue-selective genes can be expressed at certain developmental stages or in specific conditions, and their expression may not be consistently detected in all the microarray profiles of a tissue type. Third, microarray data are inherently noisy. It was thus desired that both the expression values and detection calls of microarray profiles can be utilized for tissue-selective gene identification.
where N e is the total number of expression profiles in the experiment set, and N c is the total number of expression profiles in the control set.
where is the mean expression level of the selected probe set in the S e experiment arrays with significant expression, and is the mean expression level in control arrays. In this study, the control arrays were sorted according to their expression values for the selected probe set, and the top S e control arrays with the highest expression values were used to compute the mean, . The probe sets with Score 2 ≤ 0 were excluded from consideration for tissue-selective genes.
where w 1 and w 2 are two weights for Score 1 and Score 2, respectively. In this study, w 1 = 1 and w 2 = 1 were used to calculate the priority score for each selected probe set. Moreover, the statistical significance of the tissue-selective expression pattern was evaluated by the permutation analysis. The hybridization signals of a probe set, including its expression values and detection calls, were permuted, and then divided into the experiment and control set to calculate the priority score. After one million permutations were performed for each selected probe set, the significance level (p -value) was calculated as the fraction of permutations that gave rise to scores greater than or equal to the actual priority score of the probe set. The p -value thus provided an estimation of the probability for observing the tissue-selective expression pattern by chance.
A compendium of 2,968 expression profiles of various human tissues have been compiled from 131 microarray studies. These expression profiles have been combined into a single dataset after global normalization, and then used for the genome-wide analysis of tissue-selective gene expression. Although the analysis can be performed for any tissues with available microarray data (Table 1), we present in this paper the results from three case studies on brain, liver and testis-selective gene expression.
The human brain is highly complex, and contains 50-100 billion neurons. There are many different brain regions with specific functions. For example, the frontal lobe is involved in higher mental functions and long-term memories, whereas the occipital lobe is the visual processing center. In this study, the microarray expression profiles of different brain regions were combined into the experiment set (616 profiles), and compared with the expression profiles of non-brain tissues in the control set (2,352 profiles). Thus, the brain-selective genes identified in this study might be involved in basic neuron functions such as neural signal processing and transmission via synapses (complex membrane junctions between neurons).
List of high-scoring genes with selective expression in the brain1
Glutamate receptor, ionotropic, N-methyl D-aspartate 1
Myelin basic protein
Leucine-rich repeat LGI family gene 3
Myelin oligodendrocyte glycoprotein
Neurotensin receptor 2
Glial fibrillary acidic protein
cDNA sequence (GB: R44603)
cDNA sequence (GB: H12214)
Contactin 2 (axonal)
Protocadherin gamma subfamily C, 5
cDNA sequence (GB: BC040662)
Tetratricopeptide repeat domain 9B
Calcium binding protein 1
GABA-A receptor subunit delta
Myelin-associated oligodendrocyte basic protein
GABA-A receptor subunit alpha 1
Protein tyrosine phosphatase, non-receptor type 5
Chromosome 21 open reading frame 131
Peroxisomal biogenesis factor 5-like
F-box and leucine-rich repeat protein 16
List of high-scoring genes with selective expression in the liver 1
Mannan-binding lectin serine peptidase 2
Complement factor H-related 5
Complement factor H-related 3
C-reactive protein, pentraxin-related
Serpin peptidase inhibitor, clade C, 1
Coagulation factor II (thrombin)
Cytochrome P450, 7A1
Solute carrier family 17, member 2
Complement factor H-related 4
Hydroxyacid oxidase 1
Alcohol dehydrogenase 4 (class II), pi
Mannose-binding lectin 2, soluble
Cytochrome P450, 2C8
Similar to asparaginase
Chromosome 14 open reading frame 68
Perhaps more importantly, many genes identified in this study have been previously suggested to be expressed specifically or preferentially in the brain. These genes include GRIN1, MBP, LGI3, MOG, NTSR2, GFAP, CNTN2, PCDHGC5, CABP1, GABRD, MOBP and GABRA1 (Table 2). The protein encoded by the GRIN1 gene is a critical subunit of the glutamate receptor channel, and plays a key role in the plasticity of synapses underlying memory and learning . Genetic alterations in GRIN1 have been shown to be associated with Alzheimer's disease  and bipolar disorder . In this study, GRIN1 has the highest priority score with significant expression in 284 brain samples but none in the other tissues (Table 2). GABRD and GABRA1 encode two subunits of the GABA-A receptor, which binds the major inhibitory neurotransmitter GABA in the brain . GABA-A receptors are chloride channels that regulate membrane potential, and play structural roles in synapse maturation and stabilization. LGI3 encodes a leucine-rich repeat protein involved in the regulation of neuronal exocytosis . CABP1 is a neuron-specific member of the calmodulin superfamily, and modulates Ca2+-dependent activity of inositol 1, 4, 5-trisphosphate receptors . Both CNTN2 and PCDHGC5 encode immunoglobulin-like proteins important for the establishment and function of neural connections in the brain [17, 18]. In addition, MBP, MOG and MOBP encode constituents of the myelin sheath of oligodendrocytes, and GFAP encodes an intermediate filament protein of mature astrocytes in the central nervous system.
However, the expression and function of many other genes selected by the above analysis have not been well documented in the literature. For example, the TTC9B protein contains the tetratricopeptide repeat domain, and is conserved in other mammals, but its function in the brain is still unclear. In this study, the TTC9B gene shows significant expression in 408 out of 616 brain samples (Table 2). By contrast, in only 3 out of 2,352 control samples, significant expression is detected. Moreover, the mean expression level of TTC9B in the brain samples is 13.64-fold higher than that in the other tissues. As shown in Table 2, brain-selective expression patterns have also been demonstrated for four other genes (PTPN5, C21orf131, PEX5L and FBXL16) and three cDNA sequences (R44603, H12214 and BC040662), even though their functions in the brain remain to be characterized. The three sequences were obtained from brain cDNA libraries, but their corresponding genes were not determined. Altogether, the results suggest that the approach developed in this study can be used to not only confirm the brain-selective expression of some known genes, but also identify interesting targets for further experimental studies.
The liver plays a key role in metabolism, and its functions include plasma protein synthesis, detoxification, and production of bile necessary for digestion. To identify liver-selective genes, the microarray data were grouped into the experiment set consisting of 117 liver expression profiles and the control set containing 2,851 profiles of non-liver tissues. The parameters for the analysis are as follows: θ = 1.00 (array median value), min = 23 (~20% of liver arrays), and max = 29 (~1% of control arrays), where θ is the relative intensity threshold for significant expression, min is the minimum number of significant expression in the experiment set, and max is the maximum number of significant expression in the control set. There are 69 gene targets identified for potential liver-selective expression, and the priority score ranges from 1.64 to 5.88 (see Additional file 2). Based on the permutation analysis, the liver-selective expression patterns of all the selected genes are statistically significant (p < 0.000001). The expression patterns of these genes are shown in Figure 3.
Interestingly, 17 of the top 20 high-scoring genes listed in Table 3 are previously known to be expressed predominantly in the liver. In particular, nine genes (MASP2, CFHR5, CFHR3, CRP, SERPINC1, F2, CFHR4, APOA 5 and MBL2) are highly expressed in the liver, and their protein products are secreted to blood plasma. MASP2, CFHR5, CFHR3, CRP, CFHR4 and MBL2 play important roles in the innate immune defense against pathogens . SERPINC1 and F2 are involved in regulating the blood coagulation cascade . APOA5 encodes an apolipoprotein important for the regulation of plasma triglyceride level, a major risk factor for coronary artery disease . Six of the known liver-selective genes encode metabolic enzymes involved in cholesterol catabolism and bile acid biosynthesis (CYP7A1), the urea cycle (ARG1), glyoxylate detoxification (AGXT), and the oxidation of alcohols (ADH4) and other compounds (CYP2C8 and HAO1). In addition, HGFAC encodes a peptidase involved in hepatocyte growth factor activation, and C14orf68 encodes a liver-specific mitochondrial carrier protein. The other three high-scoring genes (SLC17A2, ASPG and TDO2) have not been previously shown to be expressed preferentially in the liver.
When compared with brain and liver tissues, many other tissues have fewer number of microarray expression profiles available (Table 1). The microarray dataset has only 36 expression profiles of the testis, which produces sperm and male sex hormones. To identify testis-selective genes, these 36 expression profiles (experiment set) were compared with 2,932 microarray profiles of non-testis tissues (control set) by using the following parameters: θ = 1.00 (array median value), min = 7 (~20% of testis arrays), and max = 29 (~1% of control arrays). The analysis resulted in 581 gene targets with the priority score ranging from 1.35 to 6.05 (see Additional file 3). The testis-selective expression patterns of these targets were found to be statistically significant by permutation testing (p < 0.000001). Figure 3 shows the expression patterns of the testis-selective gene targets.
List of high-scoring genes with selective expression in the testis1
Chromosome 9 open reading frame 11
Transition protein 2 (histone to protamine)
cDNA sequence (GB: BF508261)
Chromosome 2 open reading frame 53
Testis-specific serine kinase 3
cDNA sequence (GB: BC033504)
cDNA sequence (GB: AI204633)
Family with sequence similarity 24, A
cDNA sequence (GB: AI423933)
CPX chromosome region, candidate 1
IQ motif containing F6
cDNA sequence (GB: AA398245)
5'-nucleotidase, cytosolic IB
cDNA sequence (GB: AI799028)
cDNA sequence (GB: BC041892)
cDNA sequence (GB: AA428659)
Ring finger protein 148
E3 ubiquitin-protein ligase parkin
cDNA sequence (GB: BC035261)
The other high-scoring targets have not been previously shown to be testis-selective genes. PARK2 is known to be expressed in the brain, and mutations in this gene cause Parkinson disease . The results from this study suggest that the highest expression of PARK2 appears to occur in the testis (Table 4). There are five other genes (C2orf53, FAM24A, CPXCR1, IQCF6 and RNF148) whose expression and function in the testis have not been well documented in the literature. In addition, the high-scoring targets include nine cDNA sequences. Interestingly, all the sequences except BC033504 and AI423933 were obtained from testis cDNA libraries (BC033504 from a brain library and AI423933 from a glioblastoma library). Considering the relative small sample size of testis expression profiles, it is uncertain whether all the selected probe sets represent true testis-selective genes. However, the targets with high priority scores should provide a good starting point for experimental studies on testis-selective gene expression and function.
A comprehensive microarray dataset has been compiled in this study for genome-wide analysis of human tissue-selective gene expression. The dataset contains 2,968 expression profiles of various normal tissues from 131 microarray studies. A new computational method has been designed to identify tissue-selective genes using both microarray intensity values and detection calls. To demonstrate that the integrated microarray data can be used to investigate human gene expression patterns, we have examined the lists of potential brain, liver and testis-selective genes. Notably, many of the high-scoring targets are actually known tissue-selective genes, suggesting that the approach developed in this study works effectively. Furthermore, the approach can be used to identify some interesting targets with tissue-selective expression patterns. These targets may be used for further experimental studies on human gene expression and function.
This work is supported by the CSREES/USDA, under project number SC-1700355. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).
This article has been published as part of BMC Genomics Volume 11 Supplement 2, 2010: Proceedings of the 2009 International Conference on Bioinformatics & Computational Biology (BioComp 2009). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.