In silico identification and comparative analysis of differentially expressed genes in human and mouse tissues
© Pao et al. 2006
Received: 22 November 2005
Accepted: 21 April 2006
Published: 21 April 2006
Screening for differentially expressed genes on the genomic scale and comparative analysis of the expression profiles of orthologous genes between species to study gene function and regulation are becoming increasingly feasible. Expressed sequence tags (ESTs) are an excellent source of data for such studies using bioinformatic approaches because of the rich libraries and tremendous amount of data now available in the public domain. However, any large-scale EST-based bioinformatics analysis must deal with the heterogeneous, and often ambiguous, tissue and organ terms used to describe EST libraries.
To deal with the issue of tissue source, we carefully screened and organized more than 8 million human and mouse ESTs into 157 human and 108 mouse tissue/organ categories, to which we applied an established statistical test at several p value thresholds to identify genes differentially expressed in different tissues. Further analysis of the tissue distribution and level of expression of human and mouse orthologous genes showed that tissue-specific orthologs tended to have more similar expression patterns than those lacking significant tissue specificity. On the other hand, a number of orthologs were found to have significant disparity in their expression profiles, hinting at novel functions, divergent regulation, or new ortholog relationships.
Comprehensive statistics on the tissue-specific expression of human and mouse genes were obtained in this very large-scale, EST-based analysis. These statistical results have been organized into a database, freely accessible at our website http://gln.ibms.sinica.edu.tw/product/HMDEG/EST/index.php, for easy searching of human and mouse tissue-specific genes and for investigating gene expression profiles in the context of comparative genomics. Comparative analysis showed that, although highly tissue-specific genes tend to exhibit similar expression profiles in human and mouse, there are significant exceptions, indicating that orthologous genes, while sharing basic genomic properties, could result in distinct phenotypes.
High-throughput analysis of gene expression offers a powerful means of studying how genes work and of uncovering the secrets encoded in genome sequences. Differential gene expression, which plays a key role in various cellular processes, can be quantified by analyzing a large number of transcription products. To do so, several large-scale transcript detection technologies have been developed, chief among which are variants of microarray technology [1, 2], expressed sequence tags (ESTs), and serial analysis of gene expression (SAGE). Although each of these has its own limitations [5–10], combined with bioinformatics and statistical analysis, they have been successful in revealing genes expressed differentially in different tissues or in different physiological or phenotypical states and in yielding unprecedented insights into the complicated interactions of expressed genes and their cellular functions [10–12].
In this work, the EST database for human and mouse was analyzed to identify tissue-specific and differentially expressed genes. ESTs are "single-pass" sequences of randomly selected clones of expressed genes from specific tissues, organs, or cell types. Because EST clone frequency is, in principle, proportional to the expression level of its corresponding gene in the sampled tissue, tissue-specific or differentially expressed genes can be identified by their significantly different number of EST transcripts seen in unbiased cDNA libraries from different tissues [13, 14]. Data on ESTs have been accumulating in the public domain for more than a decade and, at the present time, there are more than 5.3 million entries for human and more than 3.9 million for mouse. ESTs are also well-organized in UniGene clusters, which are linked to other types of information, allowing gene-centered analysis.
Several EST-based tools have been developed to extract gene expression profiles. BodyMap uses its own standardized and non-normalized EST libraries exclusively for high-quality expression profiling, but its sample size of fewer than half a million EST sequences from 64 human and 39 mouse tissues may not give a complete picture of genome-wide gene expression [17, 18]. TissueInfo and ExQuest (Expressional Quantification of ESTs) are similar to each other in that they both compare EST sequences against dbEST using MegaBlast to extract the tissue information associated with each matching EST. However, they do not provide quantified expression profiles for genes identified as differentially expressed under a specified statistical cut-off.
The present work adopted a gene-centered strategy, taking advantage of the well-annotated and widely used UniGene clusters, in which ESTs are grouped in units of genes. This allows searching of genes, eliminates the need for sequence comparison, a computationally expensive procedure given the number of ESTs accumulated in the database, and avoids difficulties in matching and distinguishing between homologous genes.
Because some of the EST libraries were derived from unspecified tissues or under artificially modified expression conditions, we removed 1,898 such human libraries (out of 8,145; 23.3%) and 211 such mouse libraries (out of 841; 25.1%) from our analysis (see Methods) and organized the rest into a hierarchy of manually curated tissue/organ classes. These EST data were then subjected to the statistical test of Audic and Claverie, known as the A-C test, which has been shown to perform better than several other statistical tests for pairwise comparison of gene expression data in tag sampling experiments. In all, genes preferentially expressed in different tissues at various levels of specificity in 157 human and 108 mouse tissues were identified. The results were evaluated by comparison with microarray results for 17 tissues and with the reported expression of several genes in different tissues and the genes reported to be expressed in a given tissue [24–29]. The expression profiles of human-mouse orthologous genes that were differentially expressed in normal tissues were also compared and analyzed.
Table 1. Percentage of genes identified as differentially expressed, as defined by different p values. Panels: A. Human differentially expressed genes; B. Mouse differentially expressed genes. Rows in both panels: No. of tissues expressing the gene.
To further evaluate the usefulness of our work, we compared our results with published data for several known tissue-specific genes. KLK3, TMEM10, and AMBP are three notable examples. KLK3, a member of the kallikrein gene family, is prostate-specific. In our analysis, KLK3 was identified in the prostate with a very high specificity (p < 1E-99). TMEM10, a recently reported novel human brain-specific gene, was also found to be specifically expressed in the forebrain (p = 2.57E-27), whole brain (p = 9.49E-20), hippocampus (p = 8.77E-10), and hypothalamus (p = 2.43E-08). Alpha-1-microglobulin/bikunin precursor (AMBP) is a well-known gene exclusively expressed in liver both in human and mouse [27, 28], and our data showed that AMBP was expressed with very high specificity in the liver (p < 1E-99) for both human and mouse. In addition to these three specific examples, 97.2% of the human placenta-specific genes identified by Miner and Rajkovic and all of the human brain-specific genes reported by Huminiecki et al. were found to show the same tissue specificity (p < 1E-6) in our study (data not shown).
Table 2. Correlation analysis of human and mouse orthologous genes. r indicates the Pearson correlation coefficient of the A-C test p values for orthologs expressed in at least 3 normal tissues in common in human and mouse and expressed in at least one human tissue and one mouse tissue with p < threshold. Columns: p value threshold; No. of ortholog pairs; ave r (r ≥ 0); ave r (r < 0); No. of pairs with r ≥ 0; No. of pairs with r < 0.
Table 3. Number of significant tissue-specific (p < 1E-6) ortholog pairs with different strengths of association. r indicates the Pearson correlation coefficient for the A-C test p values for orthologs expressed in at least 3 normal tissues in common in human and mouse and expressed in at least one human tissue and one mouse tissue with p < 1E-6. Columns: Strength of association; Number of pairs; Number of pairs with a positive correlation (r > 0); Number of pairs with a negative correlation (r < 0).
Another example is IL2RG, which is reported to be essential for the development of T and NK lymphocytes and mutation of which can cause severe combined immunodeficiency disorder (SCID). We found that human IL2RG and its mouse ortholog Il2rg were both expressed in 13 tissues with highly similar tissue specificities (r = 0.9), and, in accordance with their function, were preferentially expressed in T cells, lymphocytes, leukocytes, and whole blood (Fig. 3B).
In addition to those expressed in at least three common tissues, 324 orthologs were not expressed in any tissue in common in human and mouse; of these, 240 showed a high specificity (p < 1E-6) for at least one tissue. For example, human HATH6 was preferentially expressed in stomach ascites (p = 2.43E-11) and the stomach (p = 5.61E-07), whereas its mouse ortholog, Atoh8, was testis-specific (p = 2.75E-08). Our literature search revealed one study on Atoh8, which indicated that it is a distant mammalian homologue of the Drosophila proneural gene atonal and is expressed in neural cells, as shown by Northern blots; in that study, however, only brain and whole embryo were profiled and no data were given for expression in the stomach or testis.
Another human gene, LY64, was identified as preferentially expressed in human B cells (p = 5.34E-15), leukocytes (p = 5.28E-13), lymphocytes (p = 1.13E-13), lymph (p = 2.45E-10), and whole blood (p = 2.16E-12), whereas the mouse ortholog Ly64 was highly specific for the colon (p < 1E-99) and cecum (p = 1.20E-11). This drastic discrepancy was also seen in the microarray data. Mouse Ly64 was initially identified as the ortholog of human LY64 with 74% amino acid identity. However, another human gene, MUC13, was later shown to have 52% amino acid identity to mouse Ly64, and both MUC13 and Ly64 were found to be expressed at highest levels in the large intestine and rectum. In agreement with this, our analysis showed that MUC13 was specifically expressed in the colon (p < 1E-99).
We have created a web-based database, named HMDEG (a database for Human and Mouse Differentially Expressed Genes), along with search utilities to facilitate free access to, and easy searching of, our results for both normal and diseased tissues. For example, selecting a specific tissue or organ in the pull-down menu displays a full list of genes expressed differentially in that tissue/organ in order of increasing p value, along with the corresponding UniGene cluster ID, gene name, and gene description. Searches by gene name or EST accession number, and retrieval of the expression profiles of the corresponding human or mouse orthologs, are also supported. The whole database is available for download upon request.
The total numbers of genes we tested from normal tissues were 72865 for human and 30172 for mouse. The number of genes classified as "differentially expressed" was dictated by the p value threshold (Fig. 1), where one expects more false positives for larger p values. The number of false positives, that is, genes falsely classified as "differentially expressed", can be estimated based on Bonferroni correction: at a p value threshold of 1E-6, for example, the predicted numbers of false positives were 0.07 for human (72865 × 1E-6) and 0.03 for mouse (30172 × 1E-6). This, together with the observation that most genes were expressed in three or fewer tissues at p < 1E-6 (Table 1), suggested that 1E-6 was a reasonable threshold to use for detecting differentially expressed genes in our analysis. Note also that the p value was used here merely as an index to rank expression level and should not be taken as a bona fide probability measure.
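The false-positive estimate above is a single multiplication, and a short Python sketch makes it easy to repeat for other thresholds. The function name is hypothetical, and the calculation assumes that p values of non-differential genes behave as uniform under the null, which the text itself cautions is only approximate for this index.

```python
def expected_false_positives(num_genes_tested: int, p_threshold: float) -> float:
    """Expected number of genes passing the threshold by chance alone.

    Assumption (not from the source): null p values are uniform on [0, 1],
    so the expected count is simply genes tested times the threshold.
    """
    return num_genes_tested * p_threshold


# Gene counts and threshold are the ones quoted in the text.
human_fp = expected_false_positives(72865, 1e-6)
mouse_fp = expected_false_positives(30172, 1e-6)
print(f"human: {human_fp:.2f}, mouse: {mouse_fp:.2f}")  # human: 0.07, mouse: 0.03
```

Relaxing the threshold by a factor of ten raises both estimates tenfold, which is one way to see why looser thresholds admit more false positives.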
Overall, our analysis showed that genes identified as differentially expressed by EST analysis generally did not correspond well to those detected by microarray; a similar observation of a weak correlation between the two systems has been previously noted. Nevertheless, as the p value threshold of the A-C test defining differential expression became more stringent, the correlation became more evident, although the degree to which this occurred varied with tissue type (Fig. 2). The factors responsible for the discrepancies between different experimental methods and between different tissues remain poorly understood and require future investigations.
Similar to the comparison with microarray, the tissue-based p value correlation between human and mouse orthologs also became stronger as the threshold for defining tissue-specific orthologs was set smaller, suggesting that tissue-specific orthologs tend to have more similar expression patterns than those lacking significant specificity (Table 2). At p < 1E-6, the results of our analysis of a few genes known to be tissue-specific agreed with the published data, and the majority (~60%) of human and mouse orthologs exhibited strong (0.6 ≤ r < 0.8) or very strong (r ≥ 0.8) correlations in terms of their tissue distribution and specificity (Table 3).
Orthologs with significant disparity were also observed. Some, such as KIAA0748, MS4A1, and SLC2A6, differed from their orthologous counterpart only in the level of specificity (p value). Others, such as HATH6 and its mouse ortholog, are preferentially expressed in entirely different tissue(s). Many factors, such as heterogeneity of the tissue samples used to construct EST libraries and insufficient ESTs for these genes, could contribute to these significant disparities. Inaccurate ortholog pairing is also a potential source of error. For example, with the identification of MUC13, it is now evident that Ly64 had been mistaken for the ortholog of LY64. This mistake has been corrected in a recent release of HomoloGene (on Mar 24, 2005), but is still present in MGI (Mouse Genome Informatics), another widely used curated database of human and mouse orthologous genes. Of course, the observed disparities, especially those substantiated by other sources of data, may indeed represent real phenomena, suggesting that some orthologous genes, despite sharing similar genotypic features, could have disparate phenotypes.
The present analysis has yielded a useful tool to aid transcriptomic research into human and mouse genes. Obvious applications include the ready retrieval of information on genes expressed differentially in a tissue of interest and the tissue distribution and expression specificity of a particular gene or of a human and mouse ortholog pair. The presence of orthologs with divergent expression profiles may hint at novel functions, divergent regulation, or new ortholog relationships and guide future studies.
Raw data of EST reports from dbEST (at 2003/05/23 for human and 2003/07/10 for mouse) and cluster information from UniGene (build #161 for human and build #128 for mouse) were downloaded from NCBI. We parsed the EST reports to extract EST data of Homo sapiens and Mus musculus, from which we retrieved the EST unique identifier (GI number), GenBank accession number, and library information, including "Organism", "dbEST lib id", "Lib Name", "Tissue type", and "Organ". For each EST record, we retrieved its corresponding UniGene data, including cluster ID, gene name (gene symbol), and gene description.
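As an illustration of the field extraction described above, the following Python sketch pulls the five library fields out of one report. The "Field: value" line layout, the sample record, and the function name are assumptions made for this example; the actual dbEST report layout may differ.

```python
# Library fields named in the text, keyed in lowercase for case-insensitive lookup.
WANTED = {name.lower(): name for name in
          ("Organism", "dbEST lib id", "Lib Name", "Tissue type", "Organ")}


def parse_library_fields(report_text: str) -> dict:
    """Collect the library fields from one EST report.

    Assumes (hypothetically) one "Field: value" pair per line.
    """
    fields = {}
    for line in report_text.splitlines():
        key, sep, value = line.partition(":")
        canonical = WANTED.get(key.strip().lower())
        # Keep only the first occurrence of each wanted field.
        if sep and canonical and canonical not in fields:
            fields[canonical] = value.strip()
    return fields


# Entirely made-up record, for illustration only.
sample_report = """\
GenBank Acc:    XX000000
Organism:       Homo sapiens
dbEST lib id:   0000
Lib Name:       hypothetical adult liver library
Tissue type:    liver
Organ:          liver
"""

library = parse_library_fields(sample_report)
print(library["Lib Name"], "|", library["Tissue type"], "|", library["Organ"])
```

The three printed values are the title, tissue, and organ triplet used for library classification.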
For each EST library, we extracted a triplet consisting of title, tissue, and organ from, respectively, the fields "Lib Name", "Tissue type", and "Organ" in the dbEST report files. Based on the triplet, each library was classified into a corresponding tissue category, according to the TissuDB tissue hierarchy. Our library classification process is illustrated in Fig. 6. Libraries without a definite pathological description in the triplet were considered to be derived from normal tissues. To mitigate variation due to unspecified tissue and artificially modified expression, libraries described as pooled, mixed, subtracted, differentially displayed, normalized, or coming from multiple tissues were excluded. Libraries without a clear description in the triplet were also discarded. Some artificially modified libraries may still have escaped this screening, but their effect on the present analysis should be minimal; moreover, some of them may in fact equalize the expression count, thus making detection of differential expression more stringent.
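The exclusion rules above can be approximated by a keyword screen over the triplet. This Python sketch is only a rough stand-in for the curated classification: the patterns, the handling of "non-normalized", and the rule that an empty tissue and organ means "no clear description" are all assumptions for illustration.

```python
import re

# Terms from the text that mark a library for exclusion. The lookbehind keeps
# "non-normalized" libraries, which are not artificially modified.
EXCLUSION_PATTERNS = [
    r"pooled",
    r"mixed",
    r"subtracted",
    r"differentially displayed",
    r"(?<!non-)normalized",
    r"multiple tissues",
]


def keep_library(title: str, tissue: str, organ: str) -> bool:
    """Return True if the (title, tissue, organ) triplet passes the screen."""
    if not (tissue.strip() or organ.strip()):
        return False  # assumed meaning of "no clear description"
    text = " ".join((title, tissue, organ)).lower()
    return not any(re.search(pattern, text) for pattern in EXCLUSION_PATTERNS)


print(keep_library("adult liver library", "liver", "liver"))           # True
print(keep_library("normalized infant brain", "brain", "brain"))       # False
print(keep_library("non-normalized infant brain", "brain", "brain"))   # True
```

Plain substring matching is blunt and can misfire on unusual library names, so a screen like this would at most pre-sort libraries before manual review.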
In all, we downloaded 5,372,149 human ESTs from 8,145 EST libraries and the screening process described above left us with 6,247 libraries and 3,352,546 ESTs distributed in 96,444 UniGene clusters for analysis. Similarly for mouse, 841 EST libraries were downloaded, of which 630 survived the same elimination process, leaving 3,009,721 ESTs (out of 3,132,883) distributed in 30,172 UniGene clusters for analysis.
The 6,247 human libraries were classified by the process shown in Fig. 6 into 157 tissue/organ categories, of which 94 were normal, 53 tumor-related, and 10 related to other diseases. The 630 mouse libraries were classified into 108 tissue/organ categories, of which 99 were normal, 9 tumor-related, and none were related to other diseases. To simplify matters, only the analysis results for normal tissues are presented here; those for diseased tissues will be reported elsewhere.
To profile the genes expressed in a tissue, we extracted the UniGene cluster ID of the ESTs that were classified to the target tissue. For each gene in the target tissue, we performed the A-C test to evaluate tissue specificity:
$$p(y \mid x) = \left(\frac{N_2}{N_1}\right)^{y} \frac{(x+y)!}{x!\,y!\,\left(1+\frac{N_2}{N_1}\right)^{x+y+1}}$$
where x and y are the numbers of ESTs clustered in the same gene, but expressed, respectively, in the target tissue and in all other tissues, and N1 and N2 are, respectively, the total number of ESTs from the target tissue and from all other tissues. Following the criteria for using the Poisson distribution, tissues with insufficient ESTs (N1 or N2 < 1000) and clusters with a biased data set (x ≥ N1 × 5% or y ≥ N2 × 5%) were excluded from the statistical test.
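A minimal Python sketch of the A-C probability, using the symbols defined above and the form of the statistic published by Audic and Claverie, may help in reproducing the test. It computes the point probability p(y|x) in log space to avoid overflow; whether tail probabilities were summed in the present analysis is not stated here, so that step is left out. The function names and default arguments are hypothetical.

```python
from math import exp, lgamma, log


def ac_probability(x: int, y: int, n1: int, n2: int) -> float:
    """Audic-Claverie point probability of seeing y ESTs given x.

    p(y|x) = (N2/N1)^y * (x+y)! / (x! * y! * (1 + N2/N1)^(x+y+1))
    Factorials are evaluated through lgamma to stay numerically stable.
    """
    ratio = n2 / n1
    log_p = (y * log(ratio)
             + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1 + ratio))
    return exp(log_p)


def eligible(x: int, y: int, n1: int, n2: int,
             min_ests: int = 1000, max_fraction: float = 0.05) -> bool:
    """Apply the exclusion criteria quoted in the text."""
    if n1 < min_ests or n2 < min_ests:
        return False  # tissue with insufficient ESTs
    if x >= n1 * max_fraction or y >= n2 * max_fraction:
        return False  # cluster dominates one of the samples
    return True


# Made-up counts: 40 ESTs in the target tissue versus 5 elsewhere.
x, y, n1, n2 = 40, 5, 20000, 3000000
if eligible(x, y, n1, n2):
    print(f"p(y|x) = {ac_probability(x, y, n1, n2):.3e}")
```

For fixed x the probabilities sum to one over all y, which gives a quick sanity check on any implementation.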
The raw data of HomoloGene (released on Feb. 2, 2004) were downloaded from NCBI. Using the taxonomy ID of this database, we extracted curated human and mouse orthologous gene pairs and discarded those annotated as putative. For the curated orthologous gene pairs, we obtained their gene names and UniGene cluster IDs and linked them to the expression profiles we had computed using the A-C test. For each ortholog pair expressed in at least 3 tissues in both human and mouse, the association between their expression profiles was analyzed by applying Pearson's correlation to their tissue specificity p values. We classified the strength of association, using the absolute value of Pearson's correlation coefficient (r), as follows: 0–0.19 was regarded as very weak, 0.2–0.39 as weak, 0.40–0.59 as moderate, 0.6–0.79 as strong, and 0.8–1 as very strong.
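The correlation and binning steps above can be sketched in Python as follows. The tissue names and p values are invented for illustration, the function names are hypothetical, and restricting the calculation to tissues shared by both species is an assumption based on the requirement of at least 3 tissues in common.

```python
from math import sqrt


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


def strength_of_association(r: float) -> str:
    """Bin |r| into the five classes listed in the text."""
    magnitude = abs(r)
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude < 0.8:
        return "strong"
    return "very strong"


def ortholog_correlation(human: dict, mouse: dict) -> tuple:
    """Correlate tissue-specificity p values over tissues present in both."""
    shared = sorted(set(human) & set(mouse))
    r = pearson_r([human[t] for t in shared], [mouse[t] for t in shared])
    return r, strength_of_association(r)


# Invented p values for one ortholog pair.
human_p = {"liver": 1e-20, "brain": 0.5, "kidney": 0.1}
mouse_p = {"liver": 1e-18, "brain": 0.4, "kidney": 0.2, "testis": 0.9}
r, label = ortholog_correlation(human_p, mouse_p)
print(f"r = {r:.2f} ({label})")
```

The bins are treated as half-open intervals, so a coefficient of exactly 0.6 counts as strong and exactly 0.8 as very strong.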
We are thankful for the help of the following people: Dr. John Hogenesch and Dr. Andrew Su, Genomics Institute of the Novartis Research Foundation, for providing the raw data for the microarray datasets published on symatlas.gnf.org; Mr. Chi-Yan Hsiao (Institute of Molecular Medicine, National Taiwan University College of Medicine), for discussions and suggestions on tissue hierarchy construction and library classification; Dr. Chun-houh Chen (Institute of Statistical Science, Academia Sinica), for discussions on statistics; Mr. Chia-Chin Wu, for help in constructing the HMDEG website. This work was supported by the Research Project on Genomics and Proteomics of the Academia Sinica (grant AS92IBMS1).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.