Improving gene expression similarity measurement using pathway-based analytic dimension
BMC Genomics volume 10, Article number: S15 (2009)
Gene expression similarity measuring methods have been developed and applied to search rapidly growing public microarray databases. However, current methods need to be improved to accurately measure the similarity between gene expression profiles from different platforms or different experiments.
We devised a new gene expression similarity measuring method based on pathway information. In short, the method measures the similarity between gene expression profiles after converting them into pathway-based expression profiles. To evaluate it, we conducted a cell type classification test, in which the pathway-based method showed higher classification accuracy. In particular, the pathway-based methods outperformed the conventional gene expression similarity method by up to 50% and 10% when the search databases were limited to cross-platform profiles and cross-experiment profiles, respectively.
The pathway-based gene expression similarity measuring method outperforms commonly used similarity measuring methods. Considering that public microarray databases consist of gene expression profiles from various experiments on various types of platforms, the pathway-based method could be successfully applied to searching large public microarray databases.
As microarray experiments have become widely used in various fields of biology, public microarray databases have grown rapidly each year. Currently, the two largest microarray databases, GEO and ArrayExpress, comprise several hundred thousand expression profiles representing various biological contexts across various species.
In accordance with this extensive collection of large-scale gene expression databases, database search methods have been developed to make the databases easily accessible and practically useful. Since microarray data are deposited in public databases in units of experiments, each consisting of several individual gene expression profiles, search methods have evolved in two directions: experiment dataset level search and individual gene expression profile level search.
Most experiment dataset level search methods depend on the dataset annotation provided by the authors of the dataset. Butte et al. have tried to classify the gene expression experiment datasets in GEO (GEO series) by annotating each GEO dataset with medical language terms such as UMLS and SNOMED [3–5], and to make dataset search based on gene expression variation possible. Zhu et al. built GEOmetadb, which makes text-match-based GEO dataset search more affordable than the original GEO database.
Alongside the attempts to search large public microarray databases at the experiment dataset level, individual gene expression profile level search methods have also been conceptualized and developed. GEST is the first implementation of an individual profile level search method. It uses a Bayesian similarity metric to measure the similarity between gene expression profiles. Horton et al. devised a fast similarity search algorithm and built a web-based similar gene expression search system, CellMontage [10, 11]. To make cross-platform gene expression profile search possible, they transformed all gene expression profiles into UniGene ID based gene expression profiles, averaging the expression values of genes mapped to the same UniGene ID when multiple genes map to a single UniGene ID, and then measured the similarity between expression profiles using the Spearman rank correlation coefficient. Cell type classification, used to validate the search power of CellMontage, revealed that this method is good enough to find similar expression profiles from the same platform, but not from different platforms.
Here we try to improve similar gene expression profile search. For this purpose, we devised a pathway-based gene expression similarity measuring method. Our pathway-based methods outperform the conventional method, especially for cross-platform and cross-experiment profile search.
Gene expression data
We used the set of gene expression profiles curated by the CellMontage group. Each gene expression profile in the CellMontage dataset, originally stored in GEO, is manually annotated with a cell type, and the gene expression values of the original profile are averaged to represent the expression value of the corresponding UniGene ID.
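The averaging step described above can be illustrated with a minimal sketch. The function name, the probe names and the UniGene IDs below are hypothetical and chosen only for illustration; this is not the CellMontage implementation, and skipping unmapped probes is an assumption.

```python
from collections import defaultdict

def average_by_unigene(probe_values, probe_to_unigene):
    """Collapse probe-level values to one mean value per UniGene ID."""
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        unigene = probe_to_unigene.get(probe)
        if unigene is not None:  # assumption: probes without a mapping are skipped
            grouped[unigene].append(value)
    return {ug: sum(vals) / len(vals) for ug, vals in grouped.items()}

# Toy data: two probes map to the same (made-up) UniGene ID, one probe is unmapped.
probe_values = {"probe_a": 2.0, "probe_b": 4.0, "probe_c": 10.0, "probe_d": 1.0}
probe_to_unigene = {"probe_a": "Hs.1", "probe_b": "Hs.1", "probe_c": "Hs.2"}

unigene_profile = average_by_unigene(probe_values, probe_to_unigene)
print(unigene_profile)
```

Representing a profile as a dictionary keyed by UniGene ID makes it straightforward to intersect the feature sets of two profiles from different platforms later on.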
For the classification procedure, we first selected cell types associated with at least two different platform types. For each selected cell type, we selected at most two gene expression profiles from each platform in the same experiment. After the selection procedure, a total of 442 gene expression profiles of 40 different cell types, from 54 different experiments with 23 different platform types, remained (see Additional file 1). Of these, one randomly selected gene expression profile from each cell type was used as a query profile, and the remaining profiles formed the search database. As a result, the query set and the search database contain 40 and 402 gene expression profiles, respectively.
We used the C2 database of MSigDB as the pathway data source for pathway summary profiling. Each UniGene ID in the gene expression profiles was mapped to the corresponding gene symbol of the 1,892 pathways in MSigDB using the NCBI UniGene database.
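A rough sketch of the selection and query split is given below. The record layout (`gsm`, `gpl`, `cell_type` keys), the function names, the accession IDs and the fixed random seed are all assumptions made for illustration, and the cap of two profiles per platform and experiment is omitted for brevity.

```python
import random
from collections import defaultdict

def select_cell_types(profiles, min_platforms=2):
    """Keep profiles whose cell type is observed on at least `min_platforms` platforms."""
    platforms = defaultdict(set)
    for p in profiles:
        platforms[p["cell_type"]].add(p["gpl"])
    return [p for p in profiles if len(platforms[p["cell_type"]]) >= min_platforms]

def split_queries(profiles, seed=0):
    """Draw one random query per cell type; the remaining profiles form the database."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    by_type = defaultdict(list)
    for p in profiles:
        by_type[p["cell_type"]].append(p)
    queries = [rng.choice(group) for group in by_type.values()]
    query_ids = {q["gsm"] for q in queries}
    database = [p for p in profiles if p["gsm"] not in query_ids]
    return queries, database

# Toy records with made-up accession IDs.
profiles = [
    {"gsm": "GSM_A", "gpl": "GPL_1", "cell_type": "liver"},
    {"gsm": "GSM_B", "gpl": "GPL_2", "cell_type": "liver"},
    {"gsm": "GSM_C", "gpl": "GPL_1", "cell_type": "thalamus"},
    {"gsm": "GSM_D", "gpl": "GPL_2", "cell_type": "thalamus"},
    {"gsm": "GSM_E", "gpl": "GPL_1", "cell_type": "kidney"},  # one platform only: dropped
]
selected = select_cell_types(profiles)
queries, database = split_queries(selected)
print(len(selected), len(queries), len(database))
```

With one query drawn per cell type, the database size is the number of selected profiles minus the number of cell types, which matches the 442, 40 and 402 figures reported above.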
Pathway expression profiling
Each gene expression profile was converted to a pathway-centric expression profile by averaging the expression values of the genes belonging to each pathway. The pathway expression for pathway k, consisting of N genes, is calculated by
$$P_k = \frac{1}{N}\sum_{i=1}^{N} G_i$$
where $G_i$ denotes the gene expression for gene i, for i = 1, ..., N.
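The pathway averaging can be sketched in a few lines of Python. The function name, gene symbols and pathway names are illustrative only. The sketch also assumes that pathway genes missing from the profile are ignored, so that N counts only the measured genes, and that pathways with no measured gene are dropped; the text does not specify either detail.

```python
def pathway_expression_profile(gene_profile, pathways):
    """Average the expression of measured genes in each pathway.

    gene_profile: dict mapping gene symbol -> expression value
    pathways:     dict mapping pathway name -> list of gene symbols
    """
    profile = {}
    for name, genes in pathways.items():
        # Assumption: genes absent from the profile do not contribute to N.
        values = [gene_profile[g] for g in genes if g in gene_profile]
        if values:  # assumption: pathways with no measured gene are dropped
            profile[name] = sum(values) / len(values)
    return profile

# Toy example with hypothetical pathway definitions.
gene_profile = {"TP53": 2.0, "MDM2": 4.0, "EGFR": 9.0}
pathways = {
    "PATHWAY_A": ["TP53", "MDM2", "CDKN1A"],  # CDKN1A is not measured
    "PATHWAY_B": ["EGFR"],
    "PATHWAY_C": ["NOT_MEASURED"],
}
pathway_profile = pathway_expression_profile(gene_profile, pathways)
print(pathway_profile)
```

The resulting dictionary has one entry per pathway, so two profiles from different platforms can be compared on their shared pathways rather than on their shared genes.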
Gene expression similarity measurement
We used two different scoring methods to measure the similarity between gene expression profiles. The first, the conventional method used by CellMontage, compares the common gene set between the two gene expression profiles being compared. We call this the CGSEP (common gene set expression profile) method. The second method compares the common pathway set between two pathway expression profiles converted from the gene expression profiles; it is referred to below as PEPC. To measure the similarity between gene or pathway expression profiles, we used the Spearman rank correlation test. The Spearman rank correlation coefficient between profiles X and Y is given by
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $d_i = x_i - y_i$, i = 1, ..., n, and $x_i$ and $y_i$ are the ranks of the ith gene or pathway in profiles X and Y, respectively. The Spearman rank correlation coefficient ranges from -1 to 1, where similarity is maximal at 1 and minimal at -1.
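As an illustration, the rank-difference form of the Spearman coefficient over the features shared by two profiles could be computed as follows. The function names are hypothetical. Tied values receive average ranks here, which is an assumption, and the rank-difference formula is exact only when there are no ties.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_similarity(profile_x, profile_y):
    """Spearman coefficient over the features (genes or pathways) common to both profiles."""
    common = sorted(set(profile_x) & set(profile_y))
    n = len(common)
    if n < 2:
        raise ValueError("need at least two shared features")
    rx = average_ranks([profile_x[k] for k in common])
    ry = average_ranks([profile_y[k] for k in common])
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "only_in_x": 7.0}
same_order = {"a": 10.0, "b": 20.0, "c": 30.0, "d": 40.0}
reversed_order = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
print(spearman_similarity(x, same_order), spearman_similarity(x, reversed_order))
```

Because only ranks are used, the two profiles do not need to be on the same measurement scale, which is one reason a rank-based coefficient is convenient for comparing profiles from different platforms.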
Cell type classification
To evaluate the performance of the similarity measuring methods, we conducted cell type classification using a nearest neighbor classifier. For each of the 40 query gene expression profiles, the similarities to all 402 gene expression profiles in the search database were calculated. The profile with the highest Spearman rank correlation coefficient was then predicted to have the same cell type as the query profile. A prediction was considered correct if the cell type of the predicted profile was the same as that of the query sample. If the search database contained no profile of the same cell type as a query profile, the search for that query was not counted in the classification accuracy assessment. Classification accuracy is calculated as the number of correct predictions divided by the number of predictions.
Searching for similar profiles among profiles from a different platform or a different experiment is harder than searching among profiles from the same platform or the same experiment. To evaluate the performance of the pathway-based similarity measuring method under these conditions, we conducted two more cell type classifications, cross-platform and cross-experiment classification, in which the search space consists of profiles whose platform or experiment differs from that of the query.
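The evaluation procedure can be sketched as a small nearest neighbor loop. The names, record layout and toy values below are illustrative, and the compact Spearman helper assumes no tied values; it is not the code used in the study.

```python
def rank(values):
    """1-based ranks, assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman coefficient over the features shared by two profiles."""
    common = sorted(set(x) & set(y))
    n = len(common)
    rx = rank([x[k] for k in common])
    ry = rank([y[k] for k in common])
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def classification_accuracy(queries, database, similarity):
    """Nearest neighbor accuracy; queries with no same-type profile are not counted."""
    correct = 0
    counted = 0
    for q in queries:
        if not any(p["cell_type"] == q["cell_type"] for p in database):
            continue  # no profile of this cell type in the search database
        best = max(database, key=lambda p: similarity(q["values"], p["values"]))
        counted += 1
        correct += best["cell_type"] == q["cell_type"]
    return correct / counted if counted else float("nan")

database = [
    {"cell_type": "liver", "values": {"g1": 1, "g2": 2, "g3": 3}},
    {"cell_type": "brain", "values": {"g1": 3, "g2": 2, "g3": 1}},
]
queries = [
    {"cell_type": "liver", "values": {"g1": 10, "g2": 20, "g3": 30}},  # top hit: liver
    {"cell_type": "brain", "values": {"g1": 5, "g2": 9, "g3": 7}},     # top hit: liver
    {"cell_type": "kidney", "values": {"g1": 1, "g2": 3, "g3": 2}},    # not counted
]
accuracy = classification_accuracy(queries, database, spearman)
print(accuracy)
```

Passing the similarity function as an argument lets the same loop be reused for gene-level and pathway-level profiles, since only the profile dictionaries differ.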
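One way to picture the restricted search spaces is as a filter applied to the database for each query. This is a minimal sketch: the `gse` and `gpl` keys, the mode names and the accession IDs are assumptions made for illustration.

```python
def restrict_database(query, database, mode):
    """Return the part of the search database allowed for this query."""
    if mode == "overall":
        return list(database)
    if mode == "cross-platform":
        # keep only profiles measured on a platform different from the query's
        return [p for p in database if p["gpl"] != query["gpl"]]
    if mode == "cross-experiment":
        # keep only profiles from an experiment different from the query's
        return [p for p in database if p["gse"] != query["gse"]]
    raise ValueError(f"unknown mode: {mode}")

# Toy records with made-up accession IDs.
query = {"gsm": "GSM_Q", "gse": "GSE_1", "gpl": "GPL_1"}
database = [
    {"gsm": "GSM_1", "gse": "GSE_1", "gpl": "GPL_1"},
    {"gsm": "GSM_2", "gse": "GSE_1", "gpl": "GPL_2"},
    {"gsm": "GSM_3", "gse": "GSE_2", "gpl": "GPL_1"},
    {"gsm": "GSM_4", "gse": "GSE_2", "gpl": "GPL_2"},
]
cross_platform = [p["gsm"] for p in restrict_database(query, database, "cross-platform")]
cross_experiment = [p["gsm"] for p in restrict_database(query, database, "cross-experiment")]
print(cross_platform, cross_experiment)
```

Because the filter depends on the query, each query may see a different search space, and some queries may have no profile of the same cell type left after filtering.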
Results and discussion
We conducted cell type classification using the two similarity measuring methods and assessed their performance with the overall, cross-platform and cross-experiment search databases.
The bar plot in Figure 1 summarizes all of the classification results. The pathway-based similarity measuring method, PEPC, consistently shows higher classification accuracy than the CGSEP method across the three search databases. For example, PEPC correctly classified the cell type of query profile GSM18935 (thalamus) with the overall search database (Table 1) and of query profile GSM12641 (liver) with the cross-platform search database (Table 2), whereas CGSEP failed. The pathway-based method shows a substantial improvement on the cross-platform search database, where PEPC exceeds CGSEP by 48.6% in accuracy. The pathway-based method also outperformed CGSEP by up to 10% for cross-experiment search, although the improvement is not as large as in cross-platform classification.
We next calculated the average similarity score of the top-scoring hit for correct and incorrect classification cases (Table 3). The average similarity score of correct cases is higher than that of incorrect cases, except for cross-platform search using the CGSEP method, in which CGSEP shows only 10% classification accuracy. Similarity scores for cross-platform search are lower than those of the other two classifications. This trend is caused by the lower expression variation between expression profiles from the same type of platform than between those from different types of platforms [14–16].
We analyzed the results further to find the reason for the low classification accuracy of cross-experiment search. More specifically, we asked why cross-experiment searches show lower classification accuracy than cross-platform searches even though their top-hit similarity scores are higher. To answer this question, we divided the cross-experiment search database into cross-experiment profiles from the same platform and cross-experiment profiles from different platforms, and conducted cell type classification with each of these two search databases. Table 4 summarizes the classification accuracy, together with the average similarity scores of correct and incorrect cases, for these limited search databases. With the pathway-based method, classification accuracy was again improved, by up to 40% compared with the original cross-experiment search, when the cross-experiment search was limited to cross-platform profiles, but this trend disappeared when searching the cross-experiment, same-platform database. In cross-experiment search, the average similarity scores of top hits from the same platform are higher than those of top hits from different platforms. Even the average similarity scores of incorrect cases from the same platform are higher than those of correct cases from different platforms. Therefore, correct profiles from a platform different from that of the query might not score higher than incorrect profiles from the same platform. This appears to be the reason for the low classification accuracy of cross-experiment search. Given this, applying different criteria for evaluating similarity scores according to platform type could improve the classification accuracy of cross-experiment search.
The reduced analytical dimension of pathway expression profiles relative to gene expression profiles might also contribute to the improved classification accuracy of the pathway-based methods. Not all genes in a gene expression profile are carried over to the pathway expression profile, because current pathway information is incomplete [12, 17]. For the 442 query and search database profiles used for cell type classification, on average 56 ± 15% of the genes in the common gene set used by the CGSEP method contributed to the common pathway expression profiles in PEPC. However, the reduced gene expression dimension does not reduce analytical sensitivity; rather, it has been reported that classification accuracy decreases when feature genes are added beyond a moderate number [18, 19]. Likewise, the reduced number of genes resulting from pathway expression profiling might increase analytical sensitivity by keeping the analytical dimension at a moderate size.
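The kind of coverage figure quoted above could be computed along the following lines. This is a sketch under the assumption that coverage means the fraction of genes shared by two profiles that belong to at least one pathway; the function name and the toy gene and pathway names are made up.

```python
def pathway_coverage(profile_x, profile_y, pathways):
    """Fraction of genes common to both profiles that appear in at least one pathway."""
    common_genes = set(profile_x) & set(profile_y)
    if not common_genes:
        return 0.0
    pathway_genes = set().union(*pathways.values())
    return len(common_genes & pathway_genes) / len(common_genes)

# Toy profiles: the common gene set is {b, c, d}; only b and c are in a pathway.
profile_x = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
profile_y = {"b": 5.0, "c": 6.0, "d": 7.0, "e": 8.0}
pathways = {"PATHWAY_1": ["b", "x"], "PATHWAY_2": ["c"]}

coverage = pathway_coverage(profile_x, profile_y, pathways)
print(round(coverage, 3))
```

Averaging this value over all query and database pairs would give a mean and standard deviation comparable in form to the percentage reported in the text.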
This is the first attempt to use pathway information for gene expression similarity measurement. Just as previously developed pathway-based gene expression analysis methods improved on analysis methods based on intact gene expression [20–23], the pathway-based similarity measuring method outperformed the conventional method. Along with the reduced analytical dimension described earlier, this improvement appears to result from averaging out the expression variation of individual genes, which has both biological and technical causes. A human gene is not expressed, or is not detected as expressed, at a constant level even under the same biological condition, whether within a specific microarray platform or across different types of platforms; rather, its expression fluctuates [24, 25]. On the other hand, pathway expression, the overall expression pattern of a gene set, is robust to subtle external stimulation. The pathway-based gene expression similarity measuring method suggested here, PEPC, computes pathway-level expression by averaging the expression of the genes mapped to each pathway.
Consequently, the expression variations of multiple genes are summarized by a robust pathway expression, which represents the activity of a functional unit rather than that of a single component of the unit. The pathway-based methods therefore yield higher classification accuracy. This demonstrates again that pathway-level expression is more robust than individual gene-level expression, and that pathway-based similarity scoring methods could improve similar gene expression profile search.
We demonstrated that our new gene expression similarity measuring method improves the precision of similar gene expression profile search when it is applied to cell type classification. The similarity measuring method based on pathway expression profiling outperformed the conventional method based on gene expression profiles by up to 50% for cross-platform profile search and 10% for cross-experiment profile search. At the same time, the classification accuracy shows that the methods still need to be improved, especially for searching similar profiles across different experiments. We believe that our research sheds new light on similar gene expression profile search over rapidly growing, large microarray databases by showing that integrating gene expression profiles with external data such as pathways can improve search accuracy.
Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.
CGSEP: Common Gene Set Expression Profile
Pathway Expression Profile for All Gene set
PEPC: Pathway Expression Profile for Common gene set.
Barrett T, Troup DB, Wilhite SE, et al: NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res. 2007, 35: D760-765. 10.1093/nar/gkl887.
Parkinson H, Kapushesky M, Kolesnikov N, et al: ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009, 37: D868-872. 10.1093/nar/gkn889.
Butte AJ, Kohane IS: Creation and implications of a phenome-genome network. Nat Biotechnol. 2006, 24: 55-62. 10.1038/nbt1150.
Butte AJ, Chen R: Finding Disease-Related Genomic Experiments Within an International Repository: First Steps in Translational Bioinformatics. AMIA Annu Symp Proc. 2006, 2006: 106-110.
Shah NH, Jonquet C, Chiang AP, et al: Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics. 2009, 10 (Suppl 2): S1. 10.1186/1471-2105-10-S2-S1.
Chen R, Mallelwar R, Thosar A, Venkatasubrahmanyam S, Butte A: GeneChaser: Identifying all biological and clinical conditions in which genes of interest are differentially expressed. BMC Bioinformatics. 2008, 9: 548. 10.1186/1471-2105-9-548.
Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y: GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics. 2008, 24: 2798-2800. 10.1093/bioinformatics/btn520.
Bassett DE, Eisen MB, Boguski MS: Gene expression informatics--it's all in your mine. Nat Genet. 1999, 21: 51-55. 10.1038/4478.
Hunter L, Taylor RC, Leach SM, Simon R: GEST: a gene expression search tool based on a novel Bayesian similarity metric. Bioinformatics. 2001, 17: S115-122. 10.1093/bioinformatics/17.2.115.
Horton PB, Kiseleva L, Fujibuchi W: RaPiDS: an algorithm for rapid expression profile database search. Genome Inform. 2006, 17: 67-76.
Fujibuchi W, Kiseleva L, Taniguchi T, Harada H, Horton P: CellMontage: similar expression profile search server. Bioinformatics. 2007, 23: 3103-3104. 10.1093/bioinformatics/btm462.
Subramanian A, Tamayo P, Mootha VK, et al: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.
Wheeler DL, Church DM, Federhen S, et al: Database resources of the National Center for Biotechnology. Nucleic Acids Res. 2003, 31: 28-33. 10.1093/nar/gkg033.
Järvinen A, Hautaniemi S, Edgren H, et al: Are data from different gene expression microarray platforms comparable?. Genomics. 2004, 83: 1164-1168. 10.1016/j.ygeno.2004.01.004.
Tan PK, Downey TJ, Spitznagel EL, et al: Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 2003, 31: 5676-5684. 10.1093/nar/gkg763.
Shi L, Tong W, Fang H, et al: Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics. 2005, 6 (Suppl 2): S12. 10.1186/1471-2105-6-S2-S12.
Racunas S, Shah N, Fedoroff N: A case study in pathway knowledgebase verification. BMC Bioinformatics. 2006, 7: 196. 10.1186/1471-2105-7-196.
Thomas RS, Pluta L, Yang L, Halsey TA: Application of genomic biomarkers to predict increased lung tumor incidence in 2-year rodent cancer bioassays. Toxicol Sci. 2007, 97: 55-64. 10.1093/toxsci/kfm023.
Thomas RS, O'Connell TM, Pluta L, et al: A comparison of transcriptomic and metabonomic technologies for identifying biomarkers predictive of two-year rodent cancer bioassays. Toxicol Sci. 2007, 96: 40-46. 10.1093/toxsci/kfl171.
Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002, 18 (Suppl 1): S233-240.
Tian L, Greenberg SA, Kong SW, et al: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA. 2005, 102: 13544-13549. 10.1073/pnas.0506577102.
Lee E, Chuang H, Kim J, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008, 4: e1000217. 10.1371/journal.pcbi.1000217.
Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet. 2002, 31: 19-20. 10.1038/ng0502-19.
Tan PK, Downey TJ, Spitznagel EL, et al: Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 2003, 31: 5676-5684. 10.1093/nar/gkg763.
Frantz S: An array of problems. Nat Rev Drug Discov. 2005, 4: 362-363. 10.1038/nrd1746.
Curtis RK, Oresic M, Vidal-Puig A: Pathways to the analysis of microarray data. Trends Biotechnol. 2005, 23: 429-435. 10.1016/j.tibtech.2005.05.011.
This work was supported by grants (07142KFDA696, 09172KFDA637) from Korea Food & Drug Administration.
This article has been published as part of BMC Genomics Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S3.
The authors declare that they have no competing interests.
CK designed the whole research process, implemented all methods required for the research, analyzed the results, and drafted and revised the manuscript. JHW devised the similarity scoring algorithm with CK and revised the manuscript. WSO and KTN helped conceptualize the research process. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Dataset details for cell type classification. All 442 gene expression profiles used for cell type classification are listed with detailed information. Each gene expression profile is annotated with three original GEO accession IDs, the sample ID (GSM), experiment ID (GSE) and platform ID (GPL), and with its cell type. (TXT 15 KB)
Keum, C., Woo, J.H., Oh, W.S. et al. Improving gene expression similarity measurement using pathway-based analytic dimension. BMC Genomics 10, S15 (2009). https://doi.org/10.1186/1471-2164-10-S3-S15