Density based pruning for identification of differentially expressed genes from microarray data
© Hu and Xu; licensee BioMed Central Ltd. 2010
- Published: 2 November 2010
Identification of differentially expressed genes from microarray datasets is one of the most important analyses in microarray data mining. Popular algorithms such as the statistical t-test rank genes based on a single statistic. The false positive rate of these methods can be reduced by considering other features of differentially expressed genes.
We propose a pattern recognition strategy for identifying differentially expressed genes. Genes are mapped to a two-dimensional feature space composed of the average difference of gene expression and the average expression level. A density based pruning algorithm (DB pruning) is developed to screen for potential differentially expressed genes, which are usually located in the sparse boundary region. Biases of popular algorithms for identifying differentially expressed genes are visually characterized. Experiments on 17 datasets from the Gene Expression Omnibus (GEO) database with experimentally verified differentially expressed genes showed that DB pruning can significantly improve the prediction accuracy of popular identification algorithms such as t-test, rank product, and fold change.
Density based pruning of non-differentially expressed genes is an effective method for enhancing statistical testing based algorithms for identifying differentially expressed genes. It improves t-test, rank product, and fold change by 11% to 50% in the number of identified true differentially expressed genes. The source code of DB pruning is freely available on our website http://mleg.cse.sc.edu/degprune
- Area Under Curve
- Microarray Dataset
- Average Expression Level
- Pruning Algorithm
- Rank Product
Statistical methods for identifying differentially expressed genes are now routinely used by biologists. There are two main categories of algorithms. The first category includes single-gene testing approaches such as fold change, rank product, and the t-test and its variants. These methods are characterized by a single statistical score used to rank genes from significantly differentially expressed to unchanged. The second category includes gene set testing approaches such as gene set enrichment analysis [4, 5]. These methods exploit externally determined gene sets to rank a group of genes. Their shortcoming is that in many cases such gene set information is not available, especially for under-studied species. Despite the increasing use of gene set analysis methods, single-gene identification algorithms for differentially expressed genes (DEGs) still dominate the practice of biological differential gene expression analysis [6–9] from microarray data. This is partly due to their simplicity and their minimal requirement for gene annotation. Thus, improving single-gene DEG identification algorithms still has great implications for DEG microarray analysis practice in biology. Currently, a major goal of DEG algorithm design is to reduce false positive and false negative errors, since experimental biologists can usually afford to test only a very limited number of predicted DEGs.
Biologically interesting DEGs are those genes whose changes in expression level are accompanied by significant phenotypic changes. Most current single-gene DEG identification algorithms, however, treat the task as a statistical significance testing problem without referring to the real characteristics of differentially expressed genes. Unfortunately, the limited number of samples in the microarray datasets of most biological studies makes such statistical tests ineffective [11, 12]. This issue has been addressed recently using multiple strategies. A popular strategy is to gather information across similar genes to improve DEG identification. This includes the Bayes t-test approach, the local pooled error algorithm, and the well-known SAM algorithm, among others. Another strategy is to use external information to improve variance estimation. Wille et al. proposed an external variance estimation algorithm called EVE, which exploits the relationships between variances of gene expression and gene function. Kim and Park proposed a normalization method that makes multiple microarray datasets from different chips comparable, which then facilitates the estimation of gene variance with those external datasets. Their method showed a large improvement over the basic regularized t-test algorithm in experiments with 1x1, 2x2 and 3x3 samples. Hackstadt and Hess investigated three filtering methods for pruning genes before statistical tests: MAS detection call, variance, and average signal. They showed that gene filtering by MAS detection call and average signal leads to increased performance of DEG identification. They also suggested that filtering 50% of probe sets is reasonable because the majority of genes are expected to be equally expressed.
Differentially expressed genes are those genes with a significant difference in expression levels among two or more classes/conditions. Such expression changes should not be caused by random variation in gene expression. DEGs resulting from a specific perturbation to the corresponding pathways tend to share some functional or physiological characteristics. It is thus justified that DEG identification can be improved by considering the characteristics of true DEGs. One characteristic is that true DEGs experimentally identified to date tend to have high average expression values across all conditions. The resulting WAD algorithm is very competitive compared to other DEG algorithms based on an evaluation over 38 GEO microarray datasets. In WAD, the product of the gene expression ratio and the relative gene expression level is used to rank genes. The limitation of this ranking scheme is that it is biased toward genes with balanced expression ratios and expression levels.
The main idea of the proposed DB pruning is that genes differentially expressed between two conditions are usually located in the boundary region of the 2-D feature space of average gene expression (AG) versus average difference of gene expression (AD). Figure 1 shows the distribution of true DEGs in this 2-D space for four datasets from the GEO database: GSE9499, GSE6342, GSE6740_1, and GSE6740_2. Because the boundary region is characterized by a scarcity of genes, a density based pruning algorithm is proposed here for pre-filtering non-DEGs located outside the boundary region, so that the false positive rate of current DEG algorithms can be reduced.
Density-based pruning algorithm for DEG identification
The main idea of density based pruning is to remove non-DEGs, which usually appear within the dense part of the AG-AD space. Assume M is a microarray matrix with N genes (rows) and P profiles (columns). Of the P profiles, P1 correspond to condition A and P2 = P - P1 correspond to condition B. The average expression level (AG) of a gene $X_i$ is defined as $(X_i^A + X_i^B)/2$, where $X_i^A$ and $X_i^B$ are the average (log-scaled) expression levels of gene $X_i$ under conditions A and B. The average difference of gene expression (AD) of a gene $X_i$ is defined as $|X_i^A - X_i^B|$. Since the expression values used in calculating $|X_i^A - X_i^B|$ are log-transformed, the average differences of expression calculated here are equivalent to the expression ratios used in the fold change method.
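The two features defined above can be computed per gene in a few lines. The following is a minimal sketch, not the authors' C++/Perl implementation; the function name `ag_ad` and the example values are hypothetical, and the inputs are assumed to be already log-scaled.

```python
from statistics import mean

def ag_ad(expr_a, expr_b):
    """Return (AG, AD) for one gene.

    expr_a, expr_b: log-scaled expression values of the gene in the
    profiles of condition A and condition B (hypothetical inputs).
    """
    mean_a = mean(expr_a)
    mean_b = mean(expr_b)
    ag = (mean_a + mean_b) / 2.0   # average expression level (AG)
    ad = abs(mean_a - mean_b)      # average difference (AD), a log ratio
    return ag, ad

# Hypothetical log2 values for one gene: 3 profiles in A, 2 profiles in B.
ag, ad = ag_ad([8.0, 8.5, 7.5], [10.0, 10.0])
print(ag, ad)  # 9.0 2.0
```

If the values are log2-scaled, an AD of 2.0 corresponds to a 4-fold change, which is why AD plays the role of the expression ratio in the fold change method.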
The density based pruning algorithm works as follows. Each gene $X_i$ is mapped into the (AG, AD) feature space. The pairwise Euclidean distance between two genes $X_i$ and $X_j$ is calculated, where i ≠ j. If the distance is smaller than a user-defined radius threshold R0, the two genes are declared neighbors. The number of neighbors ($n_i$) is then calculated for each gene. If $n_i$ ≥ N0, gene $X_i$ is pruned from the gene list, where N0 is a user-specified density parameter. The final output gene list is composed of genes that are mostly outliers located in the boundary region of the AG-AD feature space.
For different datasets, an important step of our algorithm is to determine appropriate parameters R0 and N0 such that all or most DEGs are kept in the final list and a maximum number of non-DEGs are pruned. Through our experiments, we found that N0 = 4 is an appropriate value for most of the 17 datasets used in our experiments. The value of the threshold radius R0 has a large effect on the number of pruned genes. The minimum value of R0 is 0 and the maximum value is the maximum AD value. A binary search procedure is used to identify the R0 value that generates the desired number (K) of candidate DEGs.
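The pruning rule can be sketched as a direct neighbor count. This is an illustrative sketch under the definitions in the text, not the released implementation; the function name `db_prune` and the toy points are assumptions. It compares all pairs, which is quadratic in the number of genes, so a real dataset with tens of thousands of genes would likely need a spatial index or a compiled implementation.

```python
import math

def db_prune(points, r0, n0):
    """Return indices of genes kept after density based pruning.

    points: list of (AG, AD) pairs, one per gene.
    r0: radius threshold; two genes closer than r0 are neighbors.
    n0: density parameter; a gene with at least n0 neighbors is pruned.
    """
    n = len(points)
    neighbors = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < r0:
                neighbors[i] += 1
                neighbors[j] += 1
    # Keep only genes in sparse regions (fewer than n0 neighbors).
    return [i for i in range(n) if neighbors[i] < n0]

# Toy example: five genes in a dense cluster plus one isolated gene.
toy = [(8.0, 0.1), (8.1, 0.1), (8.0, 0.2), (8.1, 0.2), (8.05, 0.15),
       (12.0, 3.0)]
print(db_prune(toy, r0=0.5, n0=4))  # [5]
```

In the toy example every cluster member has four neighbors within the radius and is pruned, while the isolated gene with high AG and AD survives.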
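The binary search for R0 relies on the fact that a larger radius can only add neighbors, so the number of surviving genes never increases as R0 grows. The sketch below illustrates this under stated assumptions: the search interval [0, max AD] comes from the text, while the helper names, the fixed iteration count, and the stopping rule are hypothetical choices made for the example.

```python
import math

def count_kept(points, r0, n0):
    """Number of genes with fewer than n0 neighbors within radius r0."""
    kept = 0
    for i, p in enumerate(points):
        close = sum(1 for j, q in enumerate(points)
                    if j != i and math.dist(p, q) < r0)
        if close < n0:
            kept += 1
    return kept

def search_r0(points, k, n0=4, iters=30):
    """Binary search for an R0 that keeps about k candidate genes.

    The iteration cap and the early exit on an exact match are
    assumptions of this sketch, not details given in the paper.
    """
    lo, hi = 0.0, max(ad for _, ad in points)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        kept = count_kept(points, mid, n0)
        if kept == k:
            return mid
        if kept > k:
            lo = mid   # too many survivors: enlarge the radius to prune more
        else:
            hi = mid   # too few survivors: shrink the radius
    return (lo + hi) / 2.0

# Toy example: a dense cluster of five genes and two isolated genes.
toy = [(8.0, 0.1), (8.1, 0.1), (8.0, 0.2), (8.1, 0.2), (8.05, 0.15),
       (12.0, 3.0), (4.0, 2.5)]
r0 = search_r0(toy, k=2)
print(r0, count_kept(toy, r0, 4))
```

When no radius yields exactly K survivors, the sketch returns the midpoint of the final interval, so the result should be read as an approximation of the desired list size.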
DEG identification algorithms
We tested four popular DEG identification algorithms on the 17 GEO datasets with or without DB pruning.
• Fold Change (FC) is one of the early DEG identification algorithms that is still widely used by biologists. It was recently recommended for use with a non-stringent P cutoff to generate more reproducible DEG lists [11, 23]. FC ranks genes based on the ratio of average gene expression under two conditions. Usually a 2-fold change is regarded as significant in many biological studies. A major criticism of FC is that it does not handle the case in which genes with low expression levels in both conditions, but with small variances, are ranked high.
• Rank Product (RP) [2, 24] ranks genes based on the product of rank ratios for multiple A-B conditions. RP is similar to FC in its results and simplicity but overcomes the most significant limitations of FC. It also provides a statistically rigorous estimation of significance. It was reported to have good performance for small or noisy datasets.
• T-statistics (tTest) is one of the earliest and most popular methods used in DEG identification. Its major advantage is that it considers the variation of genes in its ranking. Its limitation is that the estimation of gene expression variances is not reliable for small datasets, which can lead to poor performance.
• Weighted Average Difference (WAD) is a DEG algorithm based on the observation that experimentally verified true DEGs tend to have high expression levels across the conditions. Genes are ranked by the product of fold change and normalized expression level. It was shown to have significantly better and more robust performance than most other standard algorithms including FC, RP and tTest.
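To make the differences between these ranking criteria concrete, the sketch below computes three of the scores for log-scaled data. It is illustrative only: the Welch form of the t-statistic and the min-max normalization in `wad_like_scores` are assumptions of this example and may differ from the exact variants used in the experiments. Rank product is omitted because it needs ranks across replicate comparisons.

```python
import math
from statistics import mean, variance

def fc_score(a, b):
    """Absolute log fold change: difference of mean log expression."""
    return abs(mean(a) - mean(b))

def t_score(a, b):
    """Absolute Welch t-statistic (assumes nonzero sample variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se

def wad_like_scores(genes):
    """Fold change times min-max normalized average expression level.

    genes: list of (values_in_A, values_in_B) pairs, one per gene.
    Assumes the genes do not all share the same average expression.
    """
    avg = [(mean(a) + mean(b)) / 2.0 for a, b in genes]
    lo, hi = min(avg), max(avg)
    return [fc_score(a, b) * (x - lo) / (hi - lo)
            for (a, b), x in zip(genes, avg)]

# Hypothetical data: gene 0 is weakly expressed with a large fold change,
# gene 1 is highly expressed with a smaller fold change.
genes = [([2.0, 2.2], [4.0, 4.2]), ([10.0, 10.2], [11.5, 11.7])]
print([fc_score(a, b) for a, b in genes])
print([t_score(a, b) for a, b in genes])
print(wad_like_scores(genes))
```

In this example the fold change score favors the weakly expressed gene, whereas the expression-weighted score favors the highly expressed one, which is the kind of ranking bias discussed in this paper.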
For each of these methods, we compare their DEG prediction performance with or without DB pruning. Comprehensive evaluation is conducted on 17 real-world microarray datasets from GEO database with experimentally verified DEGs.
Data set preparation
17 datasets with 284 DEGs in total. Each dataset has 22283 genes.
The 17 datasets used here cover a variety of biological or medical studies: GSE1462 (mitochondrial DNA mutations), GSE1615_1 (valproic acid treatment), GSE1650 (chronic obstructive pulmonary disease), GSE2666_2 (bone marrow Rho level effect), GSE3524 (tumor of epithelial tissue), GSE3860 (Hutchinson-Gilford progeria syndrome), GSE4917 (breast cancer), GSE5667_1 (atopic dermatitis), GSE6236 (adult vs. fetal reticulocyte transcriptome comparison), GSE6344 (renal cell carcinoma disease), GSE6740_1 (HIV infection), GSE6740_2 (HIV infection, disease state), GSE7146 (hyperinsulinaemic, dose response), GSE7765 (dose response, DMSO or 100 nM dioxin), GSE8441 (dietary intake response), GSE9574 (breast cancer), and GSE9499 (hypomorphic germline mutations). The diversity of these datasets ensures that the observed performance of the proposed pruning algorithm is not due to some specific characteristics of the data.
Bias of DEG identification algorithms
DEG identification algorithms such as the t-test and fold change have different biases in their ranking schemes. Three factors are commonly used in their gene ranking criteria: r(y) = (d, e, v), where d is the difference of expression levels between two conditions, e is the overall expression level of the gene, and v is the variance of the gene's expression. T-statistics based algorithms may make false positive predictions for genes with low d because of small v. The fold change algorithm instead suffers from the fact that a gene with large variance tends to have a larger fold change. Both methods may make mistakes by neglecting the overall gene expression level, which is explicitly addressed by the WAD algorithm, which ranks genes by d × e. Indeed, when the expression level was considered, the WAD algorithm achieved significantly better prediction performance than all previous methods in extensive tests on 38 datasets with known true DEGs. This suggests that expression levels of true DEGs are usually high. It is thus interesting to visualize the bias of different DEG algorithms in the (d, e) feature space. For simplicity, the variance feature v is neglected, as it is not correlated with true DEGs as strongly as the (d, e) features.
Improving DEG identification algorithms using density based pruning
Effect of density based pruning of non-DEGs
Comparison of the number of missing true DEGs after DB pruning (N0 = 4, R0 = 0.0017; total genes: 22283).
Improvement of ranks of true DEGs in the rank list by DB pruning
Ranks of true DEGs in the original gene list and the pruned gene list. Genes are sorted by four DEG identification algorithms on the GSE1577 dataset. An improvement in the ranks of true DEGs means that DB pruning has correctly filtered out many non-DEGs.
Improving standard DEG algorithms using DB pruning
To evaluate the improvement in prediction performance of DEG identification algorithms with DB pruning, we used the receiver operating characteristic (ROC) curve. It is a graphical plot of the fraction of true positives (TPR = true positive rate) vs. the fraction of false positives (FPR = false positive rate) as K (the number of genes predicted to be DEGs) varies. We use the area under curve (AUC) value of the ROC curve as the criterion for comparison, as in previous work. To make the comparison relevant to real-world practice, we only plot and compare the AUC value with K varying from 1 to 1000 rather than to the total number of genes (22283) as done previously. The reason is that biologists rarely have the resources to check all 22283 genes and usually only care about the top K predicted DEGs for experimental verification.
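A partial AUC of this kind can be approximated by walking down the ranked list and accumulating area with the trapezoid rule. The sketch below is one plausible reading of the procedure; the function name, the use of whole-dataset totals as TPR and FPR denominators, and the absence of any rescaling of the partial area are assumptions of this example.

```python
def partial_auc(ranked_is_deg, n_true, n_total, k_max=1000):
    """Area under the ROC curve traced by the top k_max ranked genes.

    ranked_is_deg: booleans in rank order, True for a verified DEG.
    n_true: number of verified DEGs in the whole dataset.
    n_total: number of genes in the whole dataset.
    """
    n_false = n_total - n_true
    tp = fp = 0
    area = 0.0
    prev_fpr = prev_tpr = 0.0
    for is_deg in ranked_is_deg[:k_max]:
        if is_deg:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / n_false, tp / n_true
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
    return area

# Toy comparison on 10 genes with 2 true DEGs, looking at the top 4 only.
good = [True, True, False, False]   # true DEGs ranked first
bad = [False, False, True, True]    # true DEGs ranked after non-DEGs
print(partial_auc(good, 2, 10, k_max=4))  # 0.25
print(partial_auc(bad, 2, 10, k_max=4))   # 0.0
```

Because the curve is cut off at K, the partial area is small in absolute terms; only comparisons between methods at the same K are meaningful.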
Increase of AUC values for DEG algorithms after DB pruning: RP, WAD, FC, and tTest. Reported quantities: partial AUC (up to K=1000) and percentage of improvement.
Increase of No. of identified true DEGs out of top K predictions with or without DB pruning. Rp’, Wad’, tTest’, FC’ are algorithms with DB pruning. The total number of true DEGs of the 17 datasets is 284.
DB Pruning’s performance on microarray datasets with a small number of samples
The number of true DEGs predicted using partial samples from conditions A and B, with or without DB pruning. Rp’, Wad’, tTest’, FC’ are algorithms with DB pruning. The total number of true DEGs of the 17 datasets is 284.
First, the table shows that as K increases, more true DEGs are predicted. For each specific K, decreasing the number of samples reduces the number of true DEGs identified. For example, when the number of samples in each condition decreases from 4 to 2, the number of predicted true DEGs drops from 119 to 92 for WAD, from 60 to 45 for RP, from 62 to 43 for FC, and from 67 to 16 for tTest, which shows the largest reduction in performance. DB pruning is shown to significantly improve prediction performance, especially for tTest, RP and FC. With 4x4 samples, DB pruning helps tTest to identify 18 more true DEGs, a 26.8% improvement. When the sample size is reduced to 2x2, the improvement is 16, or a 50% improvement. The improvement for RP and FC is smaller, but still reaches 20% for FC with K=550 and 2x2 samples and 14% for RP with K=450 and 3x3 samples. These results indicate that DB pruning is a useful procedure for DEG identification.
We have proposed a density based pruning algorithm for removing non-differentially expressed genes with high confidence from the total gene list. This pruning procedure can significantly improve the prediction accuracy of popular DEG identification algorithms such as fold change, t-test, and rank product. The key idea of DB pruning is based on the observation that DEGs tend to have high average expression values across conditions.
In this paper, the gold standard true differentially expressed genes are those verified by the RT-PCR method, which may comprise only a portion of the true DEGs. The fact that most true DEGs used here show high average expression levels may be due to a technical limitation of RT-PCR and/or microarrays: only highly expressed genes can be identified. In that case, the claim for our method should be qualified: it is able to improve DEG identification algorithms for these types of true DEGs.
DB pruning has two parameters that must be set to pre-filter non-DEGs. Even though there is no theoretical guideline for choosing optimal values, the two parameters can be easily set to achieve significant improvements. Both parameters can be set such that an expected number of predicted DEGs is obtained. In our experiments, a single setting of R0 (= 0.0017) and N0 (= 4) was able to improve the DEG prediction accuracy for all 17 datasets. This demonstrates the stability of the algorithm with respect to its parameters across different datasets. In addition, an improved pruning algorithm based on the Pareto set concept is being developed, which can completely remove the parameters in DB pruning.
Several further improvements could follow from this pattern recognition based DEG identification. One common problem of DEG identification is the lack of sufficient data points for reliable estimation of gene expression levels and their differences. This usually hurts the performance of most DEG algorithms, including our pruning algorithm. One possibility is to use additional external datasets to help estimate gene expression levels and their differences for the dataset under study. Our preliminary experiments showed that estimating gene expression levels using external datasets is straightforward and feasible, but estimating differences of gene expression needs more study. Another improvement is to introduce additional features of DEGs, e.g. the variance of gene expression across multiple datasets. For example, the variance estimation method using multiple datasets can be combined with the DB pruning algorithm. Functional annotation information from gene ontology or pathways can also be integrated to aid gene pruning. The current DB pruning focuses on identifying DEGs between two groups. The extension to multiple groups is straightforward, since the calculation of the average expression level remains the same, and the average difference of expression can be defined as the sum of the average differences over pairwise comparisons.
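The multi-group extension mentioned above can be sketched as follows. This is a hypothetical illustration of the stated idea rather than an implemented feature: AG is taken here as the mean of the per-group means, which is an assumption, and AD is the sum of absolute differences over all pairs of groups.

```python
from itertools import combinations
from statistics import mean

def ag_ad_multi(groups):
    """Return (AG, AD) for one gene measured under several conditions.

    groups: list of lists of log-scaled expression values, one list per
    condition (hypothetical input layout).
    """
    means = [mean(g) for g in groups]
    ag = mean(means)  # assumed: average of the per-condition means
    # Sum of absolute differences over all pairwise comparisons.
    ad = sum(abs(x - y) for x, y in combinations(means, 2))
    return ag, ad

# Three hypothetical conditions with per-condition means 8, 9 and 11.
print(ag_ad_multi([[8.0, 8.0], [9.0, 9.0], [11.0, 11.0]]))
# With two conditions the definition reduces to the two-group case.
print(ag_ad_multi([[8.0, 8.5, 7.5], [10.0, 10.0]]))  # (9.0, 2.0)
```

Note that the sum grows with the number of groups, so AD values from experiments with different numbers of conditions would not be directly comparable without some rescaling.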
Our DB pruning is implemented using C++ and Perl and can be downloaded from http://mleg.cse.sc.edu/degprune.
This work is supported by National Science Foundation CAREER AWARD BIO-DBI-0845381
This article has been published as part of BMC Genomics Volume 11 Supplement 2, 2010: Proceedings of the 2009 International Conference on Bioinformatics & Computational Biology (BioComp 2009). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S2.
- Derisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997, 278: 680-686. 10.1126/science.278.5338.680.
- Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004, 573: 83-92. 10.1016/j.febslet.2004.07.055.
- Jeffery IB, Higgins DG, Culhane AC: Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics. 2006, 7.
- Nam D, Kim SY: Gene-set approach for expression pattern analysis. Briefings in Bioinformatics. 2008, 9: 189-197. 10.1093/bib/bbn001.
- Shi J, Walker MG: Gene set enrichment analysis (GSEA) for interpreting gene expression profiles. Current Bioinformatics. 2007, 2: 133-137. 10.2174/157489307780618231.
- Chen R, Morgan AA, Dudley J, Deshpande T, Li L, Kodama K, Chiang AP, Butte AJ: FitSNPs: highly differentially expressed genes are more likely to have variants associated with disease. Genome Biol. 2008, 9: R170. 10.1186/gb-2008-9-12-r170.
- Nevarez L, Vasseur V, Le DG, Tanguy A, Guisle-Marsollier I, Houlgatte R, Barbier G: Isolation and analysis of differentially expressed genes in Penicillium glabrum subjected to thermal stress. Microbiology. 2008, 154: 3752-3765. 10.1099/mic.0.2008/021386-0.
- Estler M, Boskovic G, Denvir J, Miles S, Primerano DA, Niles RM: Global analysis of gene expression changes during retinoic acid-induced growth arrest and differentiation of melanoma: comparison to differentially expressed genes in melanocytes vs melanoma. BMC Genomics. 2008, 9: 478. 10.1186/1471-2164-9-478.
- Satish L, Laframboise WA, O'Gorman DB, Johnson S, Janto B, Gan BS, Baratz ME, Hu FZ, Post JC, Ehrlich GD: Identification of differentially expressed genes in fibroblasts derived from patients with Dupuytren's Contracture. BMC Med Genomics. 2008, 1: 10. 10.1186/1755-8794-1-10.
- Datta S, Satten GA, Xia JZ, Heslin MJ, Datta S: An empirical bayes adjustment to increase the sensitivity of detecting differentially expressed genes in microarray experiments. Bioinformatics. 2004, 20: 235-242. 10.1093/bioinformatics/btg396.
- Shi L, Jones WD, Jensen RV, Harris SC, Perkins RG, Goodsaid FM, Guo L, Croner LJ, Boysen C, Fang H: The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies. BMC Bioinformatics. 2008, 9 (Suppl 9): S10. 10.1186/1471-2105-9-S9-S10.
- Yang K, Li J, Gao H: The impact of sample imbalance on identifying differentially expressed genes. BMC Bioinformatics. 2006, 7 (Suppl 4): S8. 10.1186/1471-2105-7-S4-S8.
- Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001, 17: 509-519. 10.1093/bioinformatics/17.6.509.
- Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK: Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics. 2003, 19: 1945-1951. 10.1093/bioinformatics/btg264.
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
- Wille A, Gruissem W, Buhlmann P, Hennig L: EVE (external variance estimation) increases statistical power for detecting differentially expressed genes. Plant Journal. 2007, 52: 561-569. 10.1111/j.1365-313X.2007.03227.x.
- Kim RD, Park PJ: Improving identification of differentially expressed genes in microarray studies using information from public databases. Genome Biology. 2004, 5.
- Hackstadt AJ, Hess AM: Filtering for increased power for microarray data analysis. BMC Bioinformatics. 2009, 10.
- Zhou Y, Cras-Meneur C, Ohsugi M, Stormo GD, Permutt MA: A global approach to identify differentially expressed genes in cDNA (two-color) microarray experiments. Bioinformatics. 2007, 23: 2073-2079. 10.1093/bioinformatics/btm292.
- Kadota K, Nakai Y, Shimizu K: A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms for Molecular Biology. 2008, 3.
- Dettling M, Gabrielson E, Parmigiani G: Searching for differentially expressed gene combinations. Genome Biol. 2005, 6: R88. 10.1186/gb-2005-6-10-r88.
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Research. 2007, 35: D760-D765. 10.1093/nar/gkl887.
- Kadota K, Nakai Y, Shimizu K: Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity. Algorithms for Molecular Biology. 2009, 4.
- Hong FX, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006, 22: 2825-2827. 10.1093/bioinformatics/btl476.
- Cai X, Giannakis GB: Identifying differentially expressed genes in microarray experiments with model-based variance estimation. IEEE Transactions on Signal Processing. 2008, 54: 2418-2426.
- Shaik JS, Yeasin M: A unified framework for finding differentially expressed genes from microarray experiments. BMC Bioinformatics. 2007, 8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.