- Research article
- Open Access
A gene selection method for GeneChip array data with small sample sizes
© Chen et al. licensee BioMed Central Ltd 2011
- Published: 23 December 2011
In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and decide cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods.
We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed from true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate.
Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
- False Discovery Rate
- Differentially Express Gene
- Equal Variance Assumption
- Real Microarray Data
- True False Discovery Rate
Microarray technology has been successfully used by biological and biomedical researchers to investigate gene expression profiles at the genome-wide level. Usually, the sample sizes are small compared to the number of genes to be investigated, making estimation of standard error for statistical tests very inaccurate. Furthermore, thousands of hypotheses (one corresponding to each gene or set of genes, in general) are tested at once, which greatly increases the probability of Type I error. This problem is also called the "multiple comparison problem" in hypothesis testing. A very small cutoff p-value is then needed to avoid picking a large number of false positives (FP); however, the price of that decision is failing to find many true positives whose p-values are larger than the cutoff value. When the sample sizes are extremely small, the problem worsens because as the sample size decreases so do the detection power and the ability to estimate p-values.
where d i is the difference of means under two conditions for gene i; se i is the estimated standard error for d i and s0 is a constant, which is used to avoid too large absolute values of regular t-statistics due to very small estimated standard errors.
When we use test statistics in (1), we will lose the information about the distribution of true nulls since we do not know the distribution of (1). To overcome this problem, permutation-based procedures have been proposed . One extensively used method in microarray data analysis is called SAM for "Significance Analysis of Microarray" . SAM uses test statistics in (1) and then permutes sample labels to estimate the p-value for each gene.
The absolute values of statistics in (1) are usually smaller than that of regular t-statistics. When sample sizes are extremely small, the total number of distinguished permutations is limited and, therefore, permutation-based methods, such as SAM, will have larger p-values than those from regular t-test, especially for differentially expressed (DE) genes. For example, in experiments where there are only three replicates for two conditions (a typical scenario) there exist only ten different available permutations. The coarseness of the possible selections creates a problem for finding a reasonable cut-off p-value.
To select DE genes, we use a cutoff p-value and pick those genes whose p-values are smaller than the given cutoff value. Understood in this process and in any gene selection is the trade-off between false positives (type I error) and false negatives (type II error). If we want to control family-wise error rate (FWER), we need a very small cutoff p-value that will fail to find many true positives. Some researchers have proposed a strategy of, instead of controlling FWER, controlling false discovery rate (FDR) to allow some FPs in the set of selected genes, but to control the mean of the ratio of number of FPs to the number of total declared DE genes [3–5]. To control FDR, we need to estimate the number and the distribution of true nulls, which is quite difficult. Since it is difficult to separate non-DE genes from DE genes when doing permutations, the resulting estimated number and the distribution of the p-values for true nulls may not be accurate. Although several improvements for SAM have been proposed [6–8], Qiu et al showed that the permutation-based methods may have large variance and, therefore, are not reliable . Yang and Churchill have noticed the problem of permutation-based methods when applied to small microarray experiments .
As part of SAM, Storey's FDR-controlling method has been proven to be more accurate than Benjamini and Hochberg's procedure and has been used extensively in microarray data analysis . They defined a quantity called q-value. Similar to p-value, "a q-value threshold can be phrased in practical terms as the proportion of significant features that turn out to be false leads" . Its R package, "qvalue," is publicly available . "qvalue" first estimates the q-value for each p-value (gene) based on all p-values and then calculates the cutoff p-value for a given cutoff q-value. Although the authors claimed that "qvalue" usually conservatively controls the FDR in that its true false discovery rate is smaller than the given cutoff q-value , Jung and Jang have found that it could also be anti-conservative for small cutoff q-values . In some cases, when the given cutoff q-values are small, "qvalue" may select very few or no DE genes.
In this paper, we show that when sample sizes are extremely small, the t-test has poor performance in terms of sensitivity and specificity and SAM (and "qvalue") may not be applicable due to the difficulty of controlling FDR for GeneChip array data. To circumvent those problems, we propose a new model-based method we call model-based information sharing method (MBIS). To evaluate the performance of our new method, we compare it with others by using both simulation data and real data.
Fold change, equal variance, and data transformation
The ratio of the expression levels across two conditions is called fold change (FC); it has been used in the early comparative experiments [13, 14]. This criterion is arguable since, depending on the decision-makers, choosing cutoff FC is arbitrary. Furthermore, the FC method does not take into account the variability with gene expression measurements, or, even worse, it assumes that the variability for all expression measurements is the same, which is likely to be false for most gene expression experiments. However, FC criteria have their own advantages. First, they are biologically meaningful and easily interpreted. Second, more importantly, many studies have shown that FC-based methods, if used appropriately, outperform other methods [15–19].
One way to obtain equal variance from gene to gene is to transform the data, usually with a logarithmic transformation. After this transformation, a FC (log scale) can be calculated from the difference of means across two conditions. However, different data sets may require different variance-stabilization transformations. Several variance-stabilization and normalization transformation methods, which try to transform expression values to be equal variance and normally distributed for each gene, have been proposed [19–23].
Model-based information sharing (MBIS)
MBIS makes the assumption that an appropriate data transformation is available and has been applied to the raw gene expression data. This transformation has furthermore stabilized the variance. Therefore, the variance for each gene is a constant, denoted by s2, after transformation. If we can estimate s2 from data, then we can calculate p-value easily for each gene.
Estimation of s2
where d i is the difference of the means for gene i across two conditions and Φ(.) is the cumulative distribution function (CDF) of the standard normal distribution.
Estimation of total number of non-DE genes G0
For a given value μ (0 <μ < 1), we count the number (N u ) of genes with p-values greater than or equal to μ. Then an estimate of G0 is N μ /(1-μ). To reduce the influence of DE genes since they have relatively small p-values, a relatively large μ is preferable. We can also use a vector of μ's and calculate the corresponding estimated 's and then take their (weighted) mean as the final estimate for G0.
Gene selection and estimations for false positives and FDR
For a given cutoff p-value, p0, we pick those genes with p-values smaller than p0 as DE genes. Suppose S genes are selected. Then we can estimate the number of false positives, , and the false discovery rate, .
SAM, t-test and q-value
For the SAM method, we use the R package, SAMr , and choose different values for s0.perc (percentile of estimated se's): -1 (t-test only, i.e. s0 = 0 in (1)), 20, 40, 60, 80 and 100. SAM will calculate p-values by permutation. For the t-test method, we calculate p-values from the regular t-test statistics (i.e. s0 = 0 in (1)) without permutation. We then use the calculated p-values for each method as the input for R package "qvalue" and then get the output of selected DE genes with different preset q-values.
To restrict ourselves to small experiments, we assume the sample sizes for both conditions are 3, 5 and 8. We simulate 10,000 genes with normal distributions for two conditions. For non-DE genes, we assume they are normally distributed with a mean equal to 0; for DE genes, their absolute mean difference is uniformly distributed: with three ranges representing different degrees of differential expression: U(1,3), low, U(3,6), middle, and U(6,9), high. We assume the standard deviations are uniformly distributed as U(1,b), where b is greater than or equal to one. In the ideal situation, i.e. equal variance, b = 1. However, even after trying several variance-stabilization transformations, sometimes this assumption may be too strong for real data, and we therefore choose different b's in our simulations: b = 1, 1.5 and 2. In other words, we simulate data with equal or near equal variance. The proportion of DE genes among all genes may also affect the gene selection results; we then choose three levels of proportions: 0.1, 0.3 and 0.5 (i.e. the numbers of DE genes are 1000, 3000 and 5000, respectively). The output of selected genes from "qvalue" for each method with different preset cutoff q-values: 0.05, 0.10, 0.15, 0.20 and 0.25, are compared.
Real data set
We use Affymetrix GeneChip data sets selected from the GSE2350 series , downloaded from the NCBI GEO database  to compare our new method with others. We use the first three samples from both "control" (GSM44051, GSM44052 and GSM44053) and "CD40L treatment" (GSM44057, GSM44058 and GSM44059) groups. For the raw intensity data, we use the "rma" function in R package "affy"  to do background correction, normalization, and summarization . Then we apply different methods to the summarized expression values (already on log base 2 scale) to estimate p-values that are the input for the "qvalue."
To see which method gives more biologically meaningful results, we use the web-based tool, CLASSIFI algorithm [27–29], that uses Gene Ontology (GO)  annotation to classify groups of genes defined by gene cluster analysis using the statistical analysis of GO annotation co-clustering. We compare the median p-values of "topfile" from the output of CLASSIFI. In general, the smaller the p-value is, the more reasonable the results in terms of GO classification .
Simulation results of numbers of TPs, and FPs from different methods (nde = 1000, rep = 3, b = 1.5, diff = c(3,6))
S0 = 0
For the SAM methods with various s0.perc, when the preset q-value is small, we failed to get any true positives. For example, when given q-value 0.1, none of the SAM methods can get any true positives. Interestingly, when the given q-value is small, a regular t-test performs better than a t-test with a permutation in SAM; this implies permutation-based methods are not appropriate in this situation. Table 1 also indicates that SAM methods are usually conservative, as the authors of "qvalue" claimed . However, it is not the case for MBIS and regular t-test. In general, the observed false discovery rates (Obs. FDR in Table 1) from MBIS and regular t-test methods are larger than the preset q-values, while SAM methods are usually too conservative and need large q-values to get a reasonable proportion of true positives. For different setups in our simulations, we obtained similar comparison results.
Results from real data set
For the real data set, we use MBIS, regular t-test, and SAM to calculate the p-values for each gene and then use "qvalue" to select DE genes with cutoff q-values equal to 0.01, 0.025, 0.05, 0.075 and 0.1, respectively. By using "qvalue," we calculate the corresponding cutoff p-values from each cutoff q-value for these three methods. Since we know the distributions of nulls from MBIS and t-test (they have a uniform distribution for the p-values of nulls), and we can also estimate the number of true negatives for a given cutoff p-value, we can estimate the number of false positives and the false positive rates.
Results from real data for given cutoff q-values
# DE genes
The selected gene sets from MBIS and the t-test are usually different. For example, when the cutoff q-value is equal to 0.05, MBIS and the t-test select 5550 and 4748 genes, respectively; the number of common genes by these two methods is 3694. In other words, about 1000 genes are selected by the t-test that are not in the list from the MBIS. However, SAM selected genes also usually selected by MBIS.
From the CLASSIFI output with cutoff q-value 0.05, the median p-values (-log10 scale) are 15.30, 7.05 and 6.01 for MBIS, SAM, and t-test, respectively, indicating that SAM performs better than the t-test but worse than MBIS in terms of co-clustering for genes with similar function according to GO.
Results from real data for given cutoff p-values
# DE genes
In a typical microarray experiment, the number of genes, G, is usually between 10K and 50K, indicating that the variance in (9) is very close to 0 and the estimated value in (2) is close to the true value; therefore a normal distribution is appropriate to approximate the mean differences of the true nulls.
In comparing (7) with (9), we can see that, while the regular t-test method gives a much larger variance for each estimated variance (each individual t-test will lose two degrees of freedom due to variance estimation), MBIS, a method that utilizes information among genes, has a more precise estimate for the common variance. Therefore, MBIS always outperforms the t-test.
On the other hand, the Chi-square distribution is right skewed, implying that its mean is larger than its median. If 's have a Chi-square distribution, they are more likely to have estimated values less than the mean (true value) than estimated values greater than the mean. In other words, are more probable to underestimate than overestimate the constant variance. Therefore many true nulls may have very small p-values from a t-test only because they have small estimated standard errors. This explains why there are so many FPs from t-test in our simulations; and consequently t-test selects so many different DE genes than SAM and MBIS do in real data. Because of the same reason, adding a common number to each individual se i in (1) will potentially decrease the bias (for small s0.perc in SAM) and/or decrease the relative difference of estimated variances for most genes; therefore SAM usually improves the test statistics, although still not as favorably as MBIS. This explains why SAM performs better than t-test but worse than MBIS in terms of sensitivity and specificity.
When sample sizes are extremely small, as we mentioned before, SAM will have relatively larger p-values due to a limited number of permutations available, affecting the estimation of q-values by "qvalue". "qvalue" does not perform very well in this situation. For a given cutoff q-value, the corresponding cutoff p-value calculated by "qvalue" could be too large (as seen in the results from t-test and MBIS in simulation and real data) or too conservative (as in the results from SAM), a finding consistent with those from Jung and Jang .
Another difficulty for "qvalue" is that the number of selected genes can be very sensitive to the cutoff q-value, especially the very small preset q-value (see Table 2), that is desirable in practice; in this situation, SAM even performs worse than the regular t-test in terms of proportion of the DE genes selected. This raises the question of how to choose an appropriate q-value in practice to which there is no absolute answer. Sometimes, even for large q-values (as seen in the results from SAM in Table 1), the "qvalue" gives us a small proportion of true positives; on the other hand, we could select a large number of genes with a small q-value (as seen in the results from MBIS and t-test for real data in Table 2). We recommend that in this situation (small sample sizes), instead of using q-value only, one should choose a cutoff p-value to select DE genes first and then estimate FDR if desired.
Although we assume equal variance in the MBIS, we also evaluate this new method under situations when this assumption is violated. By simulation, we have shown that, when the variances of gene expressions are near constant, MBIS still outperforms both the t-test and SAM, making our method applicable in various situations.
From our experience, variances estimated from raw expression data are highly variable. We should transform data before applying MBIS. Several variance-stabilization and normalization transformation procedures, such as logarithm, Box-Cox transformation, generalized logarithm , variance stabilization  and data-driven Haar-Fisz transformation for microarrays (DDHFm) , are already available. In addition, choosing appropriate preprocessing procedures (background correction, normalization and summarization) is also very important for downstream analyses, including gene selection [16, 26, 31–34].
For microarray data with extremely small sample sizes, a modified t-test like SAM performs better than a regular t-test in terms of sensitivity and specificity. However, to control FDR, for small preset q-values, SAM fails to select enough true positives and performs worse than the t-test. To circumvent this problem, we propose a model-based information sharing method (MBIS) that uses information shared by genes. We show, using both simulation and real microarray data, that this new method outperforms the t-test and SAM.
The authors thank Ms. Linda Harrison and Ms. Kimberly Lawson for their editorial assistance. ZC would like to thank the support from the NIH grant (UL1 RR024148), awarded to the University of Texas Health Science Center at Houston.
- Efron B, Tibshirani R, Storey JD, Tushe V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 1151-1160. 10.1198/016214501753382129.View ArticleGoogle Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 5116-5121. 10.1073/pnas.091062498.PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995, 57: 289-300.Google Scholar
- Storey J: A direct approach to false discovery rates. J R Statist Soc B. 2002, 64: 479-498. 10.1111/1467-9868.00346.View ArticleGoogle Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100 (16): 9440-9445. 10.1073/pnas.1530509100.PubMed CentralView ArticlePubMedGoogle Scholar
- Pounds S, Cheng C: Improving false discovery rate estimation. Bioinformatics. 2004, 20 (11): 1737-1745. 10.1093/bioinformatics/bth160.View ArticlePubMedGoogle Scholar
- Wu B: Differential gene expression detection using penalized linear regression models: the improved SAM statistics. Bioinformatics. 2005, 21: 1565-1571. 10.1093/bioinformatics/bti217.View ArticlePubMedGoogle Scholar
- Yang H, Churchill G: Estimating p-values in small microarray experiments. Bioinformatics. 2007, 23 (1): 38-43. 10.1093/bioinformatics/btl548.View ArticlePubMedGoogle Scholar
- Qiu X, Xiao Y, Gordon A, Yakovlev A: Assessing stability of gene selection in microarray data analysis. BMC Bioinformatics. 2006, 7: 50-10.1186/1471-2105-7-50.PubMed CentralView ArticlePubMedGoogle Scholar
- Bioconductor. [http://www.bioconductor.org]
- Storey J, Taylor JE, Siegmund D: Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc B. 2004, 66: 87-205. 10.1111/j.1467-9868.2004.00439.x.View ArticleGoogle Scholar
- Jung S, Jang W: How accurately can we control the FDR in analyzing microarray data?. Bioinformatics. 2006, 22: 1730-1736. 10.1093/bioinformatics/btl161.View ArticlePubMedGoogle Scholar
- DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997, 278 (5338): 680-686. 10.1126/science.278.5338.680.View ArticlePubMedGoogle Scholar
- Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW: Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci USA. 1996, 93 (20): 10614-10619. 10.1073/pnas.93.20.10614.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen DT, Chen JJ, Soong SJ: Probe rank approaches for gene selection in oligonucleotide arrays with a small number of replicates. Bioinformatics. 2005, 21 (12): 2861-2866. 10.1093/bioinformatics/bti413.View ArticlePubMedGoogle Scholar
- Chen Z, McGee M, Liu Q, Scheuermann RH: A distribution free summarization method for Affymetrix GeneChip arrays. Bioinformatics. 2007, 23 (3): 321-327. 10.1093/bioinformatics/btl609.View ArticlePubMedGoogle Scholar
- Hong F, Breitling R: A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008, 24 (3): 374-382. 10.1093/bioinformatics/btm620.View ArticlePubMedGoogle Scholar
- Kim S, Lee J, Sohn I: Comparison of various statistical methods for identifying differential gene expression in replicated microarray data. Stat Methods Med Res. 2006, 15: 3-20. 10.1191/0962280206sm423oa.View ArticlePubMedGoogle Scholar
- Zhou L, Rocke DM: An expression index for Affymetrix GeneChips based on the generalized logarithm. Bioinformatics. 2005, 21 (21): 3983-3989. 10.1093/bioinformatics/bti665.View ArticlePubMedGoogle Scholar
- Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics. 2002, 18 (Suppl 1): S105-110. 10.1093/bioinformatics/18.suppl_1.S105.View ArticlePubMedGoogle Scholar
- Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002, 18 (Suppl 1): S96-104. 10.1093/bioinformatics/18.suppl_1.S96.View ArticlePubMedGoogle Scholar
- Motakis ES, Nason GP, Fryzlewicz P, Rutter GA: Variance stabilization and normalization for one-color microarray data using a data-driven multiscale approach. Bioinformatics. 2006, 22 (20): 2547-2553. 10.1093/bioinformatics/btl412.View ArticlePubMedGoogle Scholar
- Rocke DM, Durbin B: A model for measurement error for gene expression arrays. J Comput Biol. 2001, 8 (6): 557-569. 10.1089/106652701753307485.View ArticlePubMedGoogle Scholar
- Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A: Reverse engineering of regulatory networks in human B cells. Nat Genet. 2005, 37 (4): 382-390. 10.1038/ng1532.View ArticlePubMedGoogle Scholar
- NCBI GEO Database. [http://www.ncbi.nih.gov/projects/geo]
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19 (2): 185-193. 10.1093/bioinformatics/19.2.185.View ArticlePubMedGoogle Scholar
- CLASSIFI. [http://pathcuric1.swmed.edu/pathdb/classifi.html]
- Kong M, Chen Z, Qian Y, Cai J, Lee J, Rab E, McGee M, Scheuermann R: Use of gene ontology as a tool for assessment of analytical algorithms with real data sets: impact of revised affymetrix CDF annotation. In 7th International Workshop on Data Mining in Bioinformatics August 12th 2007; San Jose. Edited by: Chen JY, Lonardi A, Zaki M. 2007, 60-68.Google Scholar
- Lee JA, Sinkovits RS, Mock D, Rab EL, Cai J, Yang P, Saunders B, Hsueh RC, Choi S, Subramaniam S, Scheuermann RH: Components of the antigen processing and presentation pathway revealed by gene expression microarray analysis following B cell antigen receptor (BCR) stimulation. BMC Bioinformatics. 2006, 7: 237-10.1186/1471-2105-7-237.PubMed CentralView ArticlePubMedGoogle Scholar
- The Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11 (8): 1425-1433. 10.1101/gr.180801.View ArticleGoogle Scholar
- Chen Z, McGee M, Liu Q, Kong M, Deng Y, Scheuermann RH: A distribution-free convolution model for background correction of oligonucleotide microarray data. BMC Genomics. 2009, 10 (Suppl 1): S19-10.1186/1471-2164-10-S1-S19.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen Z, McGee M, Liu Q, Kong YM, Huang X, Yang JY, Scheuermann RH: Identifying differentially expressed genes based on probe level data for GeneChip arrays. Int J Comput Biol Drug Des. 2010, 3 (3): 237-257. 10.1504/IJCBDD.2010.038028.View ArticlePubMedGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249-264. 10.1093/biostatistics/4.2.249.View ArticlePubMedGoogle Scholar
- McGee M, Chen Z: Parameter estimation for the exponential-normal convolution model for background correction of affymetrix GeneChip data. Stat Appl Genet Mol Biol. 2006, 5: Article24-PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.