Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent
© Gaile and Miecznikowski; licensee BioMed Central Ltd. 2007
Received: 25 August 2006
Accepted: 19 April 2007
Published: 19 April 2007
We provide a re-analysis of the Golden Spike dataset, a first generation "spike-in" control microarray dataset. The original analysis of the Golden Spike dataset was presented in a manuscript by Choe et al. and raised questions concerning the performance of several statistical methods for the control of the false discovery rate (across a set of tests for differential expression). These original findings are now in question as it has been reported that the p-values associated with the tests of differential expression for null probesets (i.e., probesets designed to be fold change 1 across the two arms of the experiment) are not uniformly distributed. Two recent publications have speculated as to the reasons the null distributions are non-uniform. A publication by Dabney and Storey concludes that the non-uniform distributions of null p-values are the direct consequence of an experimental design which requires technical replicates to approximate biological replicates. Irizarry et al. identify four characteristics of the feature level data (three related to experimental design and one artifact). Irizarry et al. argue that the four observed characteristics imply that the assumptions common to most pre-processing algorithms are not satisfied and hence the expression measure methodologies considered by Choe et al. are likely to be flawed.
We replicate and extend the analyses of Dabney and Storey and present our results in the context of a two stage analysis. We provide evidence that the Stage I pre-processing algorithms considered in Dabney and Storey fail to provide expression values that are adequately centered or scaled. Furthermore, we demonstrate that the distributions of the p-values, test statistics, and probabilities associated with the relative locations and variabilities of the Stage II expression values vary with signal intensity. We provide diagnostic plots and a simple logistic regression based test statistic to detect these intensity related defects in the processed data.
We agree with Dabney and Storey that the null p-values considered in Choe et al. are indeed non-uniform. We also agree with the conclusion that, given current pre-processing technologies, the Golden Spike dataset should not serve as a reference dataset to evaluate false discovery rate controlling methodologies. However, we disagree with the assessment that the non-uniform p-values are merely the byproduct of testing for differential expression under the incorrect assumption that chip data are approximate to biological replicates. Whereas Dabney and Storey attribute the non-uniform p-values to violations of the Stage II model assumptions, we provide evidence that the non-uniformity can be attributed to the failure of the Stage I analyses to correct for systematic biases in the raw data matrix. Although we do not speculate as to the root cause of these systematic biases, the observations made in Irizarry et al. appear to be consistent with our findings. Whereas Irizarry et al. describe the effect of the experimental design on the feature level data, we consider the effect on the underlying multivariate distribution of putative null p-values. We demonstrate that the putative null distributions corresponding to the pre-processing algorithms considered in Choe et al. are all intensity dependent. This dependence serves to invalidate statistical inference based upon standard two sample test statistics. We identify a flaw in the characterization of the appropriate "null" probesets described in Choe et al. and we provide a corrected analysis which reduces (but does not eliminate) the intensity dependent effects.
Normalization of microarray data is essential for removing systematic variation and biases that are present due to the nature of the assay. In experiments where the goal is to determine differential expression scientists have developed a variety of tests and algorithms to identify differentially expressed genes. One such experiment was the "Golden Spike" experiments by . In the experiment six Affymetrix chips were divided into two groups: a control group (C) and a spike group (S). The S sample contains the same cRNAs as the C sample, except for ten selected groups of approximately 130 cRNAs per group that are present at a defined increased concentration compared to the C sample. This results in 3860 cRNAs, where 1309 cRNAs are spiked in with differing concentrations between the S and C samples. The rest (2551) are present at identical relative concentration between the two sets of microarrays. This type of experiment models the general paradigm of experiments meant to detect differential expression. Recently, however, the validity of inference based upon the Golden Spike experiment has been questioned .
In microarray experiments, the end-stage analysis usually consists of simple two-sample test statistics such as the t-statistic or the Wilcoxon Rank Sum test statistic to test for differential expression. However, it is important to note that these statistics generally operate upon data matrices which have been subjected to potentially significant amounts of pre-processing. With this technology, there are several steps required in order to process the data in order to achieve a single value representing the intensity for a given probe. It is worthwhile to consider the Affymetrix data acquisition in two stages. A Stage I analysis includes image processing where each spot is deemed to consist of a collection of pixels. From the collection of pixels at a spot an overall signal value is determined by taking a summary measure (often a median) of the pixel set at each hybridization location on the chip. In the Affymetrix data design there are 11 probe pairs spotted for each gene or SNP. Each probe pair contains two 25-mer DNA oligonucleotide probes; the perfect match (PM) probe matches perfectly to the target RNA, and the mismatch (MM) probe which is identical to its PM partner probe except for a single homomeric mismatch at the central base-pair position. The MM probe serves to estimate the nonspecific signal. In this stage, the PM and MM signals are combined into one score representing the expression signal for a specific probe. The major software packages for Stage I analysis include Bioconductor's "Affy" package, dChip and MAS 5.0 executables . Each software package varies in how the image processing is performed and how the PM and MM values are combined. After obtaining a signal for each probe, the next step in the Stage I analysis is to "normalize" the data accounting for between chip effects, spatial effects, intensity effects, a possible grid effect, and any nonlinear intensity/variation effects. Popular normalizing methods include lowess and loess smoothers to remove systematic sources of noise [3, 4].
The Stage I analysis often involves a matrix of dimension p by m where the p rows refer to the different probes, and the m columns refer to the different chips. The general procedure in normalizing this data is to use loess smoothers on the data set. One of the motivations for the Golden Spike experiment was to examine the numerous and varied normalization methods that currently exist for this data. Most of the normalization methods consider the data as a function of the matrix column. The goal of any of these normalization schemes is to reduce the systematic variation that exists in each chip. By considering each column of the data matrix as a separate chip, in each column we can scale and center the values, via loess smoothers so that each column has roughly the same "center" and "scale." This general approach (as discussed in ) does not deal well with nonlinear relationships between arrays. Another method from  is to transform the data via quantile regression so that the distribution of probe intensities is roughly the same across arrays. At this stage, the normalization should result in a dataset where the systematic variation is reduced in order to get a clearer glimpse of the biological variatio that is present in these experiments.
Ultimately, the Stage I analysis results in an X matrix of dimension p by m for each experiment where the p rows correspond to each (smoothed) probe value and the m columns correspond to the sample. In the Golden Spike datasets, numerous options in the Stage I analysis were examined, resulting in 152 different X matrices with each matrix corresponding to a different set of parameters in a Stage I analysis. From these 152 datasets, 10 "best" datasets were chosen that represented the best combination of processing in terms of detecting approximately 95 percent of true differentially expressed genes (DEGs) with changes greater than twofold, but less than 30 percent with changes below 1.7 fold before exceeding a 10 percent false-discovery rate. At this point, each data matrix X represents the input for Stage II analysis. The goal of Stage II analysis is to answer the researcher's questions of the experiment. Usually in the microarray setting this consists of a ranked list of genes determined to be differentially expressed between two groups such as treatment versus control. The methods of Stage II generally take in to account facets of the experimental design and allow the user to control for things like the false discovery rate (FDR) within a given two sample test environment. Often the validity of Stage II analyses depends upon the assumption that the Stage I analysis has provided a Stage II data matrix such that the two sample test statistic null p-values are uniformly distributed.
Dabney and Storey  provide a re-analysis of the Golden Spike dataset in which they consider the most common choices for the Stage II analysis (e.g., t-test, permutation t-test, and Wilcoxon Rank Sum test) and demonstrate that the null p-values for the Choe et al. 10 best datasets were non-uniform in all cases. The authors note that statistical methods to control the FDR require the assumption that the true null p-values are uniformly distributed and hence the Golden Spike data can not be utilized to assess the performance of such methods. Furthermore, Dabney and Storey conclude that the non-uniform distributions of p-values are the direct consequence of an experimental design which requires that technical replicates adequately approximate biological replicates. The authors provide simulation results which demonstrate that technical replicates analyzed as biological replicates can provide non-uniform null p-value distributions but fail to provide any evidence that the parameter values that evoke this behavior are consistent with the set of conditions under which the Golden Spike experiment was conducted. Presumably the reader is left to infer that because the null p-values are non-uniform and because technical replicates analyzed as biological replicates can provide non-uniform null distributions, then the technical replicates generated by the Golden Spike experiment do not adequately approximate biological replicates.
We have replicated and extended the analyses of Dabney and Storey and we agree with the assessment that the null p-values are indeed non-uniform. We also agree with the conclusion that, given current pre-processing (i.e., Stage I) technologies, the Golden Spike datasets should not serve as reference datasets to evaluate FDR controlling methodologies. However, we disagree with the assessment that the non-uniform p-values are merely the byproduct of testing for differential expression under the assumption that chip data are approximate to biological replicates when, in fact, they are not. Whereas Dabney and Storey attribute the non-uniform p-values to violations of the Stage II model assumptions, we provide evidence that the non-uniformity can be attributed to the failure of the Stage I analyses to correct for systematic biases in the raw data matrix.
A recent article by Irizarry et al.  identifies four characteristics of the feature level data (three related to experimental design and one artifact) which offer a possible explanation for the inconsistencies between the conclusions presented in  and . Irizarry et al. argue that the four observed characteristics imply that the assumptions common to most pre-processing algorithms are not satisfied and hence the expression measure methodologies considered in  are likely to be flawed. Whereas Irizarry et al. describe the effect of the experimental design on the feature level data, we consider the effect on the underlying multivariate distribution of putative null p-values. Specifically, we demonstrate that the 10 best Stage I analyses considered in  and  provide Stage II data matrices in which the columns are neither adequately centered nor adequately scaled. Further we note that the observed deviations in centering and scaling are intensity dependent. The intensity dependence of the Stage II data values leads to putative null distributions which are intensity dependent and hence non-uniform. We provide simple diagnostic plots which indicate that the relative center and scales of the underlying distributions for the control and spike-in expression values vary as a function of signal intensity.
Although the scope of this manuscript is in large part restricted to the re-analysis of the Golden Spike dataset, we also apply several of the same diagnostic plots to another Affymetrix spike-in experiment. The results suggest that some of the intensity dependent effects may exist in other settings. We relegate the extensive application and the continued development of such diagnostics to future research.
Results and Discussion
Re-analysis of the Golden Spike Dataset
Observed p-value Distributions Inconsistent with Model of Dabney and Storey
Dabney and Storey attributed the non uniform distribution of p-values to the fact that the Golden Spike experimental design requires technical replicates to masquerade as biological replicates. In their response to , the authors of  acknowledged that the three spike-in and three control chips were technical replicates but they argued that the differences in the relative concentrations of the fold change one genes within the master spike-in sample (i.e., prior to splitting into three samples) compared to those in the master control sample should have had a negligible impact on the observed expression values.
Dabney and Storey proposed the following model for i genes, i = 1, 2,...m, j treatments, j = C,S and k technical replicates, k = 1, 2, 3 and the Stage II expression data matrix X:
X ijk = μ ij + ε ij + φ ijk (1)
where s S and s C represent the sample standard deviation of probe i under condition S and C respectively. It follows from (3) that a t-test statistic calculated with respect to random variables governed by model (1) constitutes an observation from a distribution which is heavier in the tails than a t4 distribution provided 2(1 - ρ)> 0. This follows from the fact that square of the denominator of the test statistic is an unbiased estimator of and hence, a negatively biased estimator of the variance of W i . Hence, evaluating t-test statistics such as (4) against a t4 distribution will provide p-values which are negatively biased. This is the crux of the Dabney and Storey critique of the Golden Spike experimental design. Unfortunately the experimental design does not provide enough data to fit model (1) and directly estimate the relative magnitudes of and . However, it is still possible to determine that model (1) does not adequately explain all aspects of the observed p-value distributions for all Stage II datasets. There is an underlying symmetry to this putative model mis-specification because if the t-test statistic underestimates the actual variance, then the distributions of the one-sided p-values should be parsimonious with the distribution of the two-sided p-values. In actuality, the one-sided p-values for eight of the ten best datasets proved to be inconsistent with the two sided p-values. Figure 2(b) contains the sample quantile curves for dataset 10a where solid lines correspond to the two-sided p-values and the dashed and dotted lines correspond to the p-values associated with the one sided tests. The distributions of the one-sided p-values are not in agreement. Surprisingly, the set of p-values associated with the "less than" alternative appearing to contain a disproportionate number of large p-values and an insufficient number of small p-values. Datasets 9a-d and 10b-d provided results similar to those observed for dataset 10a. Figure 2(c) contains the sample quantile curves for dataset 10e in which the two sets of one-sided p-values appear to share the same underlying distribution.
Re-Loessing Golden Spike Dataset Improves Null Distributions
Each of the ten best Choe et al. Stage II data matrices were obtained using Stage I steps that included correcting the observed intensity with a loess curve that was fit to values from an invariant set of genes. This invariant set included present null (i.e., present with a putative fold change of one) as well as empty null (i.e., not present in either sample) probesets. The inclusion of the empty null probesets appears to have had a deleterious effect on the distributions of the null p-values. We have calculated a new set of ten best datasets in which the invariant set contains only the present null probesets. These calculations were performed at our request by the authors of  using analysis scripts which were identical to those used for the original analyses except for passages of the code relating to the identification of the invariant set. The red curves in Figures 2(a)–(c) correspond to the sample quantile curves for the re-loessed datasets. The "re-loessed" datasets are still significantly non-uniform, although noticeably less so than the original datasets. Inclusion of the empty nulls in the original invariant sets appears to have contributed to the observed biases in the underlying t-distributions as inspection of Figures 4(b) and 4(d) indicates that this bias appears to be mitigated in the re-loessed datasets.
Other Common Two Sample Tests Failed to Provide Uniform Null Distributions
Distribution Free Diagnostic Plots
Results of logistic regression for intensity dependence.
Diagnostic Plots Applied to the Affymetrix SpikeInSubset Data
Figure 8 suggests that the null distributions may be intensity dependent, although not to the extent as was observed in the original Golden Spike datasets. This result is expected as the Affymetrix SpikeInSubset experiment contained a smaller quantity of spiked in transcripts and did not contain anomalies of the type described in . Figure 9 suggests that the intensity dependence is most notable at the extremes and that the relative variability of the expression values appears to be intensity dependent when the data is normalized using the RMA algorithm.
This analysis does not constitute a thorough investigation of the suitability of the SpikeInSubset data for validation of FDR estimation techniques. Unlike the Golden Spike data set, only a few naive "out of the box" algorithms were applied to the raw data (rather than an analysis which spans many possible combinations and settings). Figures 8 and 9 have been included to illustrate that the diagnostic plots can detect differences between (and possible defects within) the underlying putative null distributions associated with different normalization procedures. These figure also highlight the need for the development of formal methods to assess the statistical significance of such results.
The Golden Spike dataset was generated to address a dearth of controlled spike-in array datasets. The original analysis of the data was presented in  and concluded, among other things, that common methods to control the false discovery rate had failed to control at the nominal level. Dabney and Storey determined that the failure of the FDR algorithms was not methodological, rather the distributions of the null p-values corresponding to the most common choices for the Stage II analysis (e.g., t-test, permutation t-test, and Wilcoxon Rank Sum test) were non-uniform for the datasets which were considered. Dabney and Storey concluded that the Stage II model assumptions (e.g., that the denominator of t-test is appropriate estimator of the underlying variation) associated with these tests are not met, as the non-uniform p-value distributions imply that the technical replicates generated by the experiment do not constitute adequate approximations of biological replicates. We note that their simulation parameters are provided without justification and we demonstrate that their model is inconsistent with the observed p-value distributions. In demonstrating that their analysis is flawed, we conclude that the adequacy of the technical/biological replicate approximation remains an open research question. Such a result has relevance with respect to the design of future spike-in experiments.
We suspect that the underlying null distributions are adversely effected by failures of the normalization algorithms to properly account for the abnormal feature level characteristics identified by . While Irizarry et al. speculate that the Golden Spike data may not be appropriate for the comparison of methods for FDR control, we confirm that this is the case and we demonstrate the failure of the Stage I analyses to correct for systematic biases (whatever their cause may be) in the raw data matrix.
Our analysis of the Golden Spike data also reveals that the invariant set of genes used for the pre-processing steps in Choe et al. should not have included the empty null probesets. We demonstrate that removing the empty probesets from the invariant set can provide Stage II data which appears to be adequately re-centered, a result which is dependent upon artificial knowledge of the true invariant set. Unfortunately, even under these ideal conditions, issues pertaining to higher moments (e.g., relative scale) remain and these issues appear to invalidate the null distributions of commonly used two sample test statistics.
Our analysis constitutes proof of principle that the distributions of the p-values, tests statistics, and probabilities associated with the relative locations and variabilities of the expression values can vary with signal intensity. This implies that Stage I algorithms do not always adequately adjust for intensity dependent effects. If the variation of the expression values for the null genes is a consequence of the unbalanced design of the experiment, then it is reasonable to speculate the existence of biological conditions which could engender similar imbalances. For example, such imbalances could occur when comparing different tissue types, in cases of immune challenge or in certain developmental time course studies. Although it remains an open research question whether our findings apply to other datasets we note the assessment by Irizarry et al. that experiments for which normalization assumptions do not hold are becoming more common.
If the diagnostics which we have introduced prove useful for other datasets, then questions of optimality must be considered. For example, one of the diagnostic tests was based upon a 4 th order polynomial logistic regression model. The order and nature of the model were chosen for computational convenience. A higher order model or a spline based approach could conceivably provide an improved diagnostic. However, the properties of the diagnostic are dependent on the unknown multivariate distribution of the feature level values for invariant genes. In order to compare statistical tests for intensity dependence of the p-values we would need to characterize the multivariate distribution of the invariant probes. This is very difficult due to the complicated correlation structure among the probes (e.g., correlations due to cross hybridization); a correlation structure that may vary from experiment to experiment. Thus the task of optimizing the diagnostics is fraught with challenges and has been relegated to future research.
The Golden Spike dataset was generated according to the experimental design described in  and clarified in Figure 5 of . The Golden Spike data was "re-loessed" using an invariant set which only contained the present null probesets. These calculations were performed at our request by the authors of  using analysis scripts which were identical to those used for the original analyses except for passages of the code relating to the identification of the invariant set. All calculations and figures presented in this manuscript were conducted using the R language and environment .
The Affymetrix "SpikeInSubset" data is contained in the Bioconductor  SpikeInSubset package . The RMA [12, 13], threestep, and MAS 5.0  methods were applied using functions available in the "affy"  and "affyPLM"  R packages. When using the threestep procedure we chose a background subtraction using the Ideal mismatch , quantile normalization , and Tukey's Biweight  method for summarization. Our sample IDs are as follows: sample A-1 = 1521a99hpp_av06, sample A-2 = 1532a99hpp_av04, sample A-3 = 2353a99hpp_av08, sample B-1 = 1521b99hpp_av06, sample B-2 = 1532b99hpp_av04, sample B-3 = 2353b99hpp_av08r.
The authors would like to thank Marc S. Halfon and Sung Eun Choe for their assistance.
- Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 2005, 6 (2): R16-10.1186/gb-2005-6-2-r16.PubMed CentralPubMedView ArticleGoogle Scholar
- Dabney AR, Storey JD: A reanalysis of a published Affymetrix GeneChip control dataset. Genome Biol. 2006, 7 (3):
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185-193. 10.1093/bioinformatics/19.2.185. [Evaluation Studies]PubMedView ArticleGoogle Scholar
- Schadt EE, Li C, Ellis B, Wong WH: Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J Cell Biochem Suppl. 2001, 120-125. 10.1002/jcb.10073. Suppl 37
- Irazarry RA, Cope LM, Wu Z: Feature-level exploration of a published Affymetrix GeneChip control dataset. Genome Biol. 2006, 7 (8):
- Irazarry RA, Wu Z, Jaffee HA: Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2006, 22 (7): 789-794. 10.1093/bioinformatics/btk046.View ArticleGoogle Scholar
- Koenker R: quantreg: Quantile Regression. 2006, [R package version 3.85], [http://www.r-project.org]Google Scholar
- Koenker RW, D'Orey V: [Algorithm AS 229] Computing Regression Quantiles. Applied Statistics. 1987, 36: 383-393. 10.2307/2347802.View ArticleGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology. 2004, 5: R80-10.1186/gb-2004-5-10-r80.PubMed CentralPubMedView ArticleGoogle Scholar
- Irizarry R, Wu Z: SpikeInSubset: Part of Affymetrix's Spike-In Experiment Data. 2006, [R package version 1.2.1]Google Scholar
- R Development Core Team: R: A language and environment for statistical computing. 2005, R Foundation for Statistical Computing, Vienna, Austria, [ISBN 3-900051-07-0], [http://www.R-project.org]Google Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249-264. 10.1093/biostatistics/4.2.249.PubMedView ArticleGoogle Scholar
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31 (4): e15-10.1093/nar/gng015.PubMed CentralPubMedView ArticleGoogle Scholar
- Affymetrix: Microarray Suite User Guide version 5.0. 2001, [Affymetrix, Santa Clara, CA]Google Scholar
- Irizarry RA, Gautier L, Bolstad BM: affy: Methods for Affymetrix Oligonucleotide Arrays. 2006, [R package version 1.12.2]Google Scholar
- Bolstad B: affyPLM: Methods for fitting probe-level models. 2006, [R package version 1.10.0], [http://bmbolstad.com]Google Scholar
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19 (2): 185-193. 10.1093/bioinformatics/19.2.185. [Comparative Study]PubMedView ArticleGoogle Scholar
- Affymetrix: Statistical Algorithms Description Document. 2002, [Affymetrix, Santa Clara, CA]Google Scholar
- Hothorn T: On Exact Rank Tests in R. R News. 2001, 1: 11-12.Google Scholar
- Hothorn T, Hornik K: exactRankTests: Exact Distributions for Rank and Permutation Tests. 2006, [R package version 0.8–12]Google Scholar