 Methodology article
 Open Access
 Published:
Improving the statistical detection of regulated genes from microarray data using intensitybased variance estimation
BMC Genomics volume 5, Article number: 17 (2004)
Abstract
Background
Gene microarray technology provides the ability to study the regulation of thousands of genes simultaneously, but its potential is limited without an estimate of the statistical significance of the observed changes in gene expression. Due to the large number of genes being tested and the comparatively small number of array replicates (e.g., N = 3), standard statistical methods such as the Student's ttest fail to produce reliable results. Two other statistical approaches commonly used to improve significance estimates are a penalized ttest and a Ztest using intensitydependent variance estimates.
Results
The performance of these approaches is compared using a dataset of 23 replicates, and a new implementation of the Ztest is introduced that pools together variance estimates of genes with similar minimum intensity. Significance estimates based on 3 replicate arrays are calculated using each statistical technique, and their accuracy is evaluated by comparing them to a reliable estimate based on the remaining 20 replicates. The reproducibility of each test statistic is evaluated by applying it to multiple, independent sets of 3 replicate arrays. Two implementations of a Ztest using intensitydependent variance produce more reproducible results than two implementations of a penalized ttest. Furthermore, the minimum intensitybased Zstatistic demonstrates higher accuracy and higher or equal precision than all other statistical techniques tested.
Conclusion
An intensitybased variance estimation technique provides one simple, effective approach that can improve pvalue estimates for differentially regulated genes derived from replicated microarray datasets. Implementations of the Ztest algorithms are available at http://vessels.bwh.harvard.edu/software/papers/bmcg2004.
Background
Biologists can now use microarray technology to determine the expression levels of tens of thousands of genes simultaneously, in less time than it previously took to measure the expression level of a single gene. Currently, cDNA and oligo microarrays can measure a sizeable fraction of all mRNA species in cell or tissue samples. The richness of the resulting data is opening up a new era of systems biology that promises to reveal the complex inner workings of cellular machinery [1]. However, there remains the challenge of processing the microarray data from array images into a format that best facilitates the discovery of new biological insights. In fact, applying improved computational tools to previously published microarray data has led to the discovery of new biology (e.g. [2]). We strongly believe, as do others, that the quality of the data processing steps is critical to the overall success of a microarray experiment [3].
A typical data processing pipeline consists of several steps. (See [4] for a review, and see [5, 6] for a review of microarray processing software.) First, image analysis software locates the arrayed spots in the scanned image, quantifies the foreground and background brightness of each spot, and notes any irregularities in spot morphology. The background intensity value is then subtracted from the foreground intensity value. The backgroundsubtracted intensity data from each array must then be normalized, or rescaled, to remove artifactual differences in signal brightness due, for example, to different labeling efficiencies that produced arrays of different overall intensity. Normalization techniques are often based on the assumption that a large number of spots will have similar expression levels between conditions. Curvefitting techniques, such as a locally weighted regression, are used to equalize expression values between arrays, or between array channels for twocolor arrays [7, 8]. After this normalization, the intensity values can be used by a variety of algorithms for detecting differences in expression between the measured biological conditions. This processing is applied whether two samples are compared directly or a "reference sample" experimental design is used. In a reference sample design, the same reference RNA sample is hybridized to one channel of all arrays, and the other channel is hybridized with each individual experimental sample. This design is often used when multiple biological conditions are being investigated and it becomes impractical to perform every pairwise combination of conditions directly [4, 9].
Accurately detecting differentially regulated genes
Given a list of normalized intensity values across various biological conditions, the next step is to determine which genes are differentially regulated among the conditions being studied. In the early days of microarray experimentation, an emphasis was placed on analyzing the data using exploratory data mining techniques, such as hierarchical clustering [10] and selforganizing maps [11]. Clustering algorithms measure the similarity between observed gene regulation patterns across the various conditions, and assemble clusters such that similarly regulated genes are grouped together. The resulting clusters produce an effective overview of the data, showing which of the many possible patterns of regulation are actually present in the data. Since these patterns are somewhat robust, a few erroneous spots are unlikely to change them dramatically. For a researcher who is simply interested in the overall pattern of the data, performing replicate arrays to reduce the number of errors is not particularly efficient. Many researchers choose instead to explore a greater number of experimental conditions.
Increasingly, microarrays are being used in a different context; researchers want to know with high confidence which specific genes are regulated across a small number of experimental conditions (e.g., treatment vs. control, or mutant vs. wildtype). To answer this question, it becomes extremely important to use an accurate method to rank individual genes by their probability of truly being regulated, especially since this information may be used to plan more laborintensive experiments around biological questions raised by a small number of such putatively regulated genes. In the absence of replicate arrays, the reliability of the data can be estimated (e.g. [12, 13]), but such "single slide" methods require a model of the expected noise characteristics of the system, a property that can potentially change between datasets. Performing replicate arrays can significantly improve predictions of differentially regulated genes, thereby decreasing the false positive (false detection) rate and false negative rate [9, 14, 15]. Using replicate arrays allows the calculation of more accurate significance estimates (pvalues) that will aid in the interpretation of a list of "top regulated genes," which are commonly ranked by ratio alone.
Here we address the problem of accurately detecting genes that are significantly differentially regulated between a pair of biological conditions, given microarray datasets with a small number of replicates (e.g. N = 3 arrays). If the number of replicates were very large (e.g., hundreds), the task would be relatively easy; since the ratio of expression levels between the two conditions would be well estimated by the average ratio or median ratio, the genes could simply be ranked by one of these estimates. In practice, however, the number of replicate arrays is rarely greater than 3, and estimates of average expression ratios are not always sufficiently accurate to predict which genes are truly regulated. The variation of a measured expression ratio is critical in determining whether the observed ratio is due to random measurement fluctuations or to a true difference between the quantities being measured. Genes with larger measured expression ratios between conditions are more likely to be truly regulated, while genes whose ratios have a high measured variance are less likely to be truly regulated. This idea can be expressed mathematically as a test statistic where the numerator contains an estimate of the size of the effect, i.e. the ratio of gene expression intensities between conditions, and the denominator includes an estimate of the variance, i.e. the standard deviation of the ratio. A variety of such statistical tests have been applied to microarray data (reviewed in [4, 16]); the challenge is to choose the numerator and denominator of the test statistic such that it makes the best use of all available data in order to get the most accurate determination of which genes are most likely to be regulated.
Comparing statistical tests used to find differentially regulated genes
The familiar Student's ttest (hereafter, "standard ttest") is the most straightforward method of calculating whether there is a significant difference in expression levels between conditions for each gene. Suppose that mRNAs from two biological conditions, "X" and "Y", are hybridized to a small number of replicate arrays (N twocolor arrays or 2N onecolor arrays). M_{avg}, the average logged ratio of expression levels between conditions X and Y, and its sample standard deviation, σ_{M}, are given by the standard formulas (see Methods). A standard tstatistic is calculated as . From this formula, it is clear that a large tstatistic (and the corresponding highly significant pvalue) can occur because of either a large M_{avg} (high ratio) or a small σ_{M} (low noise). Although the standard tstatistic (or derivates thereof based on permutation [17] or Bayesian analysis [18]) can produce acceptable results for larger numbers of replicates (e.g., N = 8), the results are less than satisfactory when applied to a small number of microarray replicates (e.g., N = 3, Fig. 1). Fig. 1a shows data from an experiment that was repeated 6 times on twocolor arrays. The six arrays were split into two random groups of three arrays, and the tstatistic described above was calculated for each gene in each group of three. The tstatistics from the two groups are graphed against each other in Fig. 1a. Although the two groups contain replicate arrays from the same experimental conditions, the tstatistic is clearly not reproducible between the groups. Fig. 1b and 1c demonstrate that M_{avg}, the numerator of the tstatistic, is more reproducible between the two groups, while 1/σ_{M}, representing the denominator of the tstatistic, is not reproducible. This example highlights the major shortcoming of the tstatistic: due to random chance, the replicate ratios can occasionally be extremely similar, producing an artificially low σ_{M} and high t values. False positives stemming from this effect prevent the standard tstatistic from serving as a reliable or useful test of which genes are truly regulated.
To overcome this limitation, various modifications to the tstatistic have been proposed. First, a "penalized" tstatistic (also called a "moderated" or "regulated" tstatistic) can be used, where a constant value is added to the denominator. Tusher et al. use a penalized tstatistic of the form
[19]. The addition of the constant s_{0} prevents the denominator from becoming small for low σ_{M}, reducing the false positive rate of genes with unusually low σ_{M}. Choosing too large an s_{0}, however, effectively makes the denominator a constant, removing useful information about the variability of genes. Estimating the optimal s_{0} for a particular dataset can be based on minimizing the coefficient of variation of the absolute tstatistic values ("SAM") [19], minimizing false positive and false negative estimates obtained through permutation ("SAMroc") [16], or simply choosing s_{0} as the 90^{th} percentile of the σ_{M} values [20]. These studies have demonstrated that when ranking genes from a microarray dataset, a penalized tstatistic can perform better than a standard tstatistic in terms of decreasing the false positive and false negative rate [4, 16, 18–21], but it also has the potential disadvantage of showing bias against genes of high intensity [16].
An alternative to using a penalized tstatistic is obtaining a more precise estimate of the standard deviation σ_{M}. Such an estimate should be less susceptible to a chance concordance of measurements of M that occasionally produces an extremely low σ_{M} and a high tstatistic. For this purpose, knowledge of the relationships between the data points can be used to improve the estimate. Namely, the variance values, or σ_{M} ^{2}, for one spot can be pooled, or smoothed, with the σ_{M} ^{2} values of spots that are likely to have similar variances. The variance of microarray data has often been observed to be a function of the spot intensity [12, 15, 21–30], raising the possibility that the variances of individual spots can be pooled with those of spots of similar intensity to produce a more precise estimate of the standard deviation. Several studies have taken into account this intensitydependent heteroscedasticity. For example, Rocke et al. [27] and Newton et al. [13] have presented models of measurement error in microarrays that can explicitly take into account higher variance at lower expression levels. More general approaches to variance pooling have been implemented in a variety of ways, using loessbased curve fits [15], robust nonparametric spline fits [28] and sliding windows for calculating either local averages [26, 29, 30] or interquartile ranges [24]. These more reliable estimates of the standard deviation can be used directly to calculate Zstatistics, which are calculated according to the same formula as the standard tstatistic, but correspond to lower pvalues [26, 31].
Strategies for pooling standard deviations
The studies cited above use methods that pool spots together based on their average intensity or logged intensity. For example, consider one set of replicate spots with an average intensity of 128 (2^{7}) in one channel and 16384 (2^{14}) in the other channel compared to a set of replicate spots with an average intensity of 1024 (2^{10}) in one channel and 2048 (2^{11}) in the other channel (Fig. 2). Since both of these sets of spots have the same average log_{2} intensity of 10.5, the standard deviations of their ratios would be presumed to be similar and would be pooled together using the pooling methods described above. However, these spots may actually be expected to have quite different standard deviations; we have noted that many ratios with high variances result from spots that have a medium or high intensity in one channel and a very low intensity in the other (data not shown). Thus, the ratios for the first spot are expected to be more variable because of the very low intensities (~100) in one channel. In this study, we test the hypothesis that if spots are pooled together with other spots of similar minimum intensity over both channels (I_{min}), rather than average intensity over both channels (I_{avg}), then a larger proportion of the highvariance spots will be grouped together, resulting in a tighter fit of the pooled standard deviation curve to the actual variance and generating more accurate estimates of the standard deviation.
This study expands upon previous work on intensitydependent variance estimation for microarray data by introducing a new metric, I_{min}, for pooling standard deviations. We evaluate the performance of the I_{avg} and I_{min} metrics by explicitly comparing the reproducibility and accuracy of the Zstatistics calculated using these two metrics. We also compare the performance of the Zstatistics to the performance of other statistical techniques in current use, the standard and penalized ttests. Finally, we extend our technique for pooling standard deviations to twocolor microarray data from a reference sample experimental design.
Results
Datasets
The analyses in this study were performed on five different datasets. Datasets 1–4 use the direct comparison experimental design, i.e. labeled cDNA from two biological conditions, "X" and "Y," were cohybridized onto a single array. Each dataset was generated from a different biological experiment using twocolor Agilent cDNA arrays. For Datasets 1–3, the biological experiment, RNA processing and array hybridization were repeated three times. Dataset 4 contains 23 replicate arrays (see Methods). Dataset 5 uses a reference sample design, where RNA from each experimental condition is cohybridized on an array with a standardized reference RNA sample. Dataset 5 contains three replicates arrays for each experimental condition.
Average logged intensity (I_{avg}) vs. minimum logged intensity (I_{min}) pooling metric
We demonstrate our technique of pooling standard deviations using the three arrays in Dataset 1 as a representative example of a "direct comparison" dataset. For each spot, we calculate the average logged ratio M_{avg} and the standard deviation of the logged ratio σ_{M}, across the three replicates. The spots are then sorted by either average intensity (I_{avg}) or minimum logged intensity (I_{min}) before pooling. Fig. 3a and 3b show the results of pooling standard deviations for Dataset 1, using either the I_{avg} or I_{min} metrics; the measured standard deviation σ_{M} and the pooled standard deviation σ_{M}' are plotted together against either I_{avg} or I_{min}. For better comparison, the pooled standard deviation curves for σ_{M}'(I_{avg}) and σ_{M}'(I_{min}) are both plotted together on Fig. 3c against their respective intensity metric, I_{avg} or I_{min}. Fig. 3 is based on data produced using the Agilent Feature Extraction software Version A.6.1.1 to quantify spot intensities in the original microarray image. This entire analysis was repeated on Datasets 2 and 3, as well as using two additional image processing techniques: SPOT Processing [32] and a combination of Agilent foreground and SPOT background values (see Methods).
We evaluated the tightness of the I_{avg}pooled vs. I_{min}pooled standard deviation curve fits to the measured standard deviations. Figs. 4a and 4b plot both measured (σ_{M}) and pooled (σ_{M}') standard deviations against either the I_{avg} or I_{min} pooling metric, analogous to Fig. 3a and 3b but using an especially noisy threearray subset of Dataset 4 that includes a population of extremely high variance spots. Instead of pooling together spots with similar variance, the I_{avg} metric combines the highvariance spots with the lowervariance spots. In contrast, the I_{min} metric pushes the highvariance spots to the left end of the curve, apart from the less noisy spots. This effect is reflected in the lower mean residual errors between σ_{M} and σ_{M}' for the I_{min} metric, calculated for Datasets 1–3 and six independent threearray subsets of Dataset 4 (see Table 1). For all of the datasets processed with Agilent Feature Extraction software only, the mean residual errors from using the I_{min} pooling metric are always less than or equal to the corresponding mean residual errors from using the I_{avg} pooling metric. This observation is most striking for Dataset 4 subset #3, which corresponds to the data in Fig. 4. The tighter fit that is obtained using the I_{min} metric is also reflected in the improved accuracy of the final Z statistic calculated using σ_{M}'(I_{min}), which is demonstrated in Fig. 5 and discussed below. The trend in residual values is not present when datasets are processed with the SPOT technique or with Agilent foreground and SPOT background.
Comparing the accuracy of different ranking statistics
In order to test the accuracy of the different test statistics – M_{avg}, the standard tstatistic, the 90th percentile penalized tstatistic, the SAM penalized tstatistic, Z(I_{avg}) and Z(I_{min}) – a subset of three arrays was randomly selected from the total set of 23 replicate arrays in Dataset 4 (see Methods). Each statistic was calculated for each gene in this set. The large number of remaining replicate arrays allowed us to calculate an approximate "gold standard" statistic, t_{gold}, by computing the standard tstatistic over the set of 20 remaining replicates. The value of each test statistic from the threearray subset was compared to the value of the "gold standard" tstatistic, t_{gold}, as shown in Fig. 5. The squared Pearson's linear correlation coefficient value (R^{2}), representing the degree of concordance between the test statistic and t_{gold}, was calculated. This analysis was repeated five additional times, selecting different subsets of experimental and "gold standard" arrays from Dataset 4 each time, and the R^{2} values from all six repetitions are given in Table 2. The Zstatistics and penalized tstatistics both have appreciably higher R^{2} values than either M_{avg} or the standard tstatistic. The R^{2} values for Z(I_{min}) are greater than the R^{2} value for any other technique across all six datasets. Note that there is less scatter for highmagnitude values when using Z(I_{min}) instead of Z(I_{avg}) (Fig. 5e and 5f respectively). Accordingly, the R^{2} value is higher for the Z(I_{min}) than the Z(I_{avg}) ranking metric for all three datasets, confirming that the tighter curve fits seen in Fig. 4a and 4b and Table 1 (see above) translate into improved accuracy of using the I_{min} pooling metric over I_{avg}.
Comparing the reproducibility of different ranking statistics
We also evaluated the reproducibility of these different test statistics, by constructing test datasets that split six replicate arrays from Dataset 4 into two subsets of 3 arrays (see Methods). Each test statistic – M_{avg}, the standard tstatistic, the 90th percentile penalized tstatistic, the SAM penalized tstatistic, Z(I_{avg}) and Z(I_{min}) – was calculated for both threearray subsets. A precise, i.e., reproducible, test statistic should produce similar values for both subsets since all of the arrays in both subsets were drawn from a pool of replicates prepared from identical biological experiments. Fig. 1a,1b and Fig. 6a,6b,6c,6d show the correlation for each test statistic between the two subsets, including a linear regression line in Fig. 6. The slope coefficient of the linear regression indicates whether overall magnitudes of the test statistics are different between the two subsets, while R^{2} indicates the degree of correlation on a genebygene basis (Table 3). This analysis was repeated for an additional two pairs of independent threearray subsets of Dataset 4 (graphs not shown), with the slope coefficients and R^{2} values given in Table 3.
The R^{2} values for the two Zstatistics were similar to each other and consistently higher than those of the other techniques. Nonparametric measures of correlation, the Spearman Rho and Kendall Tau rank correlation coefficients, were also higher for both Zstatistics than any of the other statistics for all three pairs of subsets (data not shown). All three calculations of the slope coefficients for both Zstatistics, as well as M_{avg} and the 90^{th} percentile penalized tstatistic, are close to 1, indicating that the overall magnitudes of the Zstatistics are consistent across datasets, whereas the standard tstatistic and the SAM penalized tstatistic produced test statistics whose overall magnitudes vary across the subsets.
Outlier detection
When calculating the Zstatistic, using a much smaller pooled σ_{M}' in place of a large σ_{M} has the potential to overestimate the significance of gene regulation in the case where one of the replicates is an outlier measurement and the large measured standard deviation provides a better estimate of the variability. As seen in Fig. 3a,3b,3c, there are several spots that lie far above the pooled standard deviation curve. Datasets 1 and 4 were reprocessed using an outlier detection technique (see Methods). Fig. 7a shows σ_{M} and σ_{M}' from Dataset 1 plotted against I_{min}, as in Fig. 3c, except that the yaxis has been rescaled to show all spots detected as outliers, which are now highlighted in black.
The accuracy of this outlier detection technique was also evaluated by comparing the Zstatistic to t_{gold} using Dataset 4. Fig. 7c plots Z(I_{min}) vs. t_{gold} for Dataset 4 set #2. The outliers, which are highlighted, include false positive spots for which t_{gold} is low and Z(I_{min}) is high, although not all such points are detected as outliers. Fig. 7d is an identical plot to Fig. 7c except that the Zstatistics for the outlier spots are calculated using the highervalued measured standard deviation σ_{M} instead of the pooled value σ_{M}'. The outliers are now mostly clustered around the origin with the other nonsignificant spots. A few spots with moderately high t_{gold} values are detected as outliers and have low corrected Zstatistics, and some potential false positives with high Zstatistic values and low t_{gold} values are not detected as outliers.
At the end of the analysis, the outliercorrected Z statistics are converted to pvalues. To demonstrate the additional information that the pvalues provide, Fig. 7b shows a scatterplot of X vs. Y for Dataset 1, with statistically significant spots colored according to their multipletestcorrected pvalues (see Methods). Spots with similar ratios may have different pvalues due to their different standard deviations. In addition, after outlier detection, some spots with high ratios are not found to be significant.
Analysis of reference sample arrays
The techniques used above for a direct comparison experimental design were extended to a reference sample design (see Methods). Under a reference sample design, one can estimate either the standard deviation of the individual logged ratios comparing experimental samples to the reference sample, M_{x} and M_{y}, or the standard deviation of the paired differences of these logged ratios, μ = M_{x}M_{y}. Under the first, or unpaired method, the Zstatistic is calculated as where N_{X} and N_{Y} are the number of replicates for the given spot for condition X and condition Y, respectively. Under the second, or paired method, the Zstatistic is calculated as where N is the number of paired replicates for the spot. The samples used in a reference sample design may not always have been collected or processed in pairs, so we evaluated both of these methods.
For each replicate in reference sample Dataset 5, the biological specimens for conditions X and Y were prepared on the same day, so a natural pairing exists for the condition X and Y arrays. These data were processed using all three image processing techniques and then analyzed using both paired and unpaired methods, and using either the I_{avg} or I_{min} pooling metric for each approach. Fig. 8a shows the measured and pooled standard deviation of the paired differences of logged ratios (σ_{μ} and σ_{μ}') plotted together against the pooling metric, I_{min}. Curve fits were analogously constructed using the I_{avg} pooling metric with the paired method (with results similar to using the I_{min} metric, data not shown), and using both I_{avg} and I_{min} with the unpaired method (with results similar to using the ratio method with direct comparison arrays, data not shown). The unpaired σ_{M}' and paired σ_{μ}' curves are plotted together against their the I_{avg} or I_{min} pooling metric in Fig. 8b. The paired standard deviations are lower than the unpaired standard deviations except at low intensity metric values.
Linear regression was performed between Zstatistics calculated using the paired and unpaired methods for all spots. Table 4 gives the linear regression slope coefficients when either the I_{avg} or I_{min} pooling metric was used, for Dataset 5 processed with the three different image processing techniques. For most spots, both the difference of logged ratios (μ) and number of replicates (N) are the same, except for the occasional difference between the two conditions in the number of low quality spots that are excluded from the analysis. Thus, differences in the Zstatistic primarily reflect differences in the standard deviations. The slope coefficients are all greater than 1, indicating that the paired technique produced higher Zstatistic values, due to the lower standard deviations that are produced with paired analysis.
The mean residual errors for spots with σ' < σ and σ' > σ were calculated when using the I_{avg} or I_{min} pooling metric in unpaired or paired analyses of Dataset 5, and are given in Table 5. For the unpaired analysis, as in the direct comparison experiments, mean residual values produced by using the I_{min} pooling metric are less than or equal to those produced by the I_{avg} pooling metric. The same trend is seen between the two pooling metrics for the paired analysis. These results are consistent regardless of the image processing technique used.
Discussion
Building up new knowledge about biological systems is the ultimate purpose of microarray experiments, but all such insights have to be built on a solid foundation to be accurate and useful. Proper normalization of data and accurate detection of which genes are regulated are vital to the success of downstream exploration of microarray data. Even for exploratory cluster analyses, the genes that are significantly regulated must be selected beforehand. This task of detecting these genes is a difficult statistical problem; a statistical hypothesis is made for each of tens of thousands of genes tested, but only a small number of replicate arrays are available to test those hypotheses. The statistical methods presented in this study attempt to draw as much information as possible out of a small number of array replicates to determine which genes are likely to be regulated.
It is clear that looking at the measurements of each gene in isolation can produce a test with low statistical power (e.g. using the standard ttest, Fig. 1). To improve statistical power, we can use knowledge about the relationships among the many thousands of points in the arrays. Specifically, we group together spots that have similar standard deviations and then pool together many less accurate estimates of standard deviation into a single, more accurate estimate. Our data also show that the Zstatistics are more precise than either standard or penalized tstatistics for detecting differential gene expression in microarray data. We further demonstrate that pooling standard deviations using the minimum intensity metric produces Zstatistics that are more accurate than the standard ttest, the penalized ttests, and the average intensitybased Zstatistic.
Average combined logged intensity (I_{avg}) vs. minimum logged intensity (I_{min}) pooling metric
We evaluated two different intensitybased metrics for pooling standard deviations. There are many reports that the variance is a function of intensity, but the exact shape of this relationship could depend on many factors extrinsic to the biological experiment, such as the array technology being used, the signaltonoise ratio of the data, the similarity between the two conditions[30], the normalization technique or the background subtraction technique. For this reason, we favor an estimation of the standard deviation using a curvefitting technique rather than a fixed model based on previous data. Furthermore, when dealing with twochannel arrays, there are two different intensity values associated with each replicated spot. It is possible that the variation is best described as a function of the average intensities of both channels. However, our own experience and many other reports suggest that the highest variances are often seen for low intensity spots. If so, the variance may be better described as a function of the minimum intensity over all the spots.
The data presented here show that the mean residual errors are either equal or lower when using the I_{min} compared to the I_{avg} pooling metric, for every dataset using the Agilent Feature Extraction image processing technique. The subset of Dataset 4 for which this difference is most striking, #3 in Table 1, also has a population of spots with particularly high variance (see Fig. 4). The I_{avg} metric pools these spots together with other spots that have a much lower variance. In contrast, the I_{min} metric moves these spots to the low end of the xaxis, and the curve fit tracks the standard deviation of the spots much better. The noisiest spots on microarrays are often those where at least one channel is "blank", i.e. a noisy, low level of signal that presumably represents no expression. The I_{min} metric is better at grouping such spots together. For datasets with low background levels, there is a smaller difference in the performance of the two pooling metrics.
The trends in the mean residual errors from the unpaired reference sample analysis agree with the results from the direct comparison analyses. This similarity is to be expected, since processing each reference sample condition separately is equivalent to doing a direct comparison between each condition and reference RNA samples. Both pooling metrics generate similar mean residual error values when pooling σ_{μ}, but one dataset is not enough to make any generalizations about which pooling metric will perform best for all paired reference sample datasets. The improved performance of the I_{min} pooling metric is lost when using SPOT processing or combined Agilent foreground and SPOT background image processing, suggesting that these image processing techniques may be more effective at removing noise at low intensities.
The I_{avg} and I_{min} pooling techniques are reproducible to the same degree, since their R^{2} coefficients between Zstatistics from paired datasets (see Table 1) are similar to each other. The I_{min} pooling technique generates slightly more accurate results, as indicated by the greater R^{2} coefficients between Z(I_{min}) and t_{gold} compared to those between Z(I_{avg}) and t_{gold} (see Table 2). This trend holds for all six subsets of Dataset 4.
The higher accuracy of Z(I_{min})
The Zstatistic calculated using the I_{min} pooling metric provides an improvement in accuracy over the other techniques. The tstatistic derived from datasets with 20 replicates was used as a surrogate "gold standard" since 8 or more replicates can be considered sufficient to give power to the tstatistic [17]. The tstatistic was chosen as the "gold standard" instead of the average logged ratio since the latter does not take variability into account. For each of the six permuted subsets of Datasets 4, the 90^{th} percentile penalized tstatistic, SAM penalized tstatistic, and Z(I_{avg}) had similar R^{2} values when correlated with the "gold standard" tstatistic, although the SAM statistic did perform poorly for the noisiest subset of Dataset 4 (#3 in Table 2) with an R^{2} value of only 0.70. Z(I_{min}), however, consistently produced the highest R^{2} value for each of the six datasets. Since the ratios used in each of these statistics is identical, this result indicates that the standard error generated with the I_{min} technique produces the best correlation with the gold standard tstatistic based on 20 replicates. Although excluding spots with very low intensity could eliminate the difference in performance between the I_{min} and I_{avg} pooling metrics, this approach would make it impossible to detect lowexpressed regulated genes, which may be biologically significant.
The Zstatistics from the I_{min} technique do not correlate perfectly with the "gold standard" tstatistic, however. Some disagreement can be expected because the Z(I_{min}) data was based on only three replicate arrays, which contain much less information than the 20 replicates used to calculate the "gold standard" tstatistic. Also the significance estimates calculated using the "gold standard" tstatistic may still contain some inaccuracies, even with 20 replicates. Kerr et al. found this to be true with 12 replicates, where accuracy is reduced if the error distribution for each gene is modeled separately instead of using a pooled estimate [15]. Analyzing the large (N = 20) replicate dataset using robust estimators of ratio and standard deviation may be able to create a more accurate "gold standard" to use for further testing of the Zstatistic or other statistics. Note that we do not employ an explicit permutationbased approach to estimate the false detection rates of the statistics investigated in this study, as in Ref. [16]. Rather than permute gene labels from a small set of arrays to estimate the distribution of expected test statistics, with the availability of the large (N = 23) replicate dataset described herein, we preferred to use this rich source of actual test statistics directly.
The higher reproducibility of zstatistics
The Zstatistic – calculated with either pooling the I_{min} or I_{avg} pooling metric – provides an appreciable improvement in reproducibility over the average logged ratio alone, the standard ttest and the 90^{th} percentile and SAM penalized tstatistics. Both linear (R^{2}) and nonparametric rank correlation coefficients were highest for the Zstatistic when comparing corresponding spots between three independent pairs of replicate datasets. Also, the standard tstatistic and SAM penalized tstatistic generate linear regression slope coefficients that vary greatly from pair to pair, indicating that their absolute magnitude is not as reproducible as the Zstatistics, whose linear regression slope coefficients are much closer to 1.
The high correlation values and nearunity slope coefficients for the Zstatistic support the hypothesis that pooling the standard deviations of spots with similar intensities provides a stable, precise estimate of the standard deviation. This assumption of a wellestimated standard deviation supports the use of the Gaussian distribution to map the Zstatistic to a pvalue. Using only the measured standard deviation, one is forced to use a tdistribution with only 2 degrees of freedom to generate a pvalue. This test does not have sufficient power to generate any significantly regulated points; because of the very small number of degrees of freedom, not a single spot seen in Fig. 1a is found to be significant after multiple test correction. In contrast, even after a conservative multiple test correction that makes the cutoff for statistical significance much more stringent, many spots are found significant using the Zstatistic. The penalized tstatistics do not produce a stable estimate of the standard deviation with these data, perhaps because the constant added to the denominator of the test statistic showed a large variation between replicate datasets. Therefore they cannot be mapped to a pvalue in a reproducible manner.
Outlier detection
One limitation of using a pooled standard deviation is that for a spot with replicate ratios that include one or more outliers, the appropriately high measured standard deviation will be replaced by an inappropriately low pooled standard deviation. This substitution could produce a false positive result. We have sought to minimize this limitation by implementing an overlying outlier detection algorithm. (For other implementations of outlier detection, see Ref. [26, 30].) The algorithm in this study uses the measured standard deviation instead of the pooled standard deviation for spots for which the pooling model may not hold. These spots are identified as ones for which residual error σ–σ' is positive and greater than twice the standard deviation of the positive residual errors.
The measured standard deviations for these outlier points are valid sample measurements of the variance process and should be used to calculate the pooled standard deviations for spots with similar intensities. These ratio measurements, however, are too widely varying for one to have the same confidence in the average ratio as one would have for other spots; thus, it is appropriate to substitute the measured standard deviation for the pooled standard deviation in these cases. Fig. 7c,7d, which highlight outlier spots on a plot of the Z(I_{min}) vs. the "gold standard" tstatistic for Dataset 4b, show that this outlier detection technique correctly detects many of the presumably false positive spots that have a high Zstatistic and low t_{gold} value. The plots also show some false positive spots that are not detected through this algorithm, as well as a few spots that become false negatives after outlier detection. Other, more complex outlier detection algorithms may perform better, and should be explored. A simple modification to the current algorithm, using local instead of global estimates of the standard deviation of the residual error, may improve outlier detection. Alternative implementations include modifying the pooling window shape to give more weight to a spot's measured standard deviation or that of its nearest neighbors by intensity. Strictly speaking, the pvalues for outlier spots should be calculated using a tdistribution instead of a Gaussian distribution since the measured standard deviation is being used. We have shown, however, that with 3 replicates, no spots in our datasets can be found statistically significant using the ttest and strict multiple test correction. In order to preserve detection of spots, we continue to use the Gaussian distribution to convert outlier Zstatistics to pvalues, which may slightly increase the false positive rate for spots detected as outliers. In practice, however, such spots are rarely found to be significantly regulated.
Unpaired vs. paired analysis for reference sample experiments
Finally, we have extended our algorithms to apply to data from a reference sample experimental design. This design gives one the flexibility to compare many different conditions to one another, but the tradeoff is a loss in precision. In theory, using a reference sample design instead of a direct comparison design should increase the variance by a factor of 2. This increase has in fact been observed in practice [33].
The paired analysis method can reduce the measured variation in a reference sample design. The linear regression slope coefficients in Table 1 indicate that the Zstatistic values using the paired analysis are higher than the unpaired Zstatistic values. Thus, the paired difference of logged ratios, μ, is less variable than the independent logged ratios, M_{X} and M_{Y}. This observation suggests that the effects of biological or analytical variation from replicate to replicate can be reduced if comparisons are made between paired samples. Whether this reduction is due to using paired biological samples or paired array processing dates [34] is still an open question, and probably will be contextdependent. Although it may not always be practical, it would be beneficial for investigators to design reference sample experiments to be performed in parallel whenever possible to take advantage of the lower standard deviations produced by paired analysis.
Finding the optimal statistical test
Several areas remain for further refinement of our implementation of poolingbased statistical analysis of microarray data. Currently, the standard deviation is pooled using a simple moving rectangular window of 501 spots, but other window sizes and shapes may improve performance slightly. More generally, we have not explicitly compared the moving average estimator with the splinefit or loessbased techniques to estimate the standard deviation used in other studies (see Background). While we expect performance to be similar, further testing may reveal an advantage.
Following Ref. [35], we do not try to estimate the dyespecific bias of individual spots or genes (i.e., dyegene interaction) in order to preserve degrees of freedom needed to estimate the variance. Informally we noted that dye bias in some spots produced high measured variances that caused those spots to be considered nonsignificant outliers. A posthoc test to warn of potential dye bias of individual spots may be appropriate for small numbers of array replicates (e.g. N = 3), especially if the experimental design is unbalanced (i.e., the number of dyeswapped and unswapped arrays is not equal).
Note that this study only considered statistics of the general form (ratio) / (standard deviation). ANOVA models that consider the variance as intensitydependent, as seen in Ref. [15, 25], can be seen as an extension of this concept. An ANOVA framework, however, also allows for a more complicated experimental model that can incorporate normalization and multiple biological conditions. Pooling standard deviations as a function of minimum intensity instead of average intensity may benefit such models. Permutation tests can also be used to detect regulated genes, and are known to be robust to outliers but can have low power for small N. Xu et al. found a permutation test to be equally or less accurate than parametric methods in ranking genes [36]. Bayesian analysis can also be applied to microarray data [13, 20, 21], and may be useful in this context to draw more information out of the distribution of intensities and ratios in the data.
In this study, data is first normalized, and then detection of regulated genes is performed in a separate step. In contrast, other approaches incorporate normalization and statistical inference into a unified model [29, 35]. Furthermore, the options for normalizing the data are numerous, including algorithms based on local regression (loess) [7], splines [37], a constant shift [15], or more exotic transforms that tend to remove the intensity dependence of the variance [38]. Increased attention to the lowlevel details of scanning and image processing may also improve accuracy [22, 33, 39], while at the same time potentially changing the intensity dependence of the variance. It remains to be seen how the techniques used for normalization or variancestabilizing transforms will impact the accuracy and precision of regulated gene detection. In addition, we are concerned that some of these transforms may create a systematic bias for or against genes of low intensity (e.g., [40]).
Test performance can depend on data characteristics
Although many datasets have a variance that is intensitydependent [12, 15, 21–26], some studies have analyzed datasets whose variance characteristics are not strongly intensitydependent (e.g., [35]). In general, we have experienced that microarray datasets with a low background relative to signal, loessbased normalization, and conservative background subtraction (e.g. SPOT Image Processing) produce standard deviations that are not strongly intensitydependent. In this context, the differences between the I_{min} and I_{avg} metrics disappear. In fact, for data with unusually low noise, the standard deviations is nearly constant across all spots and all of the statistical tests considered in this paper, even simply the average logged ratio, tend to converge. This observation is not unexpected; as the standard deviations converge to the same value, the denominator of the test statistics will become constant, leaving the test statistics simply proportional to the ratio. We would recommend finding a normalization [7, 29, 33, 37] and background subtraction technique [22, 32, 39] that produces low, intensityindependent standard deviations. Applying variance stabilizing transforms may eliminate the intensity dependence of the standard deviation [38], but might also reduce statistical power or bias the test toward spots of certain intensities. It cannot be predicted in advance whether all intensity dependence of the variation will be removed, so we continue to use the more robust statistic Z(I_{min}) for all of our datasets. Furthermore, in situations where changing the background subtraction or normalization technique is not possible because the original data is not available, using a more robust statistic like Z(I_{min}) will be advantageous.
While the pooling techniques described herein can compensate for intensitydependent variation, this intensity dependence can be minimized or exaggerated by different normalization techniques and background subtraction techniques. These techniques may have subtle effects on the power to detect regulated genes at different intensities, perhaps creating bias for or against detection of lowexpressed genes. For this reason, until the most sensitive and unbiased normalization and background subtraction methods are optimized for each microarray system, we would encourage creators of microarray data archives to preserve unnormalized intensity and background data, and the original image data when possible.
Of the many useful tests used to detect regulated genes from a small number of microarray replicates, we see the intensitybased variance estimation and Zstatistic described in this study to be a good combination of simplicity, robustness, precision, and accuracy. This technique allows meaningful pvalues to be added to a list of regulated genes. With this assessment of statistical significance, an investigator can proceed to focus on genes that are most likely to be regulated.
Methods
Data acquisition
For Datasets 1–3, microarrays were prepared essentially according to the manufacturer's instructions [41]. Briefly, 20 μg of total RNA were directlabeled with Cy3 and Cy5, and labeled cDNAs were hybridized overnight to Agilent Human 1 cDNA arrays (G4100a, Agilent Technologies, Palo Alto, CA) containing 16,142 features representing approximately 10,500 unique genes. After washing, the microarrays were scanned in an Agilent model G2505A microarray scanner.
Dataset 3 contains 3 replicate twocolor arrays with condition X in the Cy5 channel and condition Y in the Cy3 channel. Dataset 1 contains 3 replicates from another experiment, including one dyeswapped array; i.e. condition X in the Cy3 channel and condition Y in the Cy5 channel. Dataset 2 contains 3 replicate arrays without dyeswap, but each array was hybridized with a different amount of RNA, 5, 10 or 20 μg.
Dataset 4 consists of 23 replicate Agilent cDNA arrays from the Alliance for Cellular Signaling. The files MAE030201N00.txt to MAE030223N00.txt were downloaded from http://www.signalinggateway.org/data/micro/cgibin/microcond.cgi. These arrays correspond to the conditions "Bcell + SIMDM exposure = 0 minutes" vs. "Spleen". Four additional arrays are available for this condition (numbered MAE02070xN00.txt), but these arrays appeared to be slightly different from the other 23 arrays (using hierarchical clustering, data not shown) and were excluded from further analysis. The Bcell RNA was derived from 23 preparations, each from a different set of mice, while the spleen RNA was drawn from a single large pool (Rebecca Hart, Alliance for Cellular Signaling at the California Institute of Technology, Pasadena, CA, USA, personal communication).
Reference sample Dataset 5 is a set of 6 microarrays generated in our laboratory. Each of the arrays contains a reference RNA sample in the Cy3 channel. Three have condition "X" samples in the Cy5 channel and the other three have condition "Y" samples in the Cy5 channel. Since corresponding biological specimens for conditions X and Y were prepared together for each replicate, a natural pairing exists for the condition X and Y arrays.
Computer techniques
Statistical modules were programmed in Perl v5.8. Microsoft Visual Basic 6.0 was used to integrate the image processing and statistical modules.
Image processing
For Datasets 1–3 array images were processed using Agilent Feature Extraction software version A.6.1.1. The Feature Extraction Software provides normalized Cy3 and Cy5 channel intensity values for each spot on an array (in the gProcessedSignal and rProcessedSignal fields of the output files). The default settings were used for all options. Quality control algorithms in the software detect unusual (poor quality) spots; spots were excluded from analysis that contained a nonzero value any of the following fields: IsSaturated, IsFeatNonUnifOL, IsBGNonUnifOL, IsFeatPopnOL, IsBGPopnOL, IsManualFlag. For a detailed description of the Agilent Feature Extraction software and the algorithms it uses, see the Agilent Feature Extraction Version 6.1 Users' Manual. Briefly, Agilent Feature Extraction determines the foreground value for each channel based on the pixel values in a fixedsize circle centered on each spot. The median of pixel values in a concentric ring around the circle, with an excluded region between the outer boundary of the circle and the inner boundary of the ring, gives the spot background value. The raw spot value is calculated as its foreground value less its background value. A surrogate raw value is assigned when the foreground value does not exceed the background value by two standard deviations of the spot's background pixel values. Intensitybased normalization between channels using a linear regression and a lowess curvefit technique is then applied to remove any systematic dye incorporation biases.
Images were also processed using SPOT (CSIRO, New South Wales, Australia)[32], an Rbased implementation which uses seeded region growing to determine the foreground pixels for each spot and morphological opening to determine the background value for each spot. The raw spot values, foreground less the background values, are normalized between channels using an intensitybased Loess implementation in R available in the maNorm function of the marrayNorm package of the opensource Bioconductor software http://www.bioconductor.org. We considered three image processing techniques: Agilent Feature Extraction output alone, SPOT output alone with maNormbased normalization and Agilent foreground (gMedianSignal and rMedianSignal columns) less SPOT background (morphG and morphR columns) with maNormbased normalization.
Pooled standard deviations – direct comparison design
Three replicate arrays were processed for each direct comparison experiment. To map intensities from different replicates onto similar scales without altering the absolute ratio values, we multiplied the intensity values on each array by a constant such that mean square error between the intensities of that array and the intensities of the first replicate array was minimized. The multiplicative factor for array j is given by
, where G is the total number of spots and x and y are intensities for condition X and condition Y. Then, for each spot, the mean and sample (measured) standard deviation (σ) across array replicates were calculated for the logged ratio M = XY, where X and Y are log_{2}(x) and log_{2}(y). The sample standard deviation of M, σ_{M}, is calculated as . A replicate spot for which either channel was flagged as poor quality was excluded from these calculations. Spots for which there were less than two replicates of good quality were discarded from analysis.
The pooled logged ratio standard deviation, σ'_{M}, was calculated by sorting all the spots by the average logged intensity or the minimum logged intensity I_{min} across both channels of all replicates and then taking the square root of the moving average of the variance σ_{M} ^{2} with a window of 501 spots. We averaged the variance instead of the standard deviation, since averaging the standard deviation directly will produce a negatively biased (~13%) estimate for N = 3 [42]. The Zstatistic was then calculated as . Note that I_{avg} and M as defined above are equivalent to the symbols Ā and M, respectively, as used in other studies [17]. The common "MA plot" would be called an "MI plot" using the notation of this study.
Pooled standard deviations – reference sample design
Three pairs of arrays were processed for each reference sample experiment. For the unpaired analysis, the arrays within a given condition were linearly normalized to each other, in order to map intensities from different replicates onto similar scales without altering the absolute ratio values (as described above). For each condition, the mean M_{avg} and sample standard deviation σ_{M} of the logged ratio were calculated for each feature. The pooled standard deviation of the logged ratio, σ'_{M}, was calculated by sorting all the spots by the average intensity, I_{avg}, or the minimum intensity, I_{min}, across both channels of all replicates for the condition and then taking the square root of the moving average of the variance σ_{M} ^{2}, with a window of 501 spots, centered on the given spot. The Zstatistic was calculated as where N_{X} and N_{Y} are the number of replicates for the given spot for condition X and condition Y, respectively.
For the paired reference sample analysis, the intensity vectors were all linearly normalized to the vector for the first replicate array of condition X to put all intensity values from both conditions on the same scale without changing the value of the ratios. Then the paired difference of logged ratios μ = M_{X}  M_{Y} for each pair of replicates was computed. The mean and sample standard deviation of μ was then calculated across replicates. The pooled standard deviation of μ, σ'_{μ}, was calculated by sorting all the spots by the average intensity I_{avg} or the minimum intensity I_{min} across both channels of all replicates for both conditions, and then taking the square root of the moving average of the variance σ_{μ} ^{2}, with a window of 501 spots. The Zstatistic was calculated as where N is the number of paired replicates for the spot.
To compare Zstatistic values between the paired and unpaired methods, the linear regression slope coefficient with intercept set to 0 was calculated between corresponding Zstatistics from the two methods.
Calculation of pvalues
For a Zstatistic Z, the twotailed pvalue is given by 1  2Φ(Z), where Φ is the cumulative distribution function for the zeromean, unitvariance Gaussian. The pvalue is corrected for multiple tests using Sidak's formula, p' = 1(1p)^{L}, where L is the total number of spots being examined. Note that we did not find it necessary to use more sophisticated means of controlling the error rate [43, 44], as we are primarily concerned with ranking regulated genes and not in establishing firm statistical cutoffs.
Calculation of standard tstatistics and penalized tstatistics
Standard tstatistics for direct comparison arrays were calculated with the formula
. The twotailed pvalue was calculated using a t distribution with N1 degrees of freedom. In a penaltybased technique, a constant penalty s_{0} is included in the denominator of the tstatistic. The new statistic, d, is given by . Two different methods of choosing s_{0} were used: setting s_{0} to equal the 90^{th} percentile of the actual standard deviations and the significance analysis of microarrays (SAM) technique, which chooses s such that the coefficient of variation of d is minimized. The SAM technique was implemented using software developed at Stanford University Labs [19, 45]. This software imputes missing logged ratio values before calculating s_{0}, and this feature cannot be disabled. The Knearestneighbor technique was selected for imputation.
Outlier detection
When outlier detection was enabled, Zstatistics were calculated using the measured standard deviation instead of the pooled standard deviation for outlier spots. Outliers were determined by calculating σ_{ε}, the standard deviation of the residual error ε = σ  σ' for spots with σ > σ'. Spots for which ε > 2σ_{ε} were treated as outliers, similar to [26]. The measured standard deviations for the outlier points were considered to be valid sample measurements of the variance process and were not excluded from the calculation of the pooled standard deviations for spots with similar intensities.
Comparison of Zstatistic and penaltybased statistics
In order to test the reproducibility of different test statistics (Fig. 6), two sets of three arrays were randomly selected from the 23 replicate arrays in Dataset 4. For both of these subsets, we calculated the several different test statistics described above. For each gene, the value of each of the test statistics from one 3array subset was compared to the corresponding value from the other subset, using the squared Pearson's linear correlation coefficient, R^{2}, and two nonparametric, rankbased correlation coefficients, Spearman Rho and Kendall Tau, which were calculated using JMP (SAS Inc., Cary, NC). This entire process was repeated twice with the remaining arrays in Dataset 4, yielding a total of three independent comparisons. In total, six nonoverlapping sets of three arrays – 18 arrays in all – were drawn from the original pool of 23 arrays, leaving 5 arrays that were not used in this analysis. As the sets are nonoverlapping, each comparison is based on independent data.
In order to evaluate the accuracy of the different test statistics, we compared these statistics to an approximate "gold standard" measure (see Fig. 5). 3 arrays were randomly selected from the 23 arrays in Dataset 4; the other 20 were used to calculate "gold standard" tstatistics to which the results from the n = 3 dataset could be compared. The R^{2} value and the linear regression slope coefficient with intercept set to 0 were calculated between the corresponding experimental statistic and "gold standard" tstatistic for each gene. Only spots for which there were at least 15 replicates in the "gold standard" set of arrays were used. This process was repeated on a total of 6 random subsets.
Abbreviations
 X_{avg} :

Average logged intensity in channel X
 Y_{avg} :

Average logged intensity in channel Y
 M_{avg} :

Average ratio
 σ_{M} ^{2} :

Variance of the average ratio
 σ_{M} :

Standard deviation of the average ratio
 σ_{M}':

Pooled standard deviation of the average ratio
 I_{avg} :

Average logged intensity
 I_{min} :

Minimum logged intensity
 σ_{M}'(I_{avg}):

Pooled standard deviation of the average ratio, pooled using I_{min}
 σ_{M}'(I_{min}):

Pooled standard deviation of the average ratio, pooled using I_{min}
 μ:

Paired difference of ratios in a reference sample experiment
 σ_{μ}':

Pooled standard deviation of the paired difference of ratios
 Z(I_{avg}):

Zstatistic calculated with standard deviation pooled using average logged intensity
 Z(I_{min}):

Zstatistic calculated with standard deviation pooled using minimum logged intensity
 N:

Number of replicate arrays
 t:

Standard tstatistic
 d_{90th percentile} :

Penalized tstatistic calculated using 90^{th} percentile s_{0}
 d_{SAM} :

Penalized tstatistic calculated using Significance Analysis of Microarrays
 t_{gold} :

"Gold standard" tstatistic calculated using 20 replicate arrays
 R:

Pearson's correlation coefficient
References
Segal E, Yelensky R, Koller D: Genomewide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics. 2003, 19 Suppl 1: I273I282. 10.1093/bioinformatics/btg1038.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22: 281285. 10.1038/10343.
Tilstone Claire: Vital statistics. Nature. 2003, 424: 610612. 10.1038/424610a.
Smyth GK, Yang YH, Speed T: Statistical issues in cDNA microarray data analysis. Methods Mol Biol. 2003, 224: 111136. 10.1385/159259364X:111.
Comander J, Weber GM, Gimbrone M. A., Jr., GarciaCardena G: Argusa new database system for Webbased analysis of multiple microarray data sets. Genome Res. 2001, 11: 16031610. 10.1101/gr.186601.
Dudoit S, Gentleman RC, Quackenbush J: Open source software for the analysis of microarray data. Biotechniques. 2003, Suppl: 4551.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002, 30: e1510.1093/nar/30.4.e15.
Kepler TB, Crosby L, Morgan KT: Normalization and analysis of DNA microarray data by selfconsistency and local regression. Genome Biol. 2002, 3: RESEARCH003710.1186/gb200237research0037.
Churchill GA: Fundamentals of experimental design for cDNA microarrays. Nat Genet. 2002, 32 Suppl: 490495. 10.1038/ng1031.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci U S A. 1998, 95: 1486314868. 10.1073/pnas.95.25.14863.
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A. 1999, 96: 29072912. 10.1073/pnas.96.6.2907.
Chen Y, Kamat V, Dougherty ER, Bittner ML, Meltzer PS, Trent JM: Ratio statistics of gene expression levels and applications to microarray data analysis. Bioinformatics. 2002, 18: 12071215. 10.1093/bioinformatics/18.9.1207.
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol. 2001, 8: 3752. 10.1089/106652701300099074.
Pan W, Lin J, Le CT: How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol. 2002, 3: research0022
Kerr MK, Afshari CA, Bennett L, Bushel P, Martinez J, Walker NJ, Churchill GA: Statistical Analysis of a Gene Expression Microarray Experiment with Replication. 2001, Bar Harbor, The Jackson Laboratory, [http://www.jax.org/staff/churchill/labsite/research/expression/niehs.pdf]
Broberg P: Statistical methods for ranking differentially expressed genes. Genome Biol. 2003, 4: R4110.1186/gb200346r41.
Dudoit S, Yang YH, Callow MJ, Speed T: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002, 12: 111139.
Lonnstedt I, Speed T: Replicated microarray data. 2001, Uppsala, Uppsala University, [http://www.stat.berkeley.edu/users/terry/zarray/TechReport/Baypap4d.pdf]
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001, 98: 51165121. 10.1073/pnas.091062498.
Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 11511160. 10.1198/016214501753382129.
Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t test and statistical inferences of gene changes. Bioinformatics. 2001, 17: 509519. 10.1093/bioinformatics/17.6.509.
Colantuoni C, Henry G, Zeger S, Pevsner J: SNOMAD (Standardization and NOrmalization of MicroArray Data): webaccessible gene expression data analysis. Bioinformatics. 2002, 18: 15401541. 10.1093/bioinformatics/18.11.1540.
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell. 2000, 102: 109126. 10.1016/S00928674(00)000155.
Baggerly KA, Coombes KR, Hess KR, Stivers DN, Abruzzo LV, Zhang W: Identifying differentially expressed genes in cDNA microarray experiments. J Comput Biol. 2001, 8: 639659. 10.1089/106652701753307539.
Coombes KR, Highsmith WE, Krogmann TA, Baggerly KA, Stivers DN, Abruzzo LV: Identifying and quantifying sources of variation in microarray data using highdensity cDNA membrane arrays. J Comput Biol. 2002, 9: 655669. 10.1089/106652702760277372.
Quackenbush J: Microarray data normalization and transformation. Nat Genet. 2002, 32 Suppl: 496501. 10.1038/ng1032.
Rocke DM, Durbin B: A model for measurement error for gene expression arrays. J Comput Biol. 2001, 8: 557569. 10.1089/106652701753307485.
Nadon R, Shi P, Skandalis A, Woody E, Hubschle H, Susko E, Rghei N, Ramm P: Statistical inference methods for gene expression arrays. 2001, St. Catharines, Imaging Research Inc., [http://imaging.brocku.ca/PDF_files/AST_Technicalnote.pdf]
Tsodikov A, Szabo A, Jones D: Adjustments and measures of differential expression for microarray data. Bioinformatics. 2002, 18: 251260. 10.1093/bioinformatics/18.2.251.
Yang IV, Chen E, Hasseman JP, Liang W, Frank BC, Wang S, Sharov V, Saeed AI, White J, Li J, Lee NH, Yeatman TJ, Quackenbush J: Within the fold: assessing differential expression measures and reproducibility in microarray assays. Genome Biol. 2002, 3: research0062
Cheadle C, Vawter MP, Freed WJ, Becker KG: Analysis of microarray data using Z score transformation. J Mol Diagn. 2003, 5: 7381.
Yang YH, Buckley MJ, Dudoit S, Speed T: Comparison of Methods for Image Analysis on cDNA Microarray Data, Technical report #584. 2000, 2003: [http://www.stat.berkeley.edu/users/terry/zarray/TechReport/584.pdf]
Dudley AM, Aach J, Steffen MA, Church GM: Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci U S A. 2002, 99: 75547559. 10.1073/pnas.112683499.
Livesey FJ, Furukawa T, Steffen MA, Church GM, Cepko CL: Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Curr Biol. 2000, 10: 301310. 10.1016/S09609822(00)003791.
Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. J Comput Biol. 2000, 7: 819837. 10.1089/10665270050514954.
Xu R, Li X: A comparison of parametric versus permutation methods with applications to general and temporal microarray gene expression data. Bioinformatics. 2003, 19: 12841289. 10.1093/bioinformatics/btg155.
Workman C, Jensen LJ, Jarmer H, Berka R, Gautier L, Nielser HB, Saxild HH, Nielsen C, Brunak S, Knudsen S: A new nonlinear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 2002, 3: research004810.1186/gb200239research0048.
Durbin B, Rocke DM: Estimation of transformation parameters for microarray data. Bioinformatics. 2003, 19: 13601367. 10.1093/bioinformatics/btg178.
Qian J, Kluger Y, Yu H, Gerstein M: Identification and correction of spurious spatial correlations in microarray data. Biotechniques. 2003, 35: 424, 46, 48.
Manduchi E, Grant GR, McKenzie SE, Overton GC, Surrey S, Stoeckert C. J., Jr.: Generation of patterns from gene expression data by assigning confidence to differentially expressed genes. Bioinformatics. 2000, 16: 685698. 10.1093/bioinformatics/16.8.685.
Agilent Fluorescent Direct Label Kit Protocol Rev. 2.1. 2003, Agilent Technologies, [http://www.chem.agilent.com]G255598003
Sokal R, Rohlf FJ: Biometry: the principles and practice of statistics in biological research. 2000, W.H. Freeman and Co., 3rd ed.; page 53: page 53
Ge Y, Dudoit S, Speed T: Resamplingbased multiple testing for microarray data hypothesis. Test. 2003, 12: 177.
Dudoit S, Shaffer JP, Boldrick JC: Multiple Hypothesis Testing in Microarray Experiments. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 110. 2002, [http://www.bepress.com/ucbbiostat/paper110]
Significance Analysis of Microarrays. [http://wwwstat.stanford.edu/~tibs/SAM/]
Acknowledgements
We would like to thank John Aach, Adnan Derti, and Yonatan Grad for a critical review of this manuscript, and Sandrine Dudoit for helpful discussions. We are grateful to the Alliance for Cellular Signaling for publicly releasing one of the microarray datasets used in this study. This work was supported by the Leet and Patterson Trust (GGC) and NIH P50HL56985 (MAG).
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
JC and SN conceived of the study, implemented the I_{avg} and I_{min} algorithms, processed the microarray data used in the analyses, and completed the comparisons between the algorithms described herein. JC and GGC performed some of the microarray experiments analyzed in this study. MAG and GGC participated in the design and coordination of the study. All authors read and approved the final manuscript.
Jason Comander, Sripriya Natarajan contributed equally to this work.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Comander, J., Natarajan, S., Gimbrone, M.A. et al. Improving the statistical detection of regulated genes from microarray data using intensitybased variance estimation. BMC Genomics 5, 17 (2004). https://doi.org/10.1186/14712164517
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/14712164517
Keywords
 Outlier Detection
 Image Processing Technique
 Pool Standard Deviation
 Measured Standard Deviation
 Agilent Feature Extraction Software