Our results demonstrate that the analysis of dual-color microarray gene expression experiments using intensity-based linear models outperforms the standard ratio-based analysis. Both reproducibility and sensitivity were enhanced in detecting differential gene expression in two independent datasets.
By analyzing technically replicated experiments, we determined the effect of both models on the reproducibility of gene rankings. Our studies show that for both the cell line and brain datasets the intensity-based analysis provides more reproducible gene rankings than the ratio-based analysis of the same dataset. For the cell line dataset, 78% of the 1,000 most significant genes are reproduced between the two duplicate datasets C1 and C2 when using the intensity analysis, whereas only 73% of the genes are reproduced with the ratio analysis (see Figure 4C). For the brain datasets B1 and B2, the difference between ratio- and intensity-based reproducibility is far more pronounced: only 4% of the top 1,000 genes are reproduced in the ratio analysis, while there is still a substantial overlap of 51% between intensity-based gene rankings (Figure 6C). The reasons behind this apparent discrepancy between the cell line and brain datasets are addressed below. An independent line of evidence, based on model selection, also indicated that intensity-based models are preferred over ratio-based models for the analysis of dual-color microarray data. When performing Bayesian Information Criterion (BIC) model selection calculations, we found that for 95% of the transcripts in the cell line experiment, and virtually all transcripts in the human brain experiment, the intensity model was favored over the ratio model. Furthermore, for both the cell line dataset and a publicly available third dataset, a comparison between ANOVA-based array and treatment effect sizes revealed that the treatment effects are much larger than the array effects.
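The reproducibility measure used above (the fraction of the k most significant genes shared by two replicate analyses) can be illustrated with a minimal sketch. The function name `top_k_overlap`, the gene identifiers and the p-values are hypothetical and serve only as an illustration; they are not taken from the datasets described here.

```python
def top_k_overlap(pvals_a, pvals_b, k):
    """Fraction of the k most significant genes shared by two rankings.

    pvals_a, pvals_b: dicts mapping gene identifier -> p-value,
    one dict per replicate dataset (hypothetical input format).
    """
    # Rank genes by ascending p-value and keep the k most significant.
    top_a = set(sorted(pvals_a, key=pvals_a.get)[:k])
    top_b = set(sorted(pvals_b, key=pvals_b.get)[:k])
    return len(top_a & top_b) / k


# Toy p-values for five genes in two replicate datasets (made up).
replicate_1 = {"g1": 0.001, "g2": 0.004, "g3": 0.020, "g4": 0.300, "g5": 0.900}
replicate_2 = {"g1": 0.002, "g2": 0.500, "g3": 0.010, "g4": 0.030, "g5": 0.700}

# g1 and g3 appear in both top-3 lists, so two of three genes are reproduced.
overlap = top_k_overlap(replicate_1, replicate_2, k=3)
```

With k = 1,000 and the p-values from each model, the same calculation would yield the kind of percentages reported for datasets C1/C2 and B1/B2.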
Combining the gene ranking, relative effect size and model selection results, we argue that simply by selecting the intensity model instead of the ratio model for the analysis of the same set of gene expression measurements, more reproducible results are obtained.
It should be noted that the relative advantage of dropping the array effect depends on the complexity of the design and the sample size (the number of arrays). For the relatively simple MAQC dataset, BIC selects the model with array effect for 29% of the genes, much more frequently than for either the brain or the cell line dataset. The beneficial effect of dropping the array effect from the model seems more pronounced in experiments that employ direct designs to address complex comparisons, such as time series and multifactorial experiments.
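To make the per-gene BIC comparison concrete, the following is a minimal sketch for a Gaussian linear model, using the standard form BIC = n ln(RSS/n) + k ln(n) up to a constant that is shared by both models. The function name, the residual sums of squares and the parameter counts are illustrative assumptions, not values from our experiments.

```python
import math


def gaussian_bic(rss, n_obs, n_params):
    """BIC of a Gaussian linear model, up to an additive constant.

    rss: residual sum of squares of the fitted model
    n_obs: number of observations (single-channel intensities)
    n_params: number of fitted mean parameters
    The error-variance parameter is omitted because it adds the same
    penalty to both models and does not change the comparison.
    """
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# Hypothetical gene measured on 8 dual-color arrays (16 intensities).
# Model without array effect: 2 treatment means.
bic_without_array = gaussian_bic(rss=4.0, n_obs=16, n_params=2)
# Model with array effect: 2 treatment means + 7 array parameters.
# The fit is slightly better (lower RSS) but the penalty is much larger.
bic_with_array = gaussian_bic(rss=3.2, n_obs=16, n_params=9)

# The model with the lower BIC is preferred for this gene.
preferred = "without array effect" if bic_without_array < bic_with_array else "with array effect"
```

In this toy case the small reduction in residual variance does not compensate for the seven additional parameters, which mirrors the situation in which the array effect explains only a small proportion of the variability.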
Adding to the enhanced reproducibility, intensity-based analysis is more sensitive in the detection of differential gene expression, as reflected in more significant p-values. It is important to note that, by selecting the ratio-based p-value of the 1,000th most significant gene as a cutoff, almost all of the 1,000 genes (89% for dataset C1, 92% for dataset C2) are also significant in the intensity-based analysis using the same cutoff. Interestingly, this analysis also reveals that 3,335 genes, not selected by the ratio model, are reproducibly more significant than the 1,000th gene in the ratio results. This provides additional evidence for the enhanced sensitivity of the intensity model over the ratio model. Due to the poor reproducibility of the ratio-based results in the brain dataset, such calculations were not meaningful for that dataset.
Enhanced sensitivity due to ignoring the array effect in the linear model
The observation that ratio-derived p-values can be improved by intensity-based models can be attributed to the inclusion of the array effect in the ratio-based linear model. Pairing of data is a powerful concept for removing subject-specific bias. In particular, when the quality of the spot printing procedure is not constant (often the case with in-house spotted arrays), it is essential to account for an array effect in the ANOVA model. But there is a price to pay: degrees of freedom. The total number of degrees of freedom equals the number of samples. The array effect consumes almost half of the degrees of freedom. However, due to the high quality of commercially available dual-color oligonucleotide microarrays, we and others observed that the ratios of the same sample pair, measured on different arrays, are strongly correlated, which means that the array effect is likely to be very small. When using a ratio-based model to analyze the data, many degrees of freedom are used to estimate the array effect, which explains only a small proportion of the variability. This ultimately results in less significant p-values, a lower correlation between p-values from the two replicate experiments, and a smaller proportion of reproduced top-ranked genes. Indeed, the results from the model selection experiments clearly indicate that the model without array effect is the preferred model for both datasets. It should be noted that we do not state that the array effect is absent: our analyses in fact show that an array effect is present in modern dual-color microarray experiments. Furthermore, the results from the power calculations for the MAQC dataset show that including the array effect can be slightly beneficial for certain sample sizes. However, we conclude from our experiments that for both the brain and cell line datasets, the array effect is too small in comparison to the main factor of interest (treatment) to justify incorporation into the ANOVA model.
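The degrees-of-freedom argument can be made explicit with a small bookkeeping sketch. It assumes a simple design in which each dual-color array contributes two single-channel observations and each treatment group is described by one mean parameter; the function name and the example numbers are hypothetical.

```python
def residual_df(n_arrays, n_groups, array_effect):
    """Residual degrees of freedom for a per-gene linear model.

    Assumes two observations per dual-color array and one mean
    parameter per treatment group (illustrative simplification).
    """
    n_obs = 2 * n_arrays          # one intensity per channel
    used = n_groups               # treatment group means
    if array_effect:
        used += n_arrays - 1      # array effect: one parameter per array, minus one
    return n_obs - used


# Hypothetical two-group experiment on 8 arrays (16 samples).
df_without_array = residual_df(n_arrays=8, n_groups=2, array_effect=False)  # 14
df_with_array = residual_df(n_arrays=8, n_groups=2, array_effect=True)      # 7
```

In this example the array effect takes 7 of the 16 available degrees of freedom, leaving half as many for estimating the residual variance as the model without array effect.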
A possible argument for the inclusion of the array effect is the potential competition for spot binding between the co-hybridized samples. However, our and other studies suggest that competition is not an issue [7, 10]. This can be derived from the strong correlation between the real and in silico reconstructed ratios (see Additional files 2 and 3), and the hierarchical clustering in Figures 1 and 5. Our study was, however, not conducted to demonstrate that ratios can be reconstructed in silico by using separate intensities. Indeed, this has been demonstrated before. Our specific aim was to compare the performance of ratio- and intensity-based methods based on the main outcome of comparative gene expression experiments: a list of ranked genes. As this gene ranking provides the basis for further research, it needs to be robust and reproducible. We show here that intensity-based methods provide more reproducible results and are more sensitive in detecting differential gene expression, and thus outperform the standard ratio-based analysis.
Biological variation negatively affects ratio-based, but not intensity-based, replication
As indicated earlier, in the human brain experiment, we observed a striking lack of reproducibility (r = 0.05) between p-values generated by the ratio model on the replicate datasets B1 and B2, whereas the intensity-based p-values reproduced quite well (r = 0.46). These findings can be attributed to the following. First of all, the overall p-values (both intensity- and ratio-based) are less significant in the human brain experiment than in the cell line experiment, due to the large biological variation between individuals. Second, due to the relatively low level of biological replication, few degrees of freedom are left for estimating the biological effect. Third, the brain experiment was not designed with splitting the data into two technical replicates in mind. While the two data sets are biologically identical, the samples are paired differently on the arrays between the two replicate datasets (see Additional file 4). Since this pairing is more or less arbitrary, the results should be robust against this artifact, but this is not necessarily the case for the ratio-based analysis. When the biological variation is large, different sample pairings may result in differences in measured ratios, a phenomenon we observed in the brain dataset (Figure 7 and Additional file 5). The intensity-based analysis of brain datasets B1 and B2 does not suffer from these drawbacks: no ratios are calculated, and more degrees of freedom are left for estimating the biological effect of interest, resulting in a substantial proportion of reproducible findings (51% of the 1,000 most significant genes), and a relatively high correlation between p-values (r = 0.46). In a setting with many biological replicates per level (e.g. comparison of two large groups) the differences in correlation between the ratio-based and intensity-based analysis are likely to be smaller.
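The sensitivity of ratios to the choice of sample pairing can be illustrated with a toy calculation. The log2 intensities below are invented and the two-group layout is far simpler than the brain experiment; the sketch only shows that re-pairing the same samples changes the per-array log ratios and the resulting paired t-statistic, while the group means used by an intensity-based analysis stay the same.

```python
import math
import statistics


def paired_t(log_ratios):
    """One-sample t-statistic of per-array log ratios (ratio-based view)."""
    n = len(log_ratios)
    return statistics.mean(log_ratios) / (statistics.stdev(log_ratios) / math.sqrt(n))


# Hypothetical log2 intensities of one gene in four individuals.
control = {"a": 8.0, "b": 10.0}
treated = {"c": 9.0, "d": 12.0}

# Pairing 1: arrays (c vs a) and (d vs b); pairing 2: (c vs b) and (d vs a).
ratios_pairing_1 = [treated["c"] - control["a"], treated["d"] - control["b"]]  # [1.0, 2.0]
ratios_pairing_2 = [treated["c"] - control["b"], treated["d"] - control["a"]]  # [-1.0, 4.0]

t_pairing_1 = paired_t(ratios_pairing_1)
t_pairing_2 = paired_t(ratios_pairing_2)

# The intensity-based group difference does not depend on the pairing.
group_difference = statistics.mean(treated.values()) - statistics.mean(control.values())
```

Here the mean log ratio is 1.5 under both pairings, but the spread of the ratios, and therefore the test statistic, differs considerably when the variation between individuals is large.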
Our studies indicate that the reliability of gene rankings obtained from dual-color microarray experiments can be improved by using intensity-based models. An added benefit of the intensity-based analysis is that intensity models do not suffer from the drawbacks of ratio models in the analysis of complex direct dual-color experiments. Designs such as the interwoven loop design address the increased complexity of microarray experiments, which have progressed from "simple" two-group comparisons to multifactorial or time-course experiments. These direct designs are efficient, but they often favor certain comparisons over others and cannot easily be extended by adding more groups or samples. There are no such limitations when analyzing dual-color experiments with intensity-based models. Finally, the LIMMA software package also uses intensity data from dual-color experiments, but mainly as a solution to compare samples that are unconnected in the hybridization design. Here, we provide evidence that it is beneficial to perform an intensity-based analysis for connected designs as well. It should be noted that the observed improvements may be limited to dual-color arrays and that further experiments are needed to establish whether these results generalize to other array designs.