It is important to select pre-processing methods that suit the experimental setup of a given data set. If the sample sizes of the different groups are relatively small, it is crucial to achieve a homogeneous variance across the groups. If sample sizes are large, variances can be estimated separately for each group and one should focus on unbiased fold changes instead. Since the sample sizes for the current data set are rather small (three to four replicates per group), a stable variance is more important than an exact representation of the fold change. In general, the data should be normalized without removing too much of the real variation. Figure 1 summarizes the quality measures for all methods we investigated and thereby documents the basis for the final choice. Clustering of the assigned quality scores reveals two major tendencies that depend on background normalization. On the one hand, data that was background normalized (bg_*) tends to reflect the real fold changes better, i.e. shows less bias. On the other hand, pre-processing without background normalization (noBg_*) leads to a more homogeneous variance. Accurately defined, constant experimental conditions across all experiments, together with the fact that the experiments were conducted in parallel, have probably led to a relatively consistent background level across all samples. Since background correction can introduce additional variation, this could explain why, for our data set, data that was not background normalized (noBg_*) generally shows better variance stabilization than background normalized data (bg_*). Methods combining background normalization with vst (bg_vst_*) constitute an exception. Here, vst leads to a better stabilization of variance while introducing more bias. As vst estimates an offset for the background based on the data, noBg_vst_* and bg_vst_* pre-processing methods could lead to similar results.
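The notion of a homogeneous variance across groups can be made concrete with a small sketch. The Python code below is only an illustration and not the statistic underlying Figure 1: the function names, the use of the median per-probe variance as a group summary, the max/min ratio, and the example values are all assumptions made for this example.

```python
from statistics import median, variance


def group_variance_summary(groups):
    """Return the median per-probe sample variance for each group.

    `groups` maps a group name to a list of replicate profiles; each
    profile is a list of normalized (e.g. log2) intensities, one per
    probe, in the same probe order for every replicate.
    """
    summary = {}
    for name, replicates in groups.items():
        if len(replicates) < 2:
            raise ValueError(f"group {name!r} needs at least two replicates")
        if len({len(profile) for profile in replicates}) != 1:
            raise ValueError(f"group {name!r} has profiles of unequal length")
        # zip(*replicates) yields, per probe, the values across replicates.
        per_probe = [variance(values) for values in zip(*replicates)]
        summary[name] = median(per_probe)
    return summary


def heterogeneity_ratio(summary):
    """Largest over smallest group variance; 1.0 means identical spread."""
    lowest = min(summary.values())
    highest = max(summary.values())
    return highest / lowest if lowest > 0 else float("inf")


# Hypothetical toy data: two groups, three replicates, two probes each.
toy_groups = {
    "control": [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]],
    "treated": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
}
toy_summary = group_variance_summary(toy_groups)
toy_ratio = heterogeneity_ratio(toy_summary)
```

Under these assumptions, a ratio close to 1 would indicate that a pre-processing method yields similar variability in all groups, which is the property that matters most when only three to four replicates per group are available.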
One has to keep in mind that, based on the individual analyses, several methods result in nearly equal quality. Therefore, it is not possible to give a well-defined rationale for using only one specific method. After excluding the methods that clearly violate the imposed criteria, the decision is still subjective. It depends, for example, on whether one prioritizes a good estimate of fold changes or a small and homogeneous variance. Ultimately, the decision rests on experience; yet, with the analyses and criteria described here, we provide a recommendation on how to pre-select appropriate methods. Since, for our data set, we intended to achieve a low and homogeneous variance, we provided a larger number of statistics investigating variance, some of which overlap to a certain degree. If the focus is on a good estimate of the fold change, the researcher should give more weight to statistics investigating this measure. Correlation to qRT-PCR, or slope and intercept of the regression between qRT-PCR and gene expression fold changes, are examples of analyses that could be of greater interest in this context. Focusing on variance, the methods best suited for the data set analysed here are noBg_log_quantile and noBg_log_rsn. Although log2-transformation in combination with quantile normalization has been reported to perform relatively well by Du et al. and Dunning et al. [8, 10], we decided to make use of robust spline normalization (rsn). Besides its performance on our measures, rsn was selected because it aims to combine the strengths of quantile normalization (preservation of the rank order) and spline interpolation (continuous mapping of the values) while circumventing their respective drawbacks (discontinuous mapping of intensity values and loss of rank preservation) [17, 23].
Surprisingly, the use of vst as recommended by Dunning et al. and by Du et al. [9, 17, 23], and the combination of vst with rsn as successfully used by Du et al., did not perform as well as expected. Reasons for this could be the different experimental setups (two replicates per group in the Barnes setup used for validation of vst compared to three to four replicates in our setup) or the use of a newer Illumina chip technology, namely HumanHT-12 v3 chips, in our experiment. vst has been validated based on a pre-released version of the HumanRef-8 v1 Expression BeadChip that contained 19 (25% quantile) to 30 (75% quantile) beads per probe. On the HumanHT-12 v3 chips, an average of only 15 beads per probe is available. Since vst makes use of those technical replicates, this could lead to a slightly worse performance on the new chip generation. In general, vst still performs well in stabilizing the variance but is outperformed by noBg_log_quantile, noBg_log_rsn, and noBg_vsn in reflecting the results measured by qRT-PCR. When utilising BeadStudio normalizations, in accordance with Dunning et al. [8, 10], who advised against the use of background normalization, we recommend using cubic spline without background normalization (noBg_cubicSpline). As displayed in Figure 1, noBg_cubicSpline outperforms all other BeadStudio normalization methods.
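To illustrate what a log2-transformation followed by quantile normalization does to the data, and why it preserves rank order while mapping intensities onto a discrete set of values, a minimal Python sketch is given below. It is not the implementation used in our analysis or in any particular software package; the function name and the toy intensities are assumptions, and ties are resolved by probe order rather than by averaging, which is a simplification.

```python
from math import log2


def log2_quantile_normalize(samples):
    """Log2-transform and quantile-normalize a set of samples.

    `samples` is a list of per-sample intensity lists; every list holds
    strictly positive intensities for the same probes in the same order.
    Returns one normalized list per sample.
    """
    n_probes = len(samples[0])
    if any(len(sample) != n_probes for sample in samples):
        raise ValueError("all samples must cover the same probes")

    logged = [[log2(value) for value in sample] for sample in samples]

    # Reference distribution: mean of the k-th smallest value over samples.
    sorted_samples = [sorted(sample) for sample in logged]
    reference = [
        sum(sample[k] for sample in sorted_samples) / len(sorted_samples)
        for k in range(n_probes)
    ]

    normalized = []
    for sample in logged:
        # Probe indices ordered by intensity; position in this list = rank.
        order = sorted(range(n_probes), key=lambda i: sample[i])
        result = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            result[probe_index] = reference[rank]
        normalized.append(result)
    return normalized


# Hypothetical raw intensities for two samples and three probes.
toy_normalized = log2_quantile_normalize([[2.0, 8.0, 4.0], [16.0, 4.0, 64.0]])
# toy_normalized == [[1.5, 4.5, 3.0], [3.0, 1.5, 4.5]]
```

After normalization every sample contains exactly the same set of values, assigned according to each probe's rank within its sample. This shows both the rank preservation and the discontinuous mapping of intensity values mentioned above for quantile normalization.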
Spike-in or dilution data is frequently used for evaluating different normalization methods [5, 7–10]. If no such data is available for the microarray chip type used, we propose to perform qRT-PCR for genes covering a wide range of expression intensities in order to obtain a measure for judging the quality of pre-processing methods. This makes it possible to assess how well different normalization methods reflect the real changes in expression intensities across different expression levels.
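A comparison of this kind can be sketched as a simple least-squares regression of microarray log2 fold changes on qRT-PCR log2 fold changes. The Python code below is a hedged illustration only: the function name and the example values are hypothetical, and ordinary least squares with a Pearson correlation is assumed here as one straightforward way to obtain slope, intercept, and correlation.

```python
from math import sqrt


def fold_change_agreement(qpcr_log2fc, array_log2fc):
    """Regress microarray log2 fold changes on qRT-PCR log2 fold changes.

    Returns (slope, intercept, pearson_r). A slope near 1 and an
    intercept near 0 indicate little bias; a slope below 1 indicates
    that the normalized array data compresses the fold changes.
    """
    n = len(qpcr_log2fc)
    if n != len(array_log2fc):
        raise ValueError("both inputs must cover the same genes")
    if n < 3:
        raise ValueError("at least three genes are needed")

    mean_x = sum(qpcr_log2fc) / n
    mean_y = sum(array_log2fc) / n
    sxx = sum((x - mean_x) ** 2 for x in qpcr_log2fc)
    syy = sum((y - mean_y) ** 2 for y in array_log2fc)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(qpcr_log2fc, array_log2fc)
    )
    if sxx == 0 or syy == 0:
        raise ValueError("fold changes must not be constant")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Hypothetical log2 fold changes for five genes: the array values are
# exactly half of the qRT-PCR values, i.e. perfectly correlated but
# compressed.
toy_qpcr = [-3.0, -1.0, 0.0, 2.0, 4.0]
toy_array = [-1.5, -0.5, 0.0, 1.0, 2.0]
toy_slope, toy_intercept, toy_r = fold_change_agreement(toy_qpcr, toy_array)
```

In this toy example the slope is 0.5, the intercept is 0, and the correlation is 1. A high correlation alone therefore does not rule out a systematic underestimation of fold changes, which is why slope and intercept are worth inspecting alongside it.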
In summary, we provide statistical measures on the basis of which researchers can decide on the best-suited pre-processing scenario for their own experimental design. If no spike-in data is available, we recommend conducting qRT-PCR for selected, representative transcripts, which makes it possible to estimate the bias of log2 ratios obtained from normalized data. Together with the measures for the variability of the data, this provides the basis for weighing accurately measured changes against a low and homogeneous variance, and thus for selecting an appropriate normalization method.