Microarrays represent a powerful means of rapidly assessing genome-wide expression patterns. Unfortunately, confounding technical variation and systematic error in array technologies present a major obstacle to their adoption for clinical diagnostics in humans. These factors are rarely documented and poorly understood, and their implications for the experimental and clinical utility of microarrays are frequently ignored [1, 4]. Building on previous investigations of technical variation between replicate RNA samples from breast tumour biopsies, this extended study used both Illumina and Affymetrix arrays to explore the reliability of reported expression values across a variety of experimental designs. Re-analysis of our MCF7 and MCF10A Affymetrix datasets demonstrated that failing to compensate for some batch effects, such as the use of a different scanner (e.g. if that information was not available), would have a much smaller effect on the number of genes identified as commonly differentially expressed than failing to compensate for others, such as the labelling method or whether the RNA was amplified.
Using a large combined dataset of conserved, reliably detected probes on Illumina Ref-8 and HT-12 BeadChips (experiments 1 through 3), we found that the correlations between replicate UHRR hybridisations were consistently poorer than those previously reported using the Ref-8 data alone. Interestingly, labelled UHRR samples from the original experiment, which had been stored at -80°C for approximately two years and were then hybridised to two arrays on the new HT-12 chips, correlated better with the original Ref-8 samples than did freshly prepared UHRR replicates. This suggests that even long periods of frozen storage and additional freeze-thaw cycles introduce less noise into experimental measurements than is inherent in creating a new preparation of labelled cRNA, even from the same RNA source.
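The replicate comparisons described above rest on pairwise correlations between hybridisations of the same reference RNA. The following is a minimal sketch of that calculation, not the analysis code used in this study. It assumes each array is supplied as a list of log2 intensities in a shared probe order; the function names `pearson` and `pairwise_replicate_correlations` are hypothetical and introduced here only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return sxy / sqrt(sxx * syy)


def pairwise_replicate_correlations(arrays):
    """Correlate every pair of replicate arrays.

    arrays: dict mapping an array name to its list of log2 intensities
            (assumed to share the same probe order).
    Returns a dict mapping (name_a, name_b) to the Pearson correlation.
    """
    names = sorted(arrays)
    return {
        (a, b): pearson(arrays[a], arrays[b])
        for a, b in combinations(names, 2)
    }
```

Applied to a set of UHRR replicates, the returned dictionary could be summarised separately for pairs within one chip, within one run, and across experiments, which is the kind of stratified comparison discussed in this section.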
As in our previous investigation, quantile normalisation did little to improve correlation between the UHRR replicates across the Ref-8/HT-12 dataset. However, specific batch correction using ComBat once again significantly improved correlations, and it is a valuable tool for removing systematic error introduced between experiments and/or processing runs. The inter-chip variation in the new HT-12 datasets was almost double that in the Ref-8 dataset; because of this increase and the high levels of inter-experiment variation, the inter-run variation in the combined dataset was largely obscured. However, as we have previously seen, inter-experiment and inter-run variances were largely eliminated following ComBat correction.
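To make the distinction between the two adjustments concrete, the sketch below contrasts quantile normalisation with a deliberately simplified batch correction. It is an illustration under stated assumptions only: the per-probe mean-centring shown here is a stand-in for, and not an implementation of, ComBat, which additionally uses empirical Bayes estimation and adjusts scale as well as location. Arrays are assumed to be equal-length lists of log2 intensities, ties are broken by position, and the function names are hypothetical.

```python
def quantile_normalise(arrays):
    """Force every array to share the same empirical distribution.

    arrays: list of equal-length lists, one per array.
    Each value is replaced by the mean, across arrays, of the values
    holding the same rank. Ties are broken by position (a simplification).
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[i] for s in sorted_arrays) / len(arrays)
        for i in range(n_probes)
    ]
    normalised = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalised.append(out)
    return normalised


def centre_batches(arrays, batches):
    """Simplified location-only batch adjustment (NOT ComBat).

    For each probe, subtract the mean of the array's batch and add back
    the overall mean, so that every batch shares the same per-probe mean.
    batches: list of batch labels, one per array.
    """
    if len(arrays) != len(batches):
        raise ValueError("one batch label is required per array")
    n_probes = len(arrays[0])
    overall = [
        sum(a[i] for a in arrays) / len(arrays) for i in range(n_probes)
    ]
    corrected = [list(a) for a in arrays]
    for label in set(batches):
        members = [k for k, b in enumerate(batches) if b == label]
        for i in range(n_probes):
            batch_mean = sum(arrays[k][i] for k in members) / len(members)
            for k in members:
                corrected[k][i] = arrays[k][i] - batch_mean + overall[i]
    return corrected
```

Quantile normalisation equalises the overall intensity distribution of each array but leaves probe-specific shifts between batches in place, whereas a per-probe, per-batch adjustment targets exactly those shifts. This is one plausible reading of why the former did little for replicate correlation here while explicit batch correction helped.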
Variance estimates from the MAQC Illumina dataset were smaller in magnitude than those obtained from our Ref-8 data, revealing a surprisingly high level of reliability between the three laboratories that performed these experiments. In contrast, the MAQC Affymetrix dataset was found to be far more variable than the MAQC Illumina data, but similar in magnitude of variance to both our combined Ref-8/HT-12 dataset and our Affymetrix dataset. The reason for the low variation in the MAQC Illumina data is unknown, especially since the MAQC study design deliberately split the sample replicates before cRNA synthesis, a much earlier stage in the sample-preparation workflow than for our replicates (which were split after amplification and labelling). It is possible that the small number of laboratories (three, in total) performing the MAQC Illumina hybridisations produced highly concordant data completely by chance, while the larger number of laboratories (six, in total) performing the Affymetrix experiments provided a more realistic reflection of the technical variation in these data.
Pooled sample vs. UHRR as batch-effect calibrators
Several studies have found replicate control samples such as UHRR to be a useful standard in microarray experiments, suitable for monitoring expression consistency within and across a variety of genome-wide expression platforms [29–32]. However, such commercial controls are deliberately generic, and deficiencies have been reported in how well they represent specific cell types. Clearly UHRR is not representative of breast tumour RNA, and it therefore carries no guarantee of expressing the RNAs that may be variably expressed among the specific subset of genes changed in breast tumour tissue. Consequently, in terms of compensating for confounding technical variation, the very probes for which correction is most important are those most neglected by the UHRR controls.
Unfortunately, the relatively small degree of legitimate biological differential expression between the pre- and post-treatment tumour biopsies provided little opportunity to assess the relative performance of UHRR and pooled batch calibrators in terms of the consistency of reported differentially expressed probes. However, compared to the UHRR, the pooled tumour RNA controls were shown to emulate more faithfully the individual shifts in expression between tumour technical duplicates that result from variation introduced between runs and between chips. Had it been possible to identify more legitimately differentially expressed probes between pre- and post-treatment samples, the pooled RNA would almost certainly have made for a better batch calibrator during ComBat correction than the UHRR controls. Had a similar pooled calibrator been used in our previous study, it seems reasonable to speculate that the consistency between the gene lists reported as significantly differentially expressed would have been noticeably higher than the 74.1% achieved using UHRR calibrators.
Batch effects relative to probe position and composition
We took the opportunity to assess compositional properties of the probes as a potential explanation of, or surrogate for, the technical effects observed in the Ref-8/HT-12 data. A highly significant trend in favour of low GC-content was identified in the core set of probes consistently affected by inter-run and inter-chip variation between sample duplicates. A similar, but less significant, enrichment for low-GC-high-SD probes was also observed in the MAQC Illumina dataset. This suggests that the magnitude of error introduced due to low probe GC-content is sufficiently great that it is resolvable between the replicate cRNA preparations assessed in the MAQC study. A similar observation regarding probe GC-content and expression consistency was recently reported in a comparison of RNA preservation protocols that used matched samples to examine the effect on the results of downstream expression analyses. A further association with probe composition, specifically GC-content, has been reported previously in a spike-in experiment using Illumina BeadChips, in which probes with high GC-content tended to have a higher than expected signal intensity, whereas probes with lower than average GC-content had inflated differential expression statistics. We found no such association between low-expression and GC-content, although we did observe a low-signal-low-GC association; the notion of inflated differential expression statistics for low-GC probes is therefore supported not by probe intensity, but by the greater variability in our data. The low-GC effect is likely to be related to thermodynamic properties of hybridisation favouring high-GC probes/targets, a supposition that is reasonable given the deliberate high-GC bias in the design of the Illumina probesets.
Probes with low GC-content appear to be inherently more vulnerable to systematic error but, although highly statistically significant, the magnitude of this variation in our data was small relative to that between biological replicates. It is therefore somewhat unlikely that such variation would pose a threat to the accurate classification of samples and, even in an experiment in which groups of biological replicates are poorly distributed across chips and runs, is also unlikely to be a serious confound to statistical tests for differential expression.
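The relationship between probe GC-content and replicate variability discussed above can be illustrated with a short sketch. This is not the procedure used in this study: the 0.5 threshold separating "low" from "high" GC-content, the use of the median standard deviation as a summary, and the function names are all assumptions made for illustration.

```python
from statistics import median, stdev


def gc_fraction(sequence):
    """Fraction of G and C bases in a probe sequence (case-insensitive)."""
    if not sequence:
        raise ValueError("probe sequence must not be empty")
    seq = sequence.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def sd_by_gc_group(probes, threshold=0.5):
    """Summarise replicate variability for low- and high-GC probes.

    probes: dict mapping a probe ID to (sequence, replicate_values), where
            replicate_values holds at least two log2 intensities measured
            on technical replicates of the same sample.
    threshold: hypothetical cut-off; probes with a GC fraction below it
               are placed in the low-GC group.
    Returns the median replicate standard deviation of each group
    (None for an empty group).
    """
    low, high = [], []
    for sequence, values in probes.values():
        group = low if gc_fraction(sequence) < threshold else high
        group.append(stdev(values))
    return {
        "low_gc": median(low) if low else None,
        "high_gc": median(high) if high else None,
    }
```

A higher median standard deviation in the low-GC group than in the high-GC group would be consistent with the low-GC-high-SD enrichment described here, although a formal enrichment test would be needed to judge its significance.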
The location of probes with respect to the target transcript has been reported to strongly influence the correlation of expression measurements between technologies. The analyses performed here were designed to assess whether such probe-transcript mapping influenced the expression values reported by the same platform; however, no such correlation was observed between either biological or technical replicate samples. A more thorough analysis of the MAQC datasets would provide further insight into any relationship between probe location and expression across a variety of different platforms and sample-preparation procedures.