Microarray data are generated from multi-step biochemical reactions, scanning/data collection, image analysis and data processing. This process is inherently prone to variability that affects the specificity and sensitivity of the assay, thus requiring evaluation of each microarray data set [9, 10, 12]. To calibrate the sensitivity and specificity of the output data, appropriate statistical tools applied to reference sequences composed of positive and negative controls may be used for quality control of the data from a given hybridization. We argue that any procedure that uses raw intensity ratios alone to infer differential expression may be inefficient and may thus lead to excessive errors.
Since ratios are simply the result of uneven signal distributions between the Cy5 and Cy3 channels, analyzing these distributions will help interpret the biological relevance of an observed ratio. Signals are the result of specific and non-specific binding when a complex probe DNA mixture is incubated on the slide surface containing target DNA. The quality of DNA microarray data rests on the ability to measure the non-specific components of a spot signal and eliminate them from ratio analysis. Such a component analysis is not possible on spotted DNA microarrays with today's technology, and the proportion of non-specific binding will vary for each spot because of competitive binding in the presence of sequence-specific hybridization (Note: GeneChip arrays from Affymetrix using perfect vs. mismatch oligonucleotide pairs do, to a certain extent, measure the non-specific binding component of every sequence; see the technical note discussing probe length and performance, http://www.affymetrix.com/support/technical/technotes/25mer_technote.pdf). Since the influence of non-specific binding is more severe for probes where little or no specific hybridization occurs [8], we treat the problem as one of detecting a threshold value that is determined both by the highest signals attributable to spots representing non-specific hybridization and by the lowest signals from spots where sequence-specific hybridization must be assumed. Simply put, we determine a threshold separating specific from non-specific hybridization, assuming that the former usually results in stronger signals than the latter [18]. A similar approach, the so-called 'LUT based scoring system', has been reported for Affymetrix GeneChip arrays [8] (look-up tables used to check the noise level of particular chips or to filter noise).
Methods used to determine a signal threshold include the use of arbitrary fluorescence intensities [19], relative errors in Cy3/Cy5 ratios [9, 20, 21] or certain signal-to-background ratios [22]. However, these methods lack information about the specificity and sensitivity of the threshold, which are crucial parameters for estimation of the diagnostic accuracy of microarray hybridizations. To select a threshold, we have exploited a reference set of positive and negative control genes based on presence or absence of their cognate labeled cDNAs in the hybridization mix.
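The idea of scoring a threshold against a reference set can be illustrated with a minimal sketch. It assumes that a spot is called 'present' when its signal is at or above the threshold, and it reads "maximum specificity and sensitivity" as maximizing the sum Se + Sp (Youden's index); this criterion, the function names and the control signals are illustrative assumptions, not the procedure or data of this study.

```python
from typing import List, Tuple


def se_sp(pos: List[float], neg: List[float], threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when signals >= threshold are called 'present'."""
    sensitivity = sum(s >= threshold for s in pos) / len(pos)
    specificity = sum(s < threshold for s in neg) / len(neg)
    return sensitivity, specificity


def sweep(pos: List[float], neg: List[float]) -> List[Tuple[float, float, float]]:
    """Evaluate every observed control signal as a candidate threshold."""
    candidates = sorted(set(pos) | set(neg))
    return [(t, *se_sp(pos, neg, t)) for t in candidates]


def select_threshold(pos: List[float], neg: List[float]) -> float:
    """Pick the candidate maximizing Se + Sp (an assumed selection criterion)."""
    return max(sweep(pos, neg), key=lambda row: row[1] + row[2])[0]


# Hypothetical control signals (arbitrary fluorescence units), for illustration only.
positive_controls = [200, 300, 500, 800, 1500, 4000]
negative_controls = [100, 120, 150, 180, 400]

for t, se, sp in sweep(positive_controls, negative_controls):
    print(f"threshold={t:>5}  Se={se:.2f}  Sp={sp:.2f}")
print("selected threshold:", select_threshold(positive_controls, negative_controls))
```

Each row of the sweep is one point of a ROC plot (Se against 1 - Sp), so the same table can be used to read off the sensitivity and specificity of any other cut-off under consideration.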
Positive controls may be spiked RNAs from non-homologous species or transcripts known to be expressed in the sample, i.e. housekeeping genes [13, 23]. Signals from positive controls should cover the range of test signals. This can be achieved by appropriate spiking and/or selection of housekeeping genes that fulfill this criterion.
The negative controls should be chosen to lack sequence homology to test genes. However, choosing appropriate control sequences for a ROC plot analysis is crucial: we conclude that SSC spots show a distinct signal pattern different from that of plant, bacterial and viral DNA deposits. Unlike spots containing control DNA, blank spots are not representative of non-specific hybridization to cellular probe DNA, do not behave well as control spots, and should be disregarded for threshold detection on custom arrays.
The robustness of ROC analysis in yielding TM and ROC area values that discriminate 'good' from 'poor' microarray hybridizations relies on the relative positions of the signal ranges of positive and negative controls as well as of target genes. We can imagine two scenarios that make ROC analysis inappropriate for determining the threshold and/or microarray hybridization quality: (i) If the set of positive controls is in the high signal range, ROC analysis will yield a high TM and ROC areas close to 1.0 (indicating good microarray hybridization) because positive and negative signals are well separated, irrespective of the distance between the greatest observation in the negative sample and the lowest observation in the positive sample. Consequently, a large portion of target genes will be discarded because of the relatively high TM. (ii) Alternatively, if the positive controls are spiked below the detection limit of microarrays (i.e. typically 1:500,000 wt/wt), their signal range may resemble that of the negative controls. This scenario will produce a low TM and ROC areas close to 0.5, falsely indicating 'poor' hybridization.
The overall performance of individual microarray hybridizations can be assessed by the position of the receiver operating characteristic line (Figure 3) using a single summary statistic: the area under the ROC curve (Table 2). Poor microarray hybridizations have lines close to the rising diagonal (or values ~0.5), whereas the lines for 'perfect' hybridizations would rise steeply and pass close to the top left-hand corner (or values ~1.0), where both specificity and sensitivity are 100%. In high-throughput applications such as routine diagnostic examinations, where a large number of hybridizations may be performed using a standard microarray design, the ROC-plot area may be used as a 'hybridization quality checkpoint' to either accept or discard individual microarray hybridizations (for example, 0.990 for arrays 1 and 2, Table 2). A modification of the Wilcoxon rank-sum procedure may then be used as a statistical test to determine whether two ROC curves are significantly different [15].
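As an illustration of how such a checkpoint might be computed, the sketch below uses the standard equivalence between the area under the ROC curve and the Mann-Whitney statistic: the probability that a randomly chosen positive control gives a stronger signal than a randomly chosen negative control, with ties counted as one half. The function names, the acceptance value passed as `min_area` and the control signals are hypothetical and are not taken from Table 2.

```python
from typing import List


def roc_area(pos: List[float], neg: List[float]) -> float:
    """Area under the ROC curve, computed as the fraction of (positive, negative)
    pairs in which the positive control has the higher signal (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def passes_checkpoint(pos: List[float], neg: List[float], min_area: float) -> bool:
    """Accept a hybridization if its ROC area reaches a user-chosen minimum."""
    return roc_area(pos, neg) >= min_area


# Hypothetical control signals, for illustration only.
positive_controls = [200, 300, 500, 800, 1500, 4000]
negative_controls = [100, 120, 150, 180, 400]

area = roc_area(positive_controls, negative_controls)
print(f"ROC area = {area:.3f}")
print("accepted:", passes_checkpoint(positive_controls, negative_controls, min_area=0.9))
```

The pairwise loop is quadratic in the number of controls, which is unproblematic for the small reference sets typically spotted on an array.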
Among 25 thresholds calculated here, we have compared the specificity (Sp) and sensitivity (Se) of 3 commonly used cut-offs with the ROC analysis-derived threshold TM (Table 1). The median or mean of a negative control group is regarded as an adequate measure for non-specific hybridization [13]; however, due to their low specificity (~50%), we conclude that neither threshold should be used if maximum specificity is required. In such a case, we find that the widely used cut-off value defined as the mean plus two standard deviations of the negative reference sample may be used adequately for DNA microarrays. The underlying rationale of using this threshold is to establish a cut-off value providing a specificity of 97.5% [24, 25]. In our own example (Figure 1), skewness to the left makes the TX2SD overly conservative, which will sacrifice sensitivity unnecessarily. Hence, TX2SD may be inadequate, and one should use TM to adjust the sensitivity. Most importantly, the TX2SD procedure does not account for the sensitivity of the threshold. Although the ROC analysis-derived cut-off closely resembles the cut-off defined as the mean + 2 SD, it is entirely possible that choosing a finer-grained partition of the signal space would alter the relative positions of these two points. Likewise, this resemblance may be a characteristic of the 2 exemplary microarrays. Usually, cut-off selection is an informed decision that takes into account whether it is crucial to exclude any false positives (high Sp) or to cover the broadest signal range possible (high Se). Which cut-off to use depends on the objective of the experiment: if one needs to make sure that the 'present' or 'absent' call for a particular gene is correct, a cut-off with high Sp should be chosen, whereas if one is willing to accept false positives where signals are low, high Se will be the driving force.
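The trade-off described above can be made concrete with a short sketch that computes the mean + 2 SD cut-off from a set of negative controls and then measures the specificity and sensitivity it actually delivers on the reference set. The 97.5% figure holds only if the negative signals are approximately normally distributed, which the skewed hypothetical data below deliberately are not; the function names, the use of the sample standard deviation and the signals themselves are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List


def mean_plus_2sd(neg: List[float]) -> float:
    """Cut-off defined as the mean plus two sample standard deviations
    of the negative control signals."""
    return mean(neg) + 2 * stdev(neg)


def empirical_sp(neg: List[float], threshold: float) -> float:
    """Fraction of negative controls falling below the threshold."""
    return sum(s < threshold for s in neg) / len(neg)


def empirical_se(pos: List[float], threshold: float) -> float:
    """Fraction of positive controls at or above the threshold."""
    return sum(s >= threshold for s in pos) / len(pos)


# Hypothetical control signals; one high negative control skews the distribution.
positive_controls = [200, 300, 500, 800, 1500, 4000]
negative_controls = [100, 120, 150, 180, 400]

cutoff = mean_plus_2sd(negative_controls)
print(f"mean + 2 SD cut-off: {cutoff:.1f}")
print(f"Sp = {empirical_sp(negative_controls, cutoff):.2f}")
print(f"Se = {empirical_se(positive_controls, cutoff):.2f}")
```

In this toy data set the single high negative control inflates the standard deviation, so the cut-off excludes every negative control but also discards two of the six positive controls, which is the kind of sensitivity loss that the mean + 2 SD rule does not report by itself.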
The signal intensity is the most critical parameter that influences the informative value of ratio estimates [26]. Therefore, ratios should be judged based on the absolute signal intensity of each gene. To diagnose the metastatic potential of highly versus poorly invasive melanoma cells, we compared their gene expression profiles using our metastasis chip, which contains genes critical for aspects of the metastatic process, including tumor cell motility and the ability to form primitive tubular networks [17, 27]. Each ratio was tested against the null hypothesis that there is no difference between the means of the ranks of the Cy5 and Cy3 signals over 6 replicate spots representing a unique sequence. At signal intensities below the threshold with maximum specificity and sensitivity, some genes gave a ratio greater than 1.6-fold (at p < 0.05). In such a case, however, ratios are not optimal estimators because the low denominator value introduces large artifacts [8]. Therefore, we sought to determine ratio confidence categories based on the absolute signal intensities [13]. Assuming that the ROC analysis-derived threshold is an 'appropriate' cut-off for distinguishing absent/present genes, the proposed 'confidence categories' may be interpreted as follows:
(A) The gene is present in both samples, and this is the best estimate of the true ratio, while further statistical evaluation should be applied to take into account the variability of the measurements.
(B) The gene is present in one sample and absent in the other. Ratios are meaningless, but this is still an extremely significant biological result!
(C) The gene is absent in both samples. Not only are the ratios meaningless, but so are the intensity estimates.
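The three categories above amount to a simple decision rule on the two channel intensities, sketched below. The sketch assumes that a gene is called 'present' in a channel when its signal is at or above the chosen threshold; the function name, the threshold value and the example signals are hypothetical.

```python
def confidence_category(cy5: float, cy3: float, threshold: float) -> str:
    """Assign a ratio confidence category from the absolute channel signals.

    'A': present in both samples (ratio is the best estimate of the true ratio)
    'B': present in one sample only (ratio meaningless, presence/absence informative)
    'C': absent in both samples (neither ratio nor intensities are informative)
    """
    present_cy5 = cy5 >= threshold
    present_cy3 = cy3 >= threshold
    if present_cy5 and present_cy3:
        return "A"
    if present_cy5 or present_cy3:
        return "B"
    return "C"


# Hypothetical threshold and spot signals, for illustration only.
threshold = 200
for name, cy5, cy3 in [("gene1", 1500, 900), ("gene2", 850, 120), ("gene3", 90, 50)]:
    ratio = cy5 / cy3
    category = confidence_category(cy5, cy3, threshold)
    print(f"{name}: ratio={ratio:.2f} category={category}")
```

Note that gene3 in this toy example shows a 1.8-fold ratio although both signals lie below the threshold, which is the situation in which a ratio alone would be misleading.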
As a result of threshold setting, certain genes may be falsely included (= false positives) or, less frequently [5], falsely discarded (= false negatives) from further analysis (i.e. ratio calculations, clustering analysis, etc.), as exemplified here with the microarray experiment for investigating invasion in cancer. The ROC-derived threshold correctly classifies the signal for Laminin-5, γ2 as 'absent', whereas the mean- or median-derived thresholds would produce a false-positive result. Since this gene product plays a significant role in vasculogenic mimicry [17], classifying its expression level correctly is biologically crucial. ROC analysis leads to a result (type B, above) that is in line with data obtained otherwise, whereas both the mean- and median-derived thresholds would have resulted in falsely accepting a change (type A, above) in Laminin-5, γ2 expression.
Collectively, the present study demonstrated that microarray-derived signals from positive and negative controls may be used to compute accurately type I and type II errors for a series of signal thresholds. We have introduced a new model for signal threshold determination for gene expression microarray experiments that greatly eases the interpretation and comparison of these data. This model is based on analysis of signal intensities and distributions of a reference set of positive and negative controls included on each microarray. It provides a framework for determination of detection limits, confidence about fluorescent ratios and for pre-processing data for subsequent data analyses, such as cluster analyses [23, 28].