Hyperspectral imaging and multivariate data analysis
The HSS we have developed is optimized for imaging printed DNA microarrays. It excites a sample with a single laser, typically at 532 nm, while recording the emission over a wavelength range of 550–900 nm in approximately 0.75 nm increments to create a hyperspectral data cube (Fig. 1a). Details of the optical design and characterization of this line-imaging system have been published elsewhere [18]. Additional lasers are available, and the wavelength range and the spatial and spectral resolution are adjustable. The sensitivity and dynamic range of this HSS are the same as or better than those of the commercial microarray scanners we have tested for dyes emitting in the green channel of commercial systems, such as Cy3 [18]. Red-emitting dyes like Cy5 are not optimally excited by our 532 nm laser, but based on the Cy5 excitation spectrum, we are achieving ~5–8% excitation of Cy5. In comparison studies between an Axon 4000B microarray scanner exciting Cy5 with 633 nm light and the HSS exciting Cy5 with 532 nm light (data not shown), the HSS was found to be a factor of 6 less sensitive for Cy5 than the commercial microarray scanner. However, the signal acquired from Cy5 is sufficient for quantitation by the multivariate algorithms. In addition, for the studies reported in this publication the focus is predominantly on the green channel emissions. In its current configuration the HSS scanner operates at a slightly slower speed of data acquisition (scanning at a maximum rate of 0.07 mm/s for 10 μm resolution), but this disadvantage is compensated for by the more accurate quantification and increased specificity of the fluorescent signal, especially in the presence of contaminants. The readout rate of the charge-coupled device (CCD) detector is the limiting factor in the current HSS, and newly available CCD detector electronics could allow the HSS to scan at speeds up to twice the speed of commercial scanners.
A typical hyperspectral data cube contains tens of thousands to millions of spectra, and its spectral and spatial relationships cannot be readily visualized (Fig. 1a). To reduce the data to a simpler representation of the important features, HSS data are analyzed using multivariate curve resolution (MCR), a powerful factor analysis technique based on a constrained alternating least squares procedure [24]. MCR algorithms, when applied to fluorescence emission images, use correlations between variables to extract 1) representative pure-component spectra of all the emitting species (Fig. 1b) and 2) independent concentration maps depicting the spatial location of these species and their relative concentrations from the highly overlapped spectral data (Fig. 1c). The MCR analysis assumes that the number of non-noise components is known or can be estimated, and it requires initial estimates of either the spectra or the concentrations.
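The alternating least squares idea described above can be illustrated with a minimal Python sketch. It is not the algorithm used in this work [24]; it assumes the data cube has been unfolded into a pixels × wavelengths matrix D, models D as the product of a concentration matrix C and the transposed spectra S, and imposes non-negativity simply by clipping each least-squares solution (a rigorous implementation would use a properly constrained solver). The function names `mcr_als` and `gaussian_band`, the Gaussian band shapes, and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np


def mcr_als(D, S0, n_iter=200):
    """Factor D (pixels x channels) as C @ S.T with C >= 0 and S >= 0.

    S0 holds initial spectral estimates (channels x components).
    Non-negativity is imposed by clipping least-squares solutions,
    which is a simplification of a properly constrained solver.
    """
    S = np.asarray(S0, dtype=float).copy()
    for _ in range(n_iter):
        # Concentration step: least-squares fit of every pixel spectrum to S.
        C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0.0, None)
        # Spectral step: least-squares fit of every channel to C.
        S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0.0, None)
        # Remove the scale ambiguity by giving each spectrum unit length.
        norms = np.linalg.norm(S, axis=0)
        S = S / np.where(norms > 0, norms, 1.0)
    # Final concentration step so that C is consistent with the scaled S.
    C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0.0, None)
    return C, S


def gaussian_band(wavelengths, center, width):
    """Unit-length Gaussian band used as a stand-in emission spectrum."""
    band = np.exp(-0.5 * ((wavelengths - center) / width) ** 2)
    return band / np.linalg.norm(band)


# Synthetic two-emitter data set (illustrative values only).
rng = np.random.default_rng(0)
wavelengths = np.arange(550.0, 900.0, 0.75)
S_true = np.column_stack([gaussian_band(wavelengths, 570.0, 20.0),
                          gaussian_band(wavelengths, 620.0, 20.0)])
C_true = rng.uniform(0.0, 1.0, size=(500, 2))
D = C_true @ S_true.T

# Deliberately imperfect initial spectral estimates.
S_init = np.column_stack([gaussian_band(wavelengths, 580.0, 20.0),
                          gaussian_band(wavelengths, 610.0, 20.0)])
C_est, S_est = mcr_als(D, S_init)
relative_residual = np.linalg.norm(D - C_est @ S_est.T) / np.linalg.norm(D)
```

Here `relative_residual` only reports how well the two-component model reproduces the synthetic data. A small residual does not by itself guarantee that the recovered spectra are the true pure-component spectra, which is why constraints and good initial estimates matter.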
Principal component analysis, a closely related and popular multivariate analysis technique for biological data [25], typically provides an accurate determination of the number of non-noise components, but because of rotational ambiguity it often cannot provide easily interpretable pure-component spectra underlying complex spectral image data. For these studies involving microarray data, principal component analysis was used to determine the number of non-noise components and to provide initial estimates of the spectral shapes for the MCR algorithm to iterate upon. The application of non-negativity and equality spectral constraints within the MCR algorithms facilitates a physically meaningful solution for the spectral components. The benefits of MCR over more traditional univariate (band integration, band ratios) and other multivariate (least squares, linear unmixing) analysis methods include the ability to separate overlapping spectral emissions, discover the pure-component spectra with little or no information given a priori, and model unknown spectral contributions such as interferents, backgrounds, and instrument artifacts. Applications of MCR for vibrational spectroscopic data have been outlined in a recent review [26]. Recently, members of our group have developed efficient, automated multivariate statistical analysis algorithms to analyze large X-ray hyperspectral image data using desktop computers [27]. We have optimized these algorithms for fluorescence image data from the HSS [28, 29].
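As a rough illustration of how a principal component style decomposition can suggest the number of non-noise components and supply starting spectral shapes, the following Python sketch applies a singular value decomposition to synthetic data. It is a sketch under stated assumptions, not the procedure used in this work: the data are not mean-centered, the signal/noise boundary is taken as the largest ratio between consecutive singular values (a heuristic chosen here for simplicity), and the function name `estimate_components`, the band positions, and the noise level are all hypothetical.

```python
import numpy as np


def estimate_components(D, max_components=10):
    """Guess the number of non-noise components from the singular values.

    The largest ratio between consecutive singular values is taken as the
    signal/noise boundary (a simple heuristic chosen for this sketch).
    Returns the estimate and non-negative initial spectral shapes.
    """
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    ratios = s[:max_components] / s[1:max_components + 1]
    k = int(np.argmax(ratios)) + 1
    # Singular vectors have arbitrary sign and mixed-sign lobes; the absolute
    # value is a crude way to obtain non-negative starting shapes.
    initial_spectra = np.abs(Vt[:k].T)
    return k, initial_spectra


# Synthetic three-emitter image with weak detector noise (illustrative only).
rng = np.random.default_rng(1)
wavelengths = np.arange(550.0, 900.0, 0.75)
centers = (570.0, 610.0, 670.0)
spectra = np.column_stack(
    [np.exp(-0.5 * ((wavelengths - c) / 20.0) ** 2) for c in centers])
spectra = spectra / np.linalg.norm(spectra, axis=0)
concentrations = rng.uniform(0.0, 1.0, size=(400, 3))
noise = 0.005 * rng.standard_normal((400, wavelengths.size))
D = concentrations @ spectra.T + noise

n_components, initial_spectra = estimate_components(D)
```

The starting shapes returned here are mixtures of the underlying spectra rather than pure-component spectra, which is consistent with the rotational ambiguity noted above.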
Throughout this paper, emissions from within the area of the printed DNA spots are referred to as spot-specific emissions, and emissions that are not spot-specific are referred to as background emissions. Also, ratio images generated by commercial microarray scanners are referred to as R/G (Red/Green) images and images generated by the HSS are referred to as Cy5/Cy3 images. This distinction is made because the commercial scanner image is constructed from the intensities in the red and green channels regardless of source, whereas the HSS and its data analysis produce true images of the Cy5- and Cy3-labeled cDNA that are uncontaminated by emission from other species.
Spot-localized emissions: Skew toward the green channel
Earlier work has shown the presence of contaminant fluorescence in the green channel of many common types of microarray slides [17]. The HSS was used to confirm the presence of the green contaminant in slides from a variety of laboratories (four different commercial sources and in-house printed arrays from at least four independent laboratories) after hybridization. This contaminant is introduced to the slide during printing and can be recognized in the raw images as a variable, weak green channel signal that may be observed in all spots. In our investigations we examined slides from at least three laboratories printed using DMSO printing methods and did not identify a green contaminant in any of the spots on these slides.
$G_{\mathrm{total}} = G_{\mathrm{Cy3}} + G_{\mathrm{glass}} + G_{\mathrm{cont}}$ (Eq. 1)
Commercial filter-based microarray scanners confound signal from a spot-localized green contaminant with signal from the labeled cDNA and glass substrate in the green channel. This is because the green signal intensity recorded for a pixel with a spot-localized green contaminant is the sum of all the green fluorescence intensities, including the glass emission (a relatively constant small value), the contaminant (a variable value), and the labeled hybridized cDNA. This extra green intensity has no effect in the red channel and little effect on spots with high green-channel signal, but contributes significantly to the signal from spots with weak and medium green-channel intensities. As illustrated in Equation 1, data obtained from slides with spot-localized contaminating fluorescence will report green-channel intensities that are falsely high ($G_{\mathrm{total}}$ is the total signal acquired in the green channel; $G_{\mathrm{Cy3}}$, $G_{\mathrm{glass}}$, and $G_{\mathrm{cont}}$ are the signals arising from the Cy3-labeled DNA, glass substrate, and contaminant, respectively). Unfortunately, the effect of spot-localized contaminant emission cannot be corrected in commercial scanner microarray data using standard background correction procedures, which estimate the background signal for a spot from the signal around the spot, and thus it leads to errors in the calculated expression ratios (R/G). The detrimental effect of the contaminant decreases with increasing green-channel intensities but can easily account for skews toward the green channel, dye-gene effects, unsuccessful dye-flip experiments, and highly variable low-intensity data observed in many microarray experiments.
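The intensity dependence of this skew can be made concrete with a small Python sketch of Equation 1. All intensities are hypothetical arbitrary fluorescence units, the function names are illustrative, and the sketch assumes an idealized local background correction that removes exactly the glass contribution while leaving the spot-localized contaminant in place.

```python
def green_total(g_cy3, g_glass, g_cont):
    """Equation 1: total green-channel signal recorded in a spot pixel."""
    return g_cy3 + g_glass + g_cont


def observed_ratio(red, g_cy3, g_glass, g_cont):
    """R/G after subtracting a local background equal to the glass signal."""
    corrected_green = green_total(g_cy3, g_glass, g_cont) - g_glass
    return red / corrected_green


# Hypothetical spots whose true Cy5/Cy3 ratio is 1.0 at three intensities,
# all carrying the same amount of spot-localized contaminant.
G_GLASS = 50.0
G_CONT = 200.0
ratios = {}
for g_cy3 in (100.0, 1000.0, 10000.0):
    red = g_cy3  # true ratio of 1.0
    ratios[g_cy3] = observed_ratio(red, g_cy3, G_GLASS, G_CONT)

for g_cy3, ratio in ratios.items():
    print(f"Cy3 = {g_cy3:7.0f}  observed R/G = {ratio:.2f}")
```

Under these assumptions the weakest spot reports a ratio of about 0.33 instead of 1.0, while the brightest spot is nearly unaffected, which mirrors the skew toward the green channel described above.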
Figure 2 shows the results of multivariate data analysis of a hyperspectral image from a microarray slide with contaminating fluorescence. Because the HSS can isolate the contaminant emission spectrum, true pure-component concentration maps can be generated and the resulting images of the Cy3-labeled cDNA are contaminant-free. Using the optical filter functions of the Axon 4000B microarray scanner, HSS images can be scaled to match the total intensity of each of the commercial scanner channels, thus providing a direct comparison to the commercial scanner results. These scaled concentration maps are used to calculate an accurate Cy5/Cy3 image that gives spot intensity values without contributions from the glass and contaminant emissions. Comparing the R/G ratio image constructed from the commercial scan (Fig. 3a) with the more accurate Cy5/Cy3 ratio image of the same area on the slide (Fig. 3b) shows that the difference is dramatic, with 75% of the spots having ratios in error by a factor of 2 or more due to the presence of the green channel contaminant. These errors would change the basic conclusion drawn from the data: the commercial scan suggests that most genes in the test sample were down-regulated relative to the control when, in fact, the correct conclusion is that most genes in the test sample were either up-regulated or not differentially expressed. The fluorescence contribution of the contaminant cannot be corrected with current commercial technology due to extreme spectral overlap and the variability of the spot-localized contaminant concentration. Image thresholding could be useful if the contaminant contribution were known and fairly constant, but even low levels of contaminant fluorescence would require that high thresholds be set. For example, a maximum of 100 arbitrary fluorescence units of spot-localized contaminant signal would require a threshold of 500 arbitrary fluorescence units in an attempt to keep errors in the 20% range.
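The thresholding arithmetic in the example above can be written out as a short Python sketch. It assumes one particular reading of the error, namely the fraction of the measured green signal that is due to the contaminant; the function name `min_threshold` is hypothetical.

```python
def min_threshold(max_contaminant, max_fraction):
    """Smallest measured green signal at which the contaminant contributes
    at most max_fraction of that signal."""
    return max_contaminant / max_fraction


MAX_CONTAMINANT = 100.0  # arbitrary fluorescence units, from the example above
threshold = min_threshold(MAX_CONTAMINANT, 0.20)

# Contaminant share of the signal for a spot sitting exactly at the threshold.
fraction_at_threshold = MAX_CONTAMINANT / threshold

print(f"threshold = {threshold:.0f} units, "
      f"contaminant share = {fraction_at_threshold:.0%}")
```

Under this reading, every spot with a measured green intensity below the threshold would have to be discarded, which illustrates why even a modest contaminant level forces a high threshold.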
Spot-localized emissions: Dye separation
"Dye separation" is a phenomenon referred to in the microarray literature as a ring of one color surrounding the other, typically red around green [16]. Although this phenomenon is seen in published images, there is no satisfactory explanation for what causes the labeled cDNA to hybridize with a dye-specific spatial pattern. Other spatial patterns of hybridization, e.g., doughnuts and coffee rings are well documented, but spatial anomalies theoretically should affect both dyes within a spot to the same extent.
Apparent dye separation was visible in some spots on the microarray slides we examined with the green contaminant. Figure 4a shows an R/G ratio image from a typical spot exhibiting dye separation scanned on an Axon 4000B scanner. From this image, the Cy5-labeled cDNA appears to have hybridized in a circle around the area where the Cy3-labeled cDNA hybridized. HSS analysis revealed that the spot-specific signal was from three sources: Cy5-labeled cDNA, Cy3-labeled cDNA, and green contaminant. The individual concentration maps generated from the multivariate analysis of the HSS image data demonstrate that, for this spot, the diameters of Cy5 and Cy3 hybridization are equal, although little Cy3 is present (Fig. 4c–d). The contaminant emission, which is brighter than the Cy3, is present in a smaller diameter circle (Fig. 4e), and the result of this localization difference is a spot with a bright green center and a red ring on commercial scanners. This size difference most likely occurs because of differences in surface tension of the printed cDNA and contaminant when drying or differences in charge of the cDNA and contaminant. Figure 4b shows an accurate Cy5/Cy3 image of this spot created from the HSS concentration maps. The original Axon data produced a ratio of Cy5/Cy3 medians of 3.0 and a ratio of Cy5/Cy3 means of 3.0, while the more accurate HSS data give rise to ratios of medians and means of 7.7 and 7.5, respectively, for this particular spot.
Background emissions
In microarray experiments, mean or median intensities are calculated per spot to determine expression ratios. These spot intensity values are typically corrected for background emission using methods that subtract local or global estimates of background contributions. Various background correction methods exist in data analysis software, but all of these methods make one critical assumption about the data: that the background emissions are the same outside the spot as they are under the spot. This assumption is valid in an ideal situation where a perfectly homogeneous glass slide is the sole source of background emissions and the printed DNA spot is sufficiently thin and non-scattering so as not to interfere with the excitation of the glass beneath it. Unfortunately, this assumption is rarely valid. Researchers have shown that the results of microarray experiments are very dependent on the background subtraction methods used and have theorized that local background values are not representative of the true background emission in a spot, leading to erroneous values and negative spot intensities [30]. Options such as not correcting for background, using a global background value for every spot, or using negative control values as a background can be successful in some cases, but are not robust.
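A small Python sketch shows how local background subtraction can produce a negative spot intensity when the background around a spot differs from the background under it. The pixel values are hypothetical arbitrary fluorescence units, and the function `local_background_corrected` is an illustrative stand-in for the median-based local correction found in typical analysis software, not a reproduction of any specific package.

```python
from statistics import median


def local_background_corrected(spot_pixels, surround_pixels):
    """Spot median minus the median of the pixels surrounding the spot."""
    return median(spot_pixels) - median(surround_pixels)


# Hypothetical intensities (arbitrary fluorescence units).
true_signal = 50.0              # dye signal actually bound in the spot
background_under_spot = 100.0   # background emission beneath the spot
background_around_spot = 200.0  # higher background emission around the spot

spot_pixels = [true_signal + background_under_spot + d
               for d in (-5.0, 0.0, 5.0)]
surround_pixels = [background_around_spot + d for d in (-8.0, 0.0, 8.0)]

corrected = local_background_corrected(spot_pixels, surround_pixels)
print(f"true signal = {true_signal}, corrected estimate = {corrected}")
```

With these values the corrected estimate is negative even though the spot carries a positive true signal, because the correction subtracts a background that is not present under the spot.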
Using the HSS, we have explored background emissions on many printed microarrays spanning a variety of preparation protocols. The unique ability to identify and isolate all emission sources and model the background for each pixel directly from the spectra allows us to generate pure-concentration maps of the dyes of interest without contributions from background sources. In every microarray we have scanned (9+ different labs, in-house and commercially printed slides) the background was different under the spot than around the spot. This difference can vary from a subtle decrease in intensity under the printed spot to a much more predominant intensity variation that seriously affects the accuracy of the data. Understanding the background emissions is essential to ensuring that appropriate background correction techniques are used when scans are to be performed with commercial scanners. This increased understanding of background emissions also provides the feedback necessary to alter the preparation process to minimize the background emissions present on microarray slides.
Figure 5a shows the R/G ratio image of a portion of a microarray with spatially variable, high background emissions and poor spot-specific signal in the red channel. Although only a small area is shown, the entire array contains significant variations in background intensities and patterns. Multivariate analysis of the spectra from an HSS image determines that the emission outside of the spots is predominantly from Cy5-labeled cDNA, but also includes contributions from Cy3-labeled cDNA as well as the glass substrate. In this case the non-specific interaction of the labeled cDNA with the glass substrate is most likely caused by inadequate blocking procedures. Non-specifically bound cDNA is not expected to contribute to the intensities within a spot, and therefore local or even global subtraction methods would lead to spot intensities that are too low and even negative. The critical advantage of HSS analysis in this situation is that pure emission spectra are quantified and spot intensities from these images do not need separate background correction.
Two additional points should be noted about Figure 5. First, in the HSS Cy3 concentration map, Cy3 is present both in and outside of the printed cDNA spots. This is in contrast to the Cy5-HSS image, in which Cy5 is absent from many spots. We have observed a slight spectral shift in the Cy5 emission maximum under these conditions compared to other slides with successful Cy5 hybridization, suggesting that the Cy5 outside the spots may contain significant amounts of residual, unincorporated dye. Cy5 and Cy3 both exhibit a spectral shift in emission maxima upon incorporation into cDNA. This effect is illustrated in Additional File 1.
Additional File 1 compares the MCR-extracted spectra from a spotted array containing only Cy5-dCTP in the spots with those from a second spotted array containing only Cy5-cDNA in the spots, showing how the molecule to which Cy5 is attached can affect the Cy5 emission maximum. Second, the HSS image of the glass concentration shows another phenomenon: the glass intensity is lower under very bright spots (spots with intensities > 6000 arbitrary fluorescence units on the commercial scanner we utilized) than outside of the spots. This difference is slight but consistent. We believe this is due to scattering or absorption of the laser light by the printed spot, decreasing the irradiance of the glass beneath the DNA spot.
We also observed several microarrays with high background in smears across the slide (data not shown). In those cases, the spectra from the smears did not resemble Cy3 or Cy5 but, instead, were from a green contaminant. The smears persist across the surface of the spot and contribute additional signal intensity in the Cy3 channel. No background correction method can adequately correct for this emission, and unless the contribution from this contaminant is modelled and removed from the data, the analyzed microarray data will not be reliable.