Our model for DNA hybridization on microarrays is comprehensive. It includes parameters that are specific to the chip design and to the processing protocol. These parameters are probe length, temperature, salt concentration, wash stringency, target DNA size, and the parameters that are related to the scanners dynamic range. Furthermore, the NN values can be adjusted to accommodate DNA-RNA and RNA-RNA binding. The algorithm that calculates Keq values for probe target complexes is applicable to many array designs as well as to other binding reactions in thermodynamic equilibrium. Thus, this algorithm can provide valuable information on binding affinities and cross-hybridization for a wide array of applications including probe design.
Our approach to probe intensity calculations takes into account many of the “problems” inherent to microarrays such as batch effects and probe synthesis failure. Preparation of target DNA for microarray experiments often poses many technical challenges; consequently, the concentration of DNA products often varies between experiments as does the wash strength and average fragmentation size. Furthermore, probe synthesis efficiency varies between individual chips, and different batches of arrays. Our model directly tackles this problem by fitting the concentration as well as the nucleotide synthesis parameters for each chip. It is important to note that all the chips analyzed are Affymetrix chips with 25 bp long probes; therefore, the chip synthesis aspects of our model may be somewhat specific to the company’s manufacturing process. Our model fits one parameter for the concentration of each target molecule, which limits its use for arrays that quantify transcript levels. However, the Keq information from our model is applicable to any type of array.
For all the chips that were analyzed, the mean correlation between expected and observed probe intensities is 0.701, with average correlations for each chip ranging from 0.881 to 0.550. Furthermore, extreme intensity values do not dominate the correlation terms (Figure
4). For the FMR1 design chips the correlation for replicated probe spots (probe spots with the exact same probe sequence) is 0.906. Similar measurements for correlation between probe spots have been previously reported
[41, 42]. Thus, if we view ~90% as the maximum possible correlation between expected and observed intensity, because observed intensity varies this much between replicated spots within a single chip, our model can be seen to be doing a very good, but not quite a perfect job of predicting probe intensities.
Our model has helped us understand some puzzling observations regarding microarrays: the difference in intensity between complimentary forward and reverse probes; the larger decrease in intensity for mismatches towards the center of a probe; and batch effects.
First, assuming simple liquid phase kinetics, probe spots that target the forward DNA strand should have equal amount of bound target DNA as the complementary spot that targets the reverse DNA strand. However, the intensity of the forward and reverse strands usually have systematically different intensities, with probes targeting the forward strand of a given genomic region being brighter/darker than the set of probes that target the reverse strand of the same genomic region
. Under our model this observation is simply the result of probe synthesis failure or sites becoming abasic after synthesis. This claim is supported by the fact that the A,C,G, and T synthesis parameters are statistically different from each other for each of the different chip designs (Additional file
1: Table S2). The estimates for A,C,G, and T incorporation rates (Table
3) are similar to previously published estimates of synthesis failure for Affymetrix arrays
. Furthermore, this explanation for the difference between forward and reverse strand probes is far more parsimonious than relating this difference to G-stacks
, given that this difference is observed in A/T rich probes that have no stretches of the same amino acid. There are two factors that we did not model, but could partially contribute to the observed difference between forward and reverse probes. For one there are base analogues that are often used to manufacture probes, this difference can easily be incorporated into our model by revising the NN numbers used for the Keq calculations. Second there could be an entropy penalty to base pairs that are closer to the array surface.
In our model, probe synthesis failure along with wash and fragmentation differences explain batch effects for microarrays. Batch effects can be subdivided into two types, those that happen during manufacturing of the chip and those that happen during the processing of the chip. In our model, the former is explained by differences in the efficiency of probe synthesis, while the latter is explained by differences in the fragmentation, washing steps, and as the rate of abasic site formation.
Another previously puzzling observation is the correlation between the distance of a mismatch from the edge of the probe and its effect on probe intensity. Mismatches towards the center have a larger effect than mismatches towards the edges
. Under our model this observation is expected (Figure
6), and is the result of fragmentation of the target molecule. Intuitively, there are more fragments that can bind the center of a probe, than fragments that can only bind a single edge (Figure
1). Hence, if a target molecule contains a mismatch, its effect will be proportional to its distance from the middle of the probe. In our model, simple hybridization kinetics can explain these puzzling observations without the need to assign different weights along the probe sequence nor a penalty to probes with a mismatch.
Using our model we get an average correlation of ~70% between observed and expected probe intensity. This includes data from all probe spots regardless of their quality. Even so, our model comes close (0.881) on some chips, but never achieves our theoretical maximum correlation (0.906). From the manufacturing process up until the reading of probe spots intensities, the microarray experiment has several complex steps. Our model makes several “simplistic” assumptions that allowed us to develop efficient algorithms. In doing so we made several compromises. One of these assumptions is that probes with two or more errors do not bind target DNA. This assumption should have a relatively small effect on the correlation for 25 bp long probes; however, it is expected to have a larger effect on the correlation for longer probes. In our model we use the Langmuir isotherm to calculate the fraction of bound probes and do not take into account probe surface density
[16, 17], non-equilibrium, and low target/probe ratio
. Theoretically, commercial arrays of 25 bp long probes should have reached equilibrium at the end of the hybridization step; however, equilibrium might not be achieved by the end of the washing period; therefore, our approach to modeling the wash step of the protocol is rather simplistic. Furthermore, in these arrays the target/probe ratio should be very large. When arrays deviate from this ideal scenario our model loses predictive power. It is important to note that even though the probe surface density is not directly modeled by our approach, the parameters we use to describe the scanners dynamic range can indirectly be used to adjust for microarrays with varying probe surface densities. This “adjustment” however has some limitations. The most obvious one being if increasing probe density affects mismatched sequences and matched sequences disproportionately
Overall our data suggests that the Langmuir isotherm appropriately and efficiently models binding between probe and target DNA on a microarray; however, other more computationally intensive measurements for binding on arrays have been proposed
[39, 44]. Furthermore, when we calculate target-probe binding we do not account for the known in solution effects of dangling ends
 and the stabilizing effect of mismatches in the last three base pairs of a sequence
[21–25]. We also do not model secondary structures that can form on arrays with long probes or arrays that hybridize to targets with extensive secondary structures, for example rRNA arrays
Other details of the microarray experiment that are left out of our model are bleed-through between features and regional artifacts such as air bubbles, scratches, and miscellaneous particles. There are two sources of bleed-through between features: one, the probes at the edge of a feature may have a hybrid sequence due to incorporation of nucleotides during the synthesis of the neighboring probe; two, the scanner may be detecting light from neighboring features and falsely determining its origin. If this were going on, its effect would be most noticeable in probes that would otherwise be very dark, and appears to be present (Figure
4) in our data, where a substantial fraction of probes that are predicted to have very low intensities appear to have much higher than expected intensity.
Our approach to modeling probe synthesis failure also has some limitations. First, there is the possibility that the concentration parameters are confounded with the synthesis efficiency parameters. This can be a problem when dealing with G/C or A/T rich PCR products. Second, the synthesis efficiency of a nucleotide can potentially be dependent on its position on the probe
For our calculations we assume that the target DNA has the reference sequence. This assumption is never completely valid because each individual almost surely has a unique sequence that may differ slightly or even significantly from the genomic reference sequence. The impact of this assumption on the correlations for the analyzed data depends on the type of genetic variation of the samples. When the target DNA only has SNPs and/or other one base pair changes, then the genetic variation is unlikely to have a large impact on the average correlation over the entire chip; however, if the target DNA has large CNVS and/or several CNVs then these genetic variants would be expected to have a significant effect on the average correlation for the chip.
Our model is most applicable to 25 bp long arrays that are designed to detect genetic variation. With this in mind, the obvious next step is to apply our model to SNP arrays with the goal being to better determined which genetic variants are present, by incorporating our model into a variant calling algorithm.