The rapid increase in the number of completely sequenced genomes in the past few years has generated much effort in functional genomics, particularly studies seeking to assign biological functions to DNA sequences. Comparative gene expression profiling is widely used to study the functional role of genes. The DNA microarray assay provides an invaluable technique for large scale expression analysis. In the two-channel DNA microarray assay, RNA from two samples is reverse transcribed to cDNA and labeled with two distinct fluorescent dyes before being co-hybridized to immobilized DNA strands on a microarray slide. Spotted arrays currently being used can be divided into two groups based on the nature of immobilized DNA used: one in which the immobilized DNA is comprised of both sense and antisense strands (usually PCR product) and the other where the immobilized DNA is single stranded consisting of only the sense strands (usually, chemically synthesized oligonucleotides). During hybridization, the two fluorescently labeled cDNA samples compete for hybridization to the immobilized strands. Hybridization reactions between complementary strands occur only between the labeled antisense strand and immobilized sense strand. The ratio of the intensities of the two fluorescently labeled cDNAs is used to quantify the relative levels of transcripts in the two samples [1, 2]. This method serves well for pair-wise comparison of transcript levels in two samples. With over ten thousand different DNA species immobilized on the microarray, the relative transcription level of all the corresponding genes in the two samples can be obtained in a single assay.
DNA microarrays have found applications in gene discovery, disease diagnosis, pharmacogenomics and toxicology research. They are increasingly used for a series of related samples, for which a comparison across all samples and all genes is desirable. When a large number of samples are to be compared, a combinatorial approach pairing all possible pairs (or at least a number of combinations of pairings of the sample) is often taken. This results in a large number of microarrays, requiring a large amount of each RNA sample. A 'loop design', where every sample is directly compared to two other samples to form a closed loop, has been proposed to overcome this problem [3, 4]. The ratios calculated using a loop design have variable levels of precision since some samples are more directly related than others [5]. When a new sample is to be inserted into the earlier 'loop', RNA for at least two of the previous samples is needed to pair with the new sample to form a new node in the closed loop.
Another approach to tackle the issue of a large combinatorial pair-wise comparison is the 'reference design' [3] in which a common reference sample is introduced with which all RNA samples are hybridized. Two possible universal references are RNA pooled from various samples and genomic DNA [6]. For a given set of samples, pooled RNA provides an excellent reference. However, if the experimental conditions change, the possibility arises that some new transcripts may not be represented in the initially-pooled RNA.
Genomic DNA is an attractive candidate for use as a common reference. It is isolated from cells or tissue and sheared to fragments in a narrow range of length. It is easier to prepare, maintain and reproduce, as compared to RNA. It is especially useful for microorganisms, which lack repetitive sequences in their genome, and microarrays using genomic DNA as a reference have been used to identify genes differentially expressed in various growth stages of Mycobacterium tuberculosis [7]. It has also been recently reported that the data obtained using genomic DNA as a reference in microarray experiments with Arabidopsis thaliana employing 70-mer oligonucleotide microarrays was in agreement with ratios obtained from direct hybridizations [8]. Genomic DNA samples, isolated from stationary phase cultures where the chromosome is not being replicated, have the same representation of all genes as in the genome. Since the transcript level of each gene is being referenced to its own representation on the genome (for most genes, it is a single copy), the relative expression can be compared across different genes in the sample (i.e., from the same hybridization) as well as across different samples. The use of microarrays using genomic DNA in a range of applications including genomic diversity studies [9, 10] and aneuploidy detection using comparative genomic hybridization [11] has been demonstrated.
In the conventional two-channel cDNA hybridization [1], both the cDNA samples are antisense strands. The probability of hybridization between strands in solution is very low. On the other hand, using sheared genomic DNA as a reference, hybridization between complementary sense and anti-sense strands can occur in solution between the complementary genomic DNA strands, and between genomic DNA sense strands and their cDNA counterparts. The number of strands lost to hybridization in the solution phase may differ for different RNA species as well as for the two complementary strands of the same species. This may result in decreased fidelity in the ratio of cDNA to genomic DNA as a representation of gene expression level. With the complexity of hybridization in both the solution phase and the immobilized surface phase, and between double strands of genomic DNA and single strands of cDNA, it is difficult to assess the effect of using genomic DNA as a common reference. Adapting a mathematical model we developed previously to assess diffusional constraints on DNA microarray assay [12], we have constructed a kinetic model for microarray hybridization and predicted the effectiveness as well as potential pitfalls of using genomic DNA as a reference. We examined the effect of the various parameters that may affect surface hybridization. Here we report the framework of the model and our findings.
Mathematical model for microarray hybridization
A DNA microarray consists of thousands of spots, each spot containing DNA strands of known sequence immobilized on an impermeable surface. We simulate hybridization reactions for one spot on the microarray, which is considered to be a sample chamber with two compartments; a spot phase and a solution phase (Figure 1). The spot phase is the small volume in which the bound strands are assumed to be present at a uniform concentration. The solution phase, which comprises a vast majority of the volume of the microarray chamber, contains only fluorescently labeled single strands at the beginning of the hybridization and double strands formed by hybridization of complementary mobile single strands during the course of the hybridization.
Five kinds of single-stranded DNA molecules are present in the system: labeled single stranded cDNA (S ample) reverse transcribed from RNA, denoted as S; the two identically labeled complementary (anti-sense and sense) strands from genomic DNA (R eference), denoted as R and R'; and the non-labeled anti-sense and sense strands immobilized on the array surface (B ound strands) denoted by B and B'. The nomenclature used in this study is summarised in Table 1. The. The double-stranded species are denoted by combining the constituent single strand symbols, for example RB' denotes the labeled complex formed by hybridization of genomic DNA antisense strands to the sense bound strands. We assume that each phase is well-mixed, and hence all mobile species are present at uniform concentrations within each phase.
We consider the case that PCR products are used for immobilization; thus, both sense and antisense strands of probe DNA are immobilized. We assume that the two complementary bound strands do not hybridize to each other [13] due to the immobilization procedure used and neglect the formation of BB'. The mobile species in the two compartments are considered to move across the phase boundary at a rate proportional to the difference in the concentrations of identical species present in the two phases. The proportionality constant is the effective mass transfer coefficient for the transport of mobile DNA strands and is estimated from the diffusivity as discussed below.
The model equations take the form of a mass balance on each component in each phase that accounts for the change in concentration due to reactions within that phase and transport between the two phases. Non-specific hybridization in both phases is neglected. All these equations are of the form:
where
= R',
= B', A1 = R, A2 = B, A3 = S, subscript p denotes either solution or spot phase and p' is the other phase; kf is the forward rate of hybridization; kb is the backward rate of hybridization; kt is the rate of transport across the spot-phase solution-phase boundary; Vt is the total volume of the sample chamber and Vp is the volume of phase p.
Model parameters
Model geometry
The diameter of the spot phase is set to 0.01 cm as seen in a typical microarray spotted on poly-lysine slides [14], with a height of 2 × 10-4 cm (Figure 1). This gives a spot volume of 0.2 × 10-12 l. Typically, labeled strands are resuspended in a volume of 4.5 × 10-5 l, and this volume is applied under a cover slip of area 6.5 cm × 2.4 cm. For this geometry, the height of the microarray chamber is about 2.9 × 10-3 cm. The solution phase volume is the difference between the sample chamber volume (4.5 × 10-5 l) and the spot volume (0.2 × 10-12 l). It has been shown before that only the mobile strands within a radius of 0.15 cm from the spot are available for hybridization due to transport effects [12]. Hence, we have assumed the solution phase to have a diameter of 0.15 cm and a height of 2.9 × 10-3 cm corresponding to the height of the microarray chamber.
Hybridization rate constants
The hybridization rate constants are assumed to be identical for DNA strands from both genomic DNA and cDNA. However, the rate constants for reaction between two mobile strands may differ from the rate constant for the reaction of a mobile strand and a bound strand. A forward rate constant (kf) for DNA hybridizations in solution of 106 M-1 s-1 [15, 16] was used for simulations. The rate constant of hybridization could be slower for hybridization of mobile strands to immobilized strands (kf-bound) on a solid surface and values in a range up to 100-fold lower than kf have been used in simulations. We also discuss the effect of this reduced forward rate constant of hybridization between mobile and bound strands for a range between 106 M-1s-1 and 104 M-1s-1. The backward rate constant for dissociation of DNA double strands as calculated from equilibrium constants reported in literature ranges from 10-3 s-1 to 10-1 s-1 [17, 18]. Simulations were carried out using a backward rate constant ranging from 10-1 s-1 to 0 s-1 (irreversible). In all simulations performed, the deviation from the true value was greatest when the hybridization reaction is set to be irreversible. The results discussed below are for the case with the backward rate constant set to 0 s-1. The results obtained when the reaction is set to be reversible are very close to the true value under all conditions tested.
Rate constant for transport
The rate constant for transport (kt) is estimated from the diffusion coefficient as
, where D is the diffusion coefficient, h is the height of the spot phase and r is the radius of the spot phase. The diffusivity of DNA single strands in solution has been estimated to be 10-7 cm2/s [19–21] which leads to an estimate of 1 s-1 for kt. The simulations reported in the next section were carried out under both very fast transport (kt = 10-3 s-1) and very slow transport (kt = 10-3 s-1) conditions. The transport rate has no effect on the performance of the assay in the range tested (data not shown).
Initial single-strand concentrations
From the yeast transcriptome data published by Velculescu et al. [22], the mass percentages of mRNA that belong to the rare, intermediate and abundant classes are estimated to be 65.2, 32.4 and 2.4% respectively. For an mRNA sample of 0.2 μg used in each microarray assay, the total amount of rare, intermediate and abundant species are 0.13 μg, 0.07 μg and 4.8 × 10-3 μg respectively. To convert those numbers to molar concentration we calculate the number of genes within each abundance class using the intensity data from microarray experiments of S. coelicolor. The S. coelicolor transcripts were divided into three abundance classes using the following intensity cut-offs: Intensity < 2000 as rare, 2000 < Intensity < 20000 as intermediate and >20000 as abundant sequences. Normalized intensity values were used for this estimate. 72.8%, 26% and 1.2% of all genes were classified rare, intermediate and abundant respectively. The corresponding number of rare, intermediate and abundant species are 5697, 2034 and 94 respectively. Assuming a sample volume of 4.5 × 10-5 l, 100% reverse transcription efficiency, and an average strand length of 1000 we calculate the rare species concentration as ~0.1 pM, intermediate species concentration ~1 pM and abundant species concentration ~20 pM.
Previous experimental reports of the use of genomic DNA as a reference for microarray hybridizations used genomic DNA concentrations ranging from 0.1 μg to 4 μg for M. tuberculosis [7]. Since M. tuberculosis has ~4000 genes, this translates to a concentration of each gene from ~1 pM to 34 pM, which is the range of genomic DNA concentrations used for simulations.
To calculate the concentration of the immobilized species, we assume the concentration of DNA in the spotting solution to be 0.1 g/l and that 10-9 l of the solution is spotted on the microarray. We also assume that 75% of the DNA thus spotted is washed away in the microarray post-processing steps. 2.5 × 10-11 g DNA remaining on the microarray is uniformly distributed in the 0.2 × 10-10 l spot phase volume. Assuming an average DNA strand length of 1000 bp, this is approximately equal to an immobilized strand concentration in the spot phase of 10-6 M.