Microarray-based estimation of SNP allele-frequency in pooled DNA using the Langmuir kinetic model

Background High throughput genotyping of single nucleotide polymorphisms (SNPs) for genome-wide association requires technologies for generating millions of genotypes with relative ease but also at a reasonable cost and with high accuracy. In this work, we have developed a theoretical approach to estimate allele frequency in pooled DNA samples, based on the physical principles of DNA immobilization and hybridization on solid surface using the Langmuir kinetic model and quantitative analysis of the allelic signals. Results This method can successfully distinguish allele frequencies differing by 0.01 in the actual pool of clinical samples, and detect alleles with a frequency as low as 2%. The accuracy of measuring known allele frequencies is very high, with the strength of correlation between measured and actual frequencies having an r2 = 0.9992. These results demonstrated that this method could allow the accurate estimation of absolute allele frequencies in pooled samples of DNA in a feasible and inexpensive way. Conclusion We conclude that this novel strategy for quantitative analysis of the ratio of SNP allelic sequences in DNA pools is an inexpensive and feasible alternative for detecting polymorphic differences in candidate gene association studies and genome-wide linkage disequilibrium scans.


Background
Single nucleotide polymorphisms (SNPs) represent the most genetic variation in the human genome, and are thought to have a promising future in a wide range of human genetics applications such as pharmacogenomics, population evolution, functional genomics, forensic and identification of genes responsible for the susceptibility of complex diseases. It has been suggested that 30,000-500,000 SNPs are required for a whole-genome association study [1,2]. Accurate determination of allele frequencies of such a large number of SNPs in a large number of human samples is an unusual challenge in the whole genome association studies for genetic alterations of low relative risk [3]. It not only involves heavy workload, unusual amount of time and cost, but also a large amount of DNA of each sample. Because only very few markers are expected to show linkage and/or association in family data, a simple, highly efficient and cost-effective screening approach to identification of genetic markers showing linkage and/or association is highly desirable. Using pooled DNA samples may significantly facilitate meeting this goal since hundreds of DNA samples can be reduced to a single sample. Although pooling DNA samples may result in a loss of information of haplotype information, it is still appealing because of the tradeoff of the significant reduction in the amounts of effort and cost.
A number of approaches used for SNP genotyping have been used to estimate allele frequency in pooled DNA samples. These include primer extension followed by DHPLC [4], allele-specific amplification with real-time PCR [5], BAMPER [6], TAQMAN™ and RFLP analysis [7], dynamic allele-specific hybridization (DASH) [8], Mas-sARRAY™ [9], mass spectrometry [10], pyrosequencing [11,12], SSCP [13], the amplification refractory mutation system (ARMS) [14], and DNA microarrays [15,16]. However, most of these methods are based on substantial post-PCR processing, and for one or very few SNPs for one pooling sample as a time. In this report, we describe a new microarray-based method for estimating the allele frequency in pooled DNA samples based on the physical principles of DNAs immobilization and hybridization on solid surface. This method well suits large-scale genetic association study, and has a number of advantages: capability of scaling up both in the numbers of SNPs and pooled samples (cases and controls) by utilizing microarray platform, assay of thousands of SNPs on one chip under uniform conditions, employing only two universal fluorescently labeled tags for thousands of SNPs, and no post-PCR processing.
The physical principle of hybridization on chip surface, modeled as a Langmuir adsorption process, has been extensively studied in recent years. However, most studies concentrate on the kinetics and thermodynamics of hybridization for gene expression assay [17,18] and genotyping [19]. To our knowledge, the present study is the first application of the Langmuir function to SNP allelefrequencies estimation using pooled DNA and microarray.
Six SNP markers, two, ESR1E-U11 (T/C, rs11155816) and ESR1F-U21 (A/G, rs9340799) in the ESR1 gene, one, TGFB1D-U2 (G/C, rs1800471) in the TGFB1 gene, and three, HBB17 (A/T, c.102 A > T), HBB28 (T/C, rs33931746) and HBB26 (C/T, rs33950507) in the HBB gene, were employed to demonstrate the feasibility of this approach. The estrogen receptor encoded by ESR1, is a ligand-activated transcription factor composed of several domain important for hormone binding, DNA binding and activation of transcription. It has widely been demonstrated that ESR1 polymorphism is associated with breast cancer and bone mineral density. TGFB is a multifunctional peptide that controls proliferation, differentiation, and other functions in many cell types. It is well known that TGFB plays an important role in human diseases. Mutations in the TGF-beta pathway are responsible for many biological processes in cancer development. The increased expression and a polymorphism of TGFB1 have also been associated with abdominal obesity in humans. The inherited blood disorder β-thalassemia is caused by mutations in the HBB gene, which markedly decreases or completely prevents the production of β-globin chains. It is the most common inherited single-gene disorders in the world with the highest prevalence in many areas of southern China including Taiwan, and has been and/or remains endemic.

Principle and Design of the Method
Six SNP markers, two, ESR1E-U11 (T/C, rs11155816) and ESR1F-U21 (A/G, rs9340799) in the ESR1 gene, one, TGFB1D-U2 (G/C, rs1800471) in the TGFB1 gene, and three, HBB17 (A/T, c.102 A > T), HBB28 (T/C, rs33931746), and HBB26 (C/T, rs33950507) in the HBB gene, were included to demonstrate the feasibility of our new approach. When the PCR product from a pooled DNA sample containing the two alleles at a given ratio for an SNP is spotted onto the glass slide surface, competition between the two allelic sequences may occur during surface immobilization (Figure 1) because the spot has limited binding capacity determined by the active groups and steric hindrance. As a result, the amounts of the two allelic sequences bound to the spot should be proportional to the initial amounts in the PCR product. When two fluorescently labeled allele-specific probes hybridize to their corresponding allelic sequences in the spot, the intensities of the two fluorescent colors in the spot should reflect the relative amounts of allelic sequences in the pooled DNA sample.
Two allele-specific probes were designed for each SNP. Each probe was composed of a specific domain (~15 bp) to hybridize with its allelic sequences at their 5'-ends, and a 3'-segment consisting of one of two 20-bp universal sequences to hybridize with a fluorescence-labeled universal tag. The two universal tags (Cy5-TTACGTGATGG-TAATAGTGCTG and Cy3-TTAGGTCGTAGGTGCGTTAG AT) labeled with different fluorescent dyes, were designed to perfectly match their corresponding universal sequences of the SNP probes to form a "sandwich" architecture with the PCR product. The two universal tags were used for all SNPs under analysis. The sequences of the probes and the universal tags were carefully selected to avoid formation of tag-to-tag hybridization and hair-pin structures, which may interfere with the hybridization between tags and their probes, and between probes and their allelic sequences [20]. The strategy of tag-probe-alleles sandwich complex is compatible with high-throughput analysis, flexible in experimental design, and cost effective.
Based on the analysis of the thermodynamics of immobilization and hybridization process on the solid surface, we Schematic illustration of the competitive immobilization and hybridization processes on glass surface using the strategy of tag-probe-allele sandwich structure Figure 1 Schematic illustration of the competitive immobilization and hybridization processes on glass surface using the strategy of tag-probe-allele sandwich structure. The single-base variation is indicted by letters. Fluorescent dyes are represented by small red or green spheres.
constructed the Langmuir-type isotherm model to integrate the two processes. The Langmuir-type parameters of the model were obtained from a series of allele frequency experiment. Several theoretical considerations and experimental constraints were imposed as follows: (a) the immobilization process was viewed as a reversible competitive bimolecular surface reaction (adsorption and desorption) between DNA molecules and the glass surface; (b) the immobilization was allowed to reach equilibrium under the defined experimental conditions; and (c) all PCR products were diluted to the same concentration before printed onto the slides. Under these conditions, we can deduce the equations to characterize the relationship between the concentrations and of the allele sequences, A and B, immobilized to the glass surface and the initial concentrations of these alleles in the spotted solution as follows [21,22], where the superscripted "i" represents the surface immo- Generally, the hybridization kinetics on the solid surface is represented by the familiar relationship, It is well known that the simplest model for DNA hybridization on a chip is the Langmuir kinetic model for adsorption [23,24]. Langmuir adsorption theory for microarray is based on the assumption that there are two competing processes: the adsorption process, which is binding the probe molecules to the immobilized DNA molecules to form duplexes, and desorption, which is the reverse process of duplexes dissociating into separate molecules. Thus the form of the equation results in a non-linear relation between the concentrations and the signal intensities. In our method, some theoretical considerations and experimental constraints are imposed to fit the conditions required by the Langmuir regime: (a) the labeled tag-probe is in large excess compared to allelic sequences immobilized on the surface; (b) the tag-probe concentration near the slide surface is assumed to be constant and equal to the bulk concentration during hybridization process;.(c) the hybridization reaction can achieve equilibrium.
If we neglect some effects such as that secondary structures and cross-hybridization, and there is only specific binding between a given probe and its complimentary allelic strand, and between labeled tag and its corresponding probe, we can deduce that the amount of labeled tag-probe bound to the complementary immobilized allelic sequence is proportional to the fluorescence intensity. Then the two fluorescence intensities on the same spot could be translated into the relative amounts of the two alleles in the pooled DNA sample. By recasting the result to yield the background-corrected intensity, I Combining with the Eqs. (5) and (6), we can rewrite the Eqs. (10) and (11)  Assuming that the allele concentration ratio (c A /c B ) remains constant in the spotting solution and the initial pooled DNA sample, which means the alleles are equally amplified. By transforming and combining Eqs. (12) and (13), we can determine the pool allele frequency by the following equations, ( 1 5 ) where f A and f B represent the alleles A and B frequencies in the pooled DNA sample.

Characterization of the kinetics of microarray
In order to determine the values of K a (K A or K B ) and I R , a series of pooled reference DNA samples with different ratios between the two allelic sequences of each of the six SNPs at different proportions for the six SNPs (ESR1E- HBB28(T/C), and HBB26(C/T)) were prepared from two DNA samples homozygous for the two alleles (see Method). For each SNP, fourteen pooled reference DNA samples were prepared with allele ratios ranging from 0% to 100%. The samples were amplified by PCR, purified, brought to a final total concentration of 50 μM, and printed onto a glass slide. Each sample was spotted five times for evaluating spot-to-spot variation. The fabricated array was then incubated in a hybridization solution containing Cy3/Cy5 labeled tag-probe duplexes , ,

c A c A c B I A I R B I B K B I A I R B I B K B I B I R A I A K A
Average background-corrected signal intensity as a function of allele spotting concentration for the six SNPs which were in large excess over the number of immobilized allelic molecules on the array. The plots of the allele concentrations in spotting solution versus average background-corrected fluorescence intensities for the six SNPs were analyzed using the Langmuir-type model (Eqs. 12 and 13). Results are shown in Figure 2.
The values of constants, K a and I R , for the six SNPs derived from the best-fit Langmuir isotherms are presented in Table 1. I R is the maximum fluorescence intensity, representing the saturated hybridization signal for the immobilized allele. In general, I R is determined by hybridization conditions, immobilization conditions, and surface chemistry of the slide, but independent from the spotting concentration if the amount of allelic sequences in the spotting solution is in a great excess over the absorbance capacity of the slide surface. In Eqs. (12) and (13), K a is the observed apparent equilibrium constant of reaction between the probes in hybridization solution and the allele DNA in spotting solution. It is determined by immobilization affinity (K a (i) ), hybridization affinity (K a (h) ), surface chemistry (R, surface concentration of active group), and the total spotting concentration (c R ). As shown in Table 1

Allele frequency estimation for pooled DNA samples from synthetic DNAs and plasmids
Since experimental factors such as immobilization condition, hybridization, and surface chemistry may affect the values of K a and I R , it is better to include the reference samples with known allele frequencies as controls in each subarray. With the values of K a and I R derived from the experimental data, it is possible to estimate the allele frequencies in the pooled samples. This was demonstrated by using pooled DNA samples with different ratios (frequency from 5% to 100%) between the two allelic sequences, prepared by mixing two DNA samples homologous from either allele from synthetic or plasmid DNA stocks. The allele frequencies of all samples were calculated using Eqs. (14) and (15). As summarized in Table 2, the mean standard deviations (SD) between the five replicates for the six SNPs were from 0.23 to 0.26 with a maximal range of 0.011 to 0.041 for each SD. These results confirm the reliability of our method.

Allele frequency estimation of pooled genomic DNA samples
We also investigated the validity of our method for pooled genomic DNA samples from clinical specimens using the same protocols described above for the synthetic and plasmid DNA stocks. Ten genomic DNA samples with different ratios between allelic sequences were prepared by pooling 100 individual genomic DNA samples of known genotypes for SNP HBB28(T/C). The fraction of the minor allele "C" ranged from 2% to 10%, with 1% increments. Twenty replicas of each pooled sample were measured to test the repeatability of the method. The variations are expressed as ± SD and ± standard error of the mean (SEM). Results are listed in Table 3. As shown, SDs ranged from 0.005 to 0.010, and SEM ranged from 0.0044 to 0.0103, indicating that our method is highly reproducible.

Robustness of the method
In an association studies, it is critical to learn which markers with their allele frequencies significantly different among populations. The simplest strategy is to compare between results from the pooled samples of all cases and those from pooled samples of all controls. In more complex design, creating sets of sub-pools allows stratification, not only on the basis of the disease trait but also on secondary and tertiary traits as well. In these cases, it is very important to detect minor differences in allele frequencies between pools or sub-pools. To evaluate sensitivity of our method, the significance level of the differences in the allele frequencies between the two pooled samples with the closest allele frequencies (Table  Table 4. As shown, samples with a difference in their allele frequencies as small as 1% could be discriminated (P < 0.0078) if the sampling error is negligible. Therefore, our method provides a very powerful tool for association studies using pooled samples. Figure 3 is a scatter plot that summarizes all allele frequency estimation data ( Table 2 and Table 3) for the six markers. The estimated pooled allele frequencies are in good agreement with the results of individual genotyping. For the genomic DNA samples (Table 3), the estimates and known values of allele frequencies are highly correlated (r 2 = 0.9917). For all assays in the present study, the known and the measured allele frequencies were highly correlated with a correlation coefficient of 0.9992 (P < 0.01). The mean value of the differences between known and measured frequencies was 0.007. These results indicate that our method is highly accurate and reproducible.
According to the data shown in Table 3, it is possible to apply our method to determining allele frequencies ≥ 2%.
To obtain a high degree of accuracy and sensitivity, several factors need to be taken into consideration. These include the number of samples to pool, the volumes of DNA to transfer by pipetting, microarray preparation and hybridization. For this reason, the reproducibility for the quantification of this method was evaluated. We investigated the sampling and measurement errors [5] using samples with allele frequencies of 0.05 and 0.10 for HBB28(C). Each sample was repeated 20 times which were subdivided into four groups, five each. Then we calculated the measurement error using the standard error of the mean s m = S.D of measurements no. of measurements , where the number of measurements was five, the measurement error is independent of sample size. The expected sampling error can be expressed as (f = allele frequency and n = sample size) [25], and the equation of is for the estimated combined sampling and measurement error. Results for the two allele frequencies are shown in Figure 4. With an actual allele frequency of 0.05, the measurement error was ± 0.0077 (Panel (a), Figure 4). For an actual allele frequency of 0.10, the measurement error was ± 0.0103 (Panel (b), Figure 4). Results in Figure 1 show that when the sample size is smaller than 400, the sampling error is greater than the measurement error at allele frequency of 0.05; when the sample size is smaller than 436, the sampling error is   Scatter plot demonstrating the accuracy of allele frequency estimation using pooled DNA samples Figure 3 Scatter plot demonstrating the accuracy of allele frequency estimation using pooled DNA samples. The chart is drawn based on the results for allele frequency listed in Table 2 and Table 3 versus the known frequencies prior to PCR amplification for the six SNPs. Short horizontal bars are error bars. The diagonal line shows complete concordance between known and observed allelic fractions.
Comparison of expected sampling errors and experimental errors of the frequencies for the two alleles of SNPs HBB28 Figure 4 Comparison of expected sampling errors and experimental errors of the frequencies for the two alleles of SNP HBB28. The chart was drawn based on the results in Table 3. (a) The solid line is the expected sampling error for SNP site (HBB28) for the allele frequency of 0.05 for sample size up to 1000. The upper broken line is the estimated combine sampling and measurement error for this method based on Table 3 (see text). The lower broken line is measurement error. (b) The same as (a) for the allele frequency of 0.10 described in Table 3.
greater than the measurement error at allele frequency of 0.10. Thus the measurement error will be dominant in allele frequency estimation for a large sample size such as n > 500. Generally, the measurement error would be smaller than the statistical error of sampling at a large sample size. To obtain a reliable allele frequency, the sample size ≥ 500 is necessary for an allele frequencies greater than 0.10.

Discussion
Identification of genetic loci associated with genes responsible for susceptibility to complex human diseases with a clinically available sample size is still a major challenge for whole genome association study. In addition to including a large number of SNPs, the chance of detecting significant association also requires a very large number of samples owing to the low phenotypic effect of the genes involved in multifactorial diseases. Although, high-throughput genotyping techniques are readily available for handling large sample size and a large number of genetic markers, the cost is still very high. Therefore, methods for estimation of SNP allele frequencies in such studies should be amenable to scaling-up both in the number of loci and in the number of samples. DNA pooling is a well established method, which can vastly reduce the amounts of effort, labor and cost involved in large-scale association studies [26]. Pooling allows one to estimate the allele frequency among large numbers of individuals by examining a single or much fewer samples, reducing the workload from hundreds of samples to one or very a few. Instead of genotyping the large numbers of SNPs in individual samples on Affymetrix SNP chip for genome-wide association scans, a large number of papers have addressed pooling the DNA from large numbers of individuals [27][28][29][30].
In the present study, we report a new microarray-based strategy to estimate the allele frequencies of pooled samples. The method is highly sensitive and can be used to analyze a large number of markers and multi-pools simultaneously. Association studies with multifactorial subdivision strategy provide a powerful tool for the study of complex diseases and quantitative traits, influenced by disease heterogeneity, gene-gene and gene-environment interactions. Our method meets the need for the association studies using pooling strategy more elaborate than the two-pool (case vs. control) design. The populations of cases and controls can be stratified on the basis of secondary and tertiary traits to construct a series of sub-pools. In addition to some traits, factors such as age at onset, sex, lifestyle, or other clinical descriptions can also be used for categorization. This might capture effects of environmental factors that are known to affect the disease trait in question. The association studies with multifactorial subdivision strategy provide a powerful tool for the study of complex common disease and quantitative traits, influenced by the effects of disease heterogeneity, gene-gene and gene-environment interactions.
Genotyping pooled samples based on subdivision using microarray significantly reduces workload and cost with similar statistical power compared with genotyping individual samples. Assuming 1000 case and 1000 control samples with 100 candidate SNPs for a disease are to be studied and 100 samples in each pool, using our method, the 2000 PCR products can be analyzed by a single microarray, and a single hybridization reaction. Its statistical power is equivalent to 2 × 10 5 individual genotyping (100 SNPs × 1000 cases + 100 SNPs × 1000 controls), which is over a 100-fold reduction. In addition, we introduce the strategy of employing two universal florescence-labeled (Cy3 and Cy5) tags to form the sandwich structure with SNP detection probes and allelic sequences. Regardless the number of SNPs that need to be analyzed, only two universal florescence-labeled tags are needed. Therefore, it vastly reduces the experiment cost.
The fluorescent intensities of the two colors on a microarray spot are not directly proportional to the allele frequencies in pooled DNA samples. Differences in various aspects of oligonucleotide hybridization make it difficult to estimate allele frequencies based on the fluorescence signals. We have tackled this problem by theoretically deducing a Langmuir-type formula for each allele. This Langmuir-type isotherm model integrates the thermodynamics of immobilization and hybridization processes on microarray surface, and describes the relationship between the fluorescence intensity and allele concentration in the spotting solution. Biased amplification of different alleles may occur when the size range of microsatellite alleles is sufficiently great. However, no significant biased amplification with SNP alleles (only onebase difference which is not size difference) was observed in our experiments. Therefore, allele frequencies in the spotting solution can be directly considered as allele frequencies in pooled DNA.
Our method has advantages over other published protocols in its high sensitivity and capability to detect minor difference between allele frequencies. Compared to determination of allele frequencies in individual pools, it is more important to learn whether the allele frequencies among pools and sub-pools are significantly different from each other. The experimental results demonstrate that our method can successfully distinguish allele frequencies differing by 0.01 in the actual pool of clinical samples (P < 0.0078). Our results also demonstrate that alleles with a frequency as low as 2% can be detected. Our approach exemplifies the reproducibility of measurement with the mean divergence between individual and pooled allele frequencies of 0.7%, ranging from 0.05% to 2.2%. The observed SEMs varied from 0.0044 to 0.0103.

Conclusion
In conclusion, we have developed an accurate, robust and high-throughput method to estimate the allele frequency in a large number of samples by pooled DNA samples followed by PCR amplification and microarray analysis. The dynamics of the immobilization and hybridization of the PCR products on the solid surface was studied. The kinetics of these two processes was integrated to estimate the allele frequency in pooled DNA by establishing a Langmuir-type kinetic model. Our approach is inexpensive, efficient and capable of detecting interesting polymorphic differences in candidate gene association studies and genome-wide linkage disequilibrium scans.
cassette (Telechem International). After hybridization, the slide was sequentially rinsed in 2 × SSC/0.1% SDS and in 2 × SSC at 37°C, 5 min for each step, and dried by centrifugation. The slides were then scanned using a Genepix 4000B scanner (Axon Instruments, Foster City, CA, USA) at a resolution of 5 μm with 100% excitation intensity with PMT set to 55% and 48% for Cy5 and Cy3 channels, respectively. Spot analysis and quantification of the original 16-bit TIF images were performed with the Genepix software (v 5.0).

Statistical analysis
Standard deviation (SD) was calculated for the signal intensities of the repeats of each pooled sample. The significance levels of the differences between the allele frequencies for the all comparison were analyzed by the Kolmogorov-Smirnov test. The P value was used as a measurement of the degree of significance between the measured and known allele frequencies. The statistical software Systat 12.0 (Systat Software Inc., San Jose, CA, USA) was used to perform the data analyses.