Allelotyping of pooled DNA with 250 K SNP microarrays

Background Genotyping technologies for whole genome association studies are now available. To perform such studies at an affordable price, pooled DNA can be used. Recent studies have shown that GeneChip Human Mapping 10 K and 50 K arrays are suitable for the estimation of the allele frequency in pooled DNA. In the present study, we tested the accuracy of the 250 K Nsp array, which is part of the 500 K array set representing 500,568 SNPs. Furthermore, we compared different algorithms to estimate allele frequencies of pooled DNA. Results We could confirm that the polynomial-based probe-specific correction (PPC) was the most accurate method for allele frequency estimation. However, a simple k-correction, using the relative allele signal (RAS) of heterozygous individuals, performed only slightly worse and provided results for more SNPs. Using four replicates of the 250 K array and the k-correction with heterozygous RAS values, we obtained results for 104,141 SNPs. The correlation between estimated and real allele frequency was 0.983 and the average error was 0.046, which was comparable to the results obtained with the 10 K array. Furthermore, we could show how the estimation accuracy depended on the SNP type (average error for A/T SNPs: 0.043 and for G/C SNPs: 0.052). Conclusion The combination of DNA pooling and analysis of single nucleotide polymorphisms (SNPs) on high density microarrays is a promising tool for whole genome association studies.


To find new susceptibility loci for complex diseases on the human genome, a high number of case and control samples is required. An old approach with a new perspective is the pooling of cases and controls. The larger the number of analyzed SNPs, the more striking are the advantages of a pooling study. With advanced microarray technology it is now possible to analyze SNPs throughout the whole genome. With the Human Mapping 500 K array set from Affymetrix and the BeadChips from Illumina, over 500,000 SNPs can be genotyped on two arrays. Different groups have tested the reliability of Affymetrix microarrays for pooling studies with either the 10 K array [1][2][3][4][5][6] or the 50 K array [7,8]. On these arrays, each SNP is interrogated by 40 probes (20 on the plus and 20 on the minus strand). On the 250 K arrays, over 90% of the SNPs are represented by only 24 probes (some SNPs are only on the plus or the minus strand). This reduction of probes, as well as the reduction of the feature size from 18 μm (10 K) and 8 μm (50 K) to 5 μm (250 K), could have a negative influence on the outcome of pooling results. To examine whether this is the case, we tested the Nsp I 250 K array, which represents 262,264 SNPs and is part of the 500 K array set. According to the Data Sheet from Affymetrix, over 85% of the human genome is covered by SNPs within 10 kb distance with this array set. If allelotyping of pooled DNA is feasible with these arrays, whole genome association studies including thousands of samples could be performed within a few weeks in a cost-effective manner.


Results


10 K array

To assess the measurement error in our lab, we estimated the allele frequency in a pool of 26 DNA samples previously genotyped in our lab with the 10 K array. We calculated the allele frequency with three methods (see Material and Methods). As reference data for the correction of unequal allele signals, we took either data generated in our lab ("our") or data from other labs ("web" or "brohede"). From 10,561 SNPs on the 10 K array, the allele frequency of 3,574 SNPs could be estimated with all three methods. In Table 1, we show the mean and median error (absolute difference between known and estimated allele frequency), the correlation coefficient between known and estimated allele frequency, and the standard deviation (SD) between the four replicates. As expected, the estimates were better when using the reference data generated in our lab. The PPC method was the most accurate method, with a mean error of 0.043. However, the k-correction with heterozygous RAS values gave only slightly worse results, with an error of 0.046. In comparison with other methods, the PPC is the only algorithm that uses only perfect match data. To elucidate whether the k-correction can be improved by utilizing just perfect match data, we set all mismatch cell intensity values in the original cell files to zero. Then we derived a perfect-match-RAS and reanalyzed the data using the k-correction with heterozygous references. The resulting estimates gave an average error of 0.108. Applying a second-degree polynomial to these perfect-match-RAS values reduced the error to 0.054. However, for "normal" RAS values the second-degree polynomial did not improve the error.
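
The k-correction with heterozygous RAS values mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in the study. It assumes that RAS is defined as A/(A+B), that k is the A:B signal ratio derived from the mean RAS of the heterozygous reference samples, and that the RAS of the pool replicates is averaged before correction. The function names and all numbers are hypothetical.

```python
from statistics import mean


def k_from_heterozygotes(het_ras_values):
    """Estimate k for one SNP from the RAS values of heterozygous (AB) references.

    Assumption: RAS = A / (A + B), so k = RAS_het / (1 - RAS_het) is the
    A:B signal ratio observed when both alleles are present in equal amounts.
    """
    ras_het = mean(het_ras_values)
    if not 0.0 < ras_het < 1.0:
        raise ValueError("mean heterozygous RAS must lie strictly between 0 and 1")
    return ras_het / (1.0 - ras_het)


def k_corrected_frequency(pool_ras, k):
    """Convert a pooled RAS value into an estimated frequency of allele A."""
    return pool_ras / (pool_ras + k * (1.0 - pool_ras))


# Hypothetical values for a single SNP (not taken from the study).
het_ras = [0.58, 0.62, 0.60]                 # AB reference samples
pool_replicates = [0.70, 0.72, 0.69, 0.71]   # four pool replicates

k = k_from_heterozygotes(het_ras)
estimate = k_corrected_frequency(mean(pool_replicates), k)
print(f"k = {k:.3f}, estimated frequency of allele A = {estimate:.3f}")
```

A useful sanity check of this form is that a pool whose RAS equals the mean heterozygous RAS is mapped to a frequency of 0.5, whatever the value of k.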


250 K array

From the 262,264 SNPs on the Nsp 250 K array, the rs numbers of 195,158 SNPs could be identified from the HapMap CEPH population (NCBI_Build35). We excluded 137 SNPs (3 on Chr. 1, 128 on Chr. 2, 6 on Chr. 16) which had inconsistent genotype information in the two sources (e.g. rs1364648, Affymetrix annotation: A/G, minus-strand; HapMap data: C/G, plus-strand). From the remaining SNPs, 122,754 had a 100% call rate in the 88 HapMap samples. For the evaluation, 104,141 SNPs could be used because they had at least one "AB" genotype (required for k-correction) in the 56 reference samples genotyped in our lab. Table 2 shows the mean error, the correlation coefficient between known and estimated allele frequency, and the standard deviation between the pool replicates. We also specified how the accuracy depended on the number of pool replicates, the number of reference RAS values (with AB genotype), the minor allele frequency, and the SNP type. As expected, we found that the mean error decreased with the number of pool replicates. The mean error also decreased with the number of "AB" reference samples, and with an increasing minor allele frequency. To see whether the error improves with higher allele frequencies only because of a higher number of "AB" references or vice versa, we adjusted for both parameters and found the same trend. We could further show that the estimation of the allele frequency in A/T SNPs was significantly less accurate than in G/C SNPs (p < 0.001). The same trends were found for the 10 K array (results not shown).
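
The accuracy measures reported in the tables (mean and median error, correlation coefficient, and standard deviation between replicates) could be computed along the following lines. This is a sketch under stated assumptions: the replicate estimates are averaged per SNP before the error is taken, the correlation is Pearson's r, and the SD is the sample standard deviation. The text does not specify these details, and the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, median, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def accuracy_summary(known, replicate_estimates):
    """Summarize estimation accuracy over a set of SNPs.

    known: known allele frequency per SNP.
    replicate_estimates: per SNP, the estimates from each pool replicate.
    """
    pooled = [mean(reps) for reps in replicate_estimates]
    errors = [abs(k - p) for k, p in zip(known, pooled)]
    return {
        "mean_error": mean(errors),
        "median_error": median(errors),
        "correlation": pearson(known, pooled),
        "mean_sd": mean(stdev(reps) for reps in replicate_estimates),
    }


# Hypothetical data for three SNPs with two replicates each.
known = [0.10, 0.50, 0.90]
replicates = [[0.12, 0.16], [0.48, 0.52], [0.85, 0.87]]
summary = accuracy_summary(known, replicates)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

To stratify by minor allele frequency, SNP type, or number of "AB" references, the same function can be applied to each subset of SNPs separately.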

For the reference samples, arrays with less than 93% call rate were excluded. For pooled DNA, however, the call rate normally is around 80%, because many SNP frequencies lie between homozygous and heterozygous frequencies. To test whether the call rate can be partially explained by the detection rate (MDR), we plotted the call rates against detection rates from 100 Nsp and 100 Sty arrays previously analyzed with individual DNA in our lab (Figure 1). According to the regression curve, a call rate of 93% corresponds to a detection rate of about 97.8%. One of our 250 K arrays (hybridized with pooled DNA) had a detection rate of 96.7%. It was therefore considered to be of poor quality and was excluded. This array also had a significantly poorer accuracy (error: 0.075). In the other four arrays (with MDR >99.2%), a high MDR also correlated with a low error (see Figure 2).
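
The idea of translating the call rate threshold into a detection rate threshold can be illustrated with a simple regression. This is only a sketch: the text speaks of a "regression curve" without giving its form, so the straight line fitted here is an assumption, and the data points are hypothetical values chosen for illustration, not the measurements behind Figure 1.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def passes_qc(array_mdr, cutoff):
    """A pooled-DNA array is kept only if its detection rate exceeds the cutoff."""
    return array_mdr > cutoff


# Hypothetical (detection rate %, call rate %) pairs from individual-DNA arrays.
detection_rate = [96.5, 97.5, 98.5, 99.5]
call_rate = [88.0, 92.0, 96.0, 100.0]

slope, intercept = fit_line(detection_rate, call_rate)

# Invert the fitted line to find the detection rate that matches a 93% call rate.
mdr_cutoff = (93.0 - intercept) / slope
print(f"detection rate cutoff: {mdr_cutoff:.2f}%")
print("array with MDR 96.7% passes:", passes_qc(96.7, mdr_cutoff))
```

With real data, the fitted relationship and therefore the cutoff would depend on the arrays used for calibration.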


Discussion

With our data from the 10 K array, we could confirm that of the three tested methods, the PPC algorithm [1] gave the best estimates. Compared to other methods, this algorithm (a) utilizes the signal intensities from individual probes (not RAS values); (b) takes only data from the perfect matches; (c) applies a second-degree polynomial to correct for unequal hybridization; and (d) uses reference information from all three genotypes (AA, AB, BB). Our results suggest that none of these features alone is responsible for the good performance of the PPC algorithm, but rather the combination of all of them. However, the need for all three genotypes in the reference samples limits the [...]

Table 1 footnote: The errors are based on estimates from 3,574 SNPs which could be analyzed by all methods. *Data used for normalization: "our" = 34 individuals analyzed in our lab; "Brohede" = 26 individuals analyzed in the lab of Brohede et al. [1]; "web" = >3,000 individuals analyzed in the lab of Craig et al. [9]; files are available under [15].

[...] where k is calculated from pooled data of all perfectly matching probes. To avoid the use of reference data in a case-control study with pooled samples, it is also possible to directly compare the signal intensities of the perfectly matching probes between cases and controls, as shown by Macgregor et al. [7]. In this study, the use of a correction for unequal hybridization signals had only little effect upon the results. However, even slight improvements can be important for finding low-susceptibility genes in pooling studies.

Despite the reduction of the feature number and feature size, the absolute error between real and estimated allele frequency with the 250 K array was as low as the one for the 10 K array when using Simpson's k-correction. The correlation between real and estimated allele frequency was even higher with the 250 K array, and the standard deviation was lower. However, our results from the 10 K and the 250 K array are not directly comparable, because (a) pools were constructed from different DNA samples, (b) the experimental protocol was different, (c) different scanners were used for the two chips, and (d) the software used for data extraction was different.
As shown in Table 2, the accuracy of the allele frequency estimation improved with the number of pool replicates. However, the absolute error decreased by only 0.001 between three and four replicates. Therefore, we assume that the addition of further technical replicates would not substantially improve the accuracy. In our study, we used pools of identical samples. However, for a case-control study, it might be advantageous to use pools of independent samples to capture the variance between the individuals. In this case, an increase of replicates can improve the accuracy. With an increasing number of "AB" references, the error decreased to 0.024 when 35 references were present. In our study, the mean error was smaller when the minor allele frequency was higher. This was also true for the 10 K results using the PPC algorithm, which is in contrast to the results published by Brohede et al. [1], where the best estimates were obtained at minor allele frequencies <0.1. Interestingly, the accuracy of A/T SNPs was found to be significantly worse than the accuracy of G/C SNPs on the 250 K array. This is probably due to the higher affinity of the G-C hydrogen bond compared to the A-T bond. For the stability of the entire hybridization complex, an unspecific hybridization with "A" or "T" is relatively less important than with "G" or "C". Here we analyzed only one of the two 250 K arrays from the 500 K set. The only difference between the two arrays is the cleavage site in the first fragmentation step. Therefore, we assume that both arrays, Nsp and Sty, perform equally well.

Pooling of samples has several disadvantages compared to a case-control study analyzing individual genotypes: (a) associations which do not result in a significant change of the allele frequency can be overlooked; (b) measurement errors can lead to false results; (c) stratification of the population by age, sex, disease subtype, etc. has to be done before the analysis; (d) haplotype analysis is only possible under certain conditions [10,11]; and (e) analysis of gene-gene interactions cannot be performed. However, with advancing technologies and algorithms, the mean measurement error can probably be reduced to values <0.03 [1,4]. The use of linkage information should improve the likelihood of finding "real" associations and help detect false-positive SNPs. Taking the HapMap information (Build 35) for the 10 K array, we found ~30% of the SNPs to be linked to their downstream SNP (LOD >3); with the 500 K array set it was ~50%. With this high linkage, the allele frequency of one SNP can be partly explained by the allele frequency of a linked SNP. To take advantage of this fact, two recent publications propose to use p-value combinations in a sliding-window concept [9,12]. With an increasing number of analyzed SNPs and better linkage information, most haplotypes can be explained by individual SNPs [13].
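
A sliding-window combination of p-values, as mentioned above, might look like the following sketch. The combination rule used here (Fisher's method) is an assumption chosen for illustration; the cited publications may combine p-values differently. Fisher's method also assumes independent tests, which linked SNPs violate, so the combined values are better read as ranking scores than as exact p-values. The window width and the input p-values are hypothetical.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine p-values with Fisher's method (assumes independent tests).

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # this is X/2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


def sliding_window(p_values, width):
    """Combined p-value for every window of `width` neighbouring SNPs."""
    return [
        fisher_combined_p(p_values[i:i + width])
        for i in range(len(p_values) - width + 1)
    ]


# Hypothetical per-SNP p-values ordered by chromosomal position.
snp_p = [0.40, 0.03, 0.01, 0.04, 0.70]
combined = sliding_window(snp_p, width=3)
print([f"{p:.4f}" for p in combined])
```

In this toy example, the window covering the three small neighbouring p-values receives the smallest combined value, which is the behaviour the sliding-window concept is meant to exploit.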


Conclusion

We think that DNA pooling might be a useful and affordable tool for detecting new candidate genes for genetic diseases, especially at a whole genome level. However, this has to be proven in future association studies with pooled DNA.


Methods


DNA pooling and microarray analysis

The determination of the DNA concentration in the individual DNA samples was done with PicoGreen reagent (Molecular Probes) using a standard curve of λ-DNA.

From each sample, 50 ng genomic DNA was taken for the pool construction. For the 10 K array, we pooled 26 samples that were individually genotyped before with the 10 K array. For the 250 K array, we pooled 88 samples from the HapMap CEPH population, whose genotype information is available at the HapMap homepage [14]. From individual or pooled samples, 250 ng DNA was analyzed on the GeneChip Human Mapping 10 K Xba 131 array or the 250 K Nsp array (Affymetrix) according to the manufacturer's protocols. For each array type (10 K and 250 K), four replicates of the same DNA pool were processed and hybridized on four different days. Imaging of the microarrays was performed using either the GCS3000 scanner (10 K array) or the upgraded GCS3000-G7 scanner (250 K array) from Affymetrix. Genotype calls and probe intensity data were extracted with the GDAS software using default parameters (10 K array) or the GTYPE software from Affymetrix, setting the call threshold for homozygous and heterozygous calls to 0.26 (250 K arrays). For individual DNA, only arrays with a call rate >93% (as guaranteed by Affymetrix) were included in the study. For pooled DNA, only arrays with a detection rate (MDR) >97.8% (corresponding to a call rate of >93%, see Results) were used for the allele frequency estimation.

One array had to be repeated because of its low MDR (96.7%).
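
The equal-mass pooling step described above (50 ng of genomic DNA per sample) amounts to a simple volume calculation once the concentrations are known. The following sketch illustrates that arithmetic; the sample names and concentrations are hypothetical, and practical details such as minimum pipetting volumes or pre-dilution are not covered.

```python
def pooling_volumes(concentrations_ng_per_ul, mass_ng=50.0):
    """Volume (in microliters) of each sample that contains `mass_ng` of DNA."""
    volumes = {}
    for sample, concentration in concentrations_ng_per_ul.items():
        if concentration <= 0:
            raise ValueError(f"non-positive concentration for {sample}")
        volumes[sample] = mass_ng / concentration
    return volumes


# Hypothetical concentrations, e.g. as measured against a standard curve.
concentrations = {"sample_01": 25.0, "sample_02": 50.0, "sample_03": 100.0}

volumes = pooling_volumes(concentrations)
total_mass_ng = 50.0 * len(concentrations)
pool_concentration = total_mass_ng / sum(volumes.values())

for sample, volume in volumes.items():
    print(f"{sample}: {volume:.2f} µl")
print(f"pool concentration: {pool_concentration:.2f} ng/µl")
```

Because every sample contributes the same mass, each individual contributes equally to the allele frequencies measured in the pool, provided the concentration measurements are accurate.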


Estimation of allele frequency with the 10 K array


Estimation of allele frequency with the 250 K array

Figure 2. Graph showing the correlation between detection rate (MDR) and the error (absolute difference between estimated and known allele frequency). Each cross stands for one 250 K array, all hybridized with the same DNA pool.

For the 250 K arrays, the k-correction proposed by Simpson et al. was used to estimate the allele frequencies [6]. Heterozygous RAS values were taken from a set of 56 arrays (all with call rates >93%), which were previously analyzed with individual DNA in our lab. The average RAS values as well as the discrimination scores were calculated from the cell intensity data using the "R" script from Meaburn et al. [8], which is freely available [16]. We excluded RAS values from the four pools which had discrimination scores <0.04, as described by Meaburn et al. [8]. The discrimination score (DS snp) is a measure of unspecific hybridization used in the 10 K MPAM mapping algorithm (see Affymetrix GeneChip DNA Analysis Softw