With our data from the 10 K array, we could confirm that, of the three tested methods, the PPC algorithm [1] gave the best estimates. Compared to the other methods, this algorithm (a) utilizes the signal intensities from individual probes (not RAS values); (b) takes only data from the perfect-match probes; (c) applies a second-degree polynomial to correct for unequal hybridization; and (d) uses reference information from all three genotypes (AA, AB, BB). Our results suggest that none of these features alone is responsible for the good performance of the PPC algorithm, but rather the combination of all of them. However, the need for all three genotypes in the reference samples limits the number of SNPs that can be estimated. Another disadvantage of this method is the time-consuming computation in Perl and R, which has so far made it impossible to apply the algorithm to our 250 K data. For the Nsp 250 K array, we used the k-correction with heterozygous RAS values. This algorithm performed only slightly worse than the PPC algorithm. It was the simplest of the tested algorithms and yielded estimates for more SNPs, because homozygous calls were not required. The algorithm proposed by Craig et al. [9] also uses RAS values and includes reference information from all three genotypes, which should improve the estimation. However, this method gave the worst estimates for our data set. The algorithm used by Kirov et al., with a reported average error of only 0.014 on 10 K arrays, might improve the allelotyping accuracy for 250 K arrays. Instead of using heterozygous references, it derives the correction coefficient k from RAS values of a pool with known allele frequencies. This algorithm was not applied here, because it requires a second independent DNA pool with known allele frequencies. Future studies can use our k values (supplied as Additional Material) for allele frequency estimation on the 250 K Nsp arrays.
However, results for SNPs with a very low or very high frequency in the reference pool may not be reliable. Another approach could be to combine the PPC algorithm with the algorithm of Kirov et al., calculating k from pooled data of all perfect-match probes. To avoid the use of reference data in a case-control study with pooled samples, it is also possible to compare the signal intensities of the perfect-match probes directly between cases and controls, as shown by Macgregor et al. [7]. In this study, correcting for unequal hybridization signals had only a small effect on the results. However, even slight improvements can be important for finding low-susceptibility genes in pooling studies.
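The k-correction discussed above can be illustrated with a minimal Python sketch. It assumes the commonly used form of the correction, in which RAS = A/(A+B) is the relative signal of allele A, k is the A/B signal ratio expected at equal allele amounts, and the corrected frequency is A/(A + kB). The function names are hypothetical, and averaging the per-sample ratios is only one of several possible ways to obtain k; the exact formulas used in the cited studies may differ.

```python
from statistics import mean


def k_from_heterozygotes(het_ras_values):
    """Estimate k as the mean A/B signal ratio across heterozygous references.

    Assumption: RAS = A / (A + B), so the A/B ratio is RAS / (1 - RAS).
    """
    ratios = [ras / (1.0 - ras) for ras in het_ras_values]
    return mean(ratios)


def k_from_known_pool(pool_ras, known_freq_a):
    """Estimate k from a reference pool whose true frequency of allele A is known.

    The estimate becomes unstable when known_freq_a approaches 0 or 1,
    because one of the two ratios is then divided by a value near zero.
    """
    observed_ratio = pool_ras / (1.0 - pool_ras)
    expected_ratio = known_freq_a / (1.0 - known_freq_a)
    return observed_ratio / expected_ratio


def corrected_frequency(pool_ras, k):
    """Return the k-corrected frequency of allele A for a pooled sample."""
    return pool_ras / (pool_ras + k * (1.0 - pool_ras))


if __name__ == "__main__":
    # Illustrative values only, not data from this study.
    k = k_from_heterozygotes([0.58, 0.61, 0.60])
    print(round(k, 3), round(corrected_frequency(0.40, k), 3))
```

The second function shows why a reference pool with a very low or very high allele frequency gives unreliable k values: small measurement errors in the RAS value are amplified when the expected ratio is close to zero or very large.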
Despite the reduction in feature number and feature size, the absolute error between real and estimated allele frequency with the 250 K array was as low as that for the 10 K array when using Simpson's k-correction. The correlation between real and estimated allele frequency was even higher with the 250 K array, and the standard deviation was lower. However, our results from the 10 K and the 250 K array are not directly comparable, because (a) pools were constructed from different DNA samples, (b) the experimental protocol was different, (c) different scanners were used for the two chips, and (d) the software used for data extraction was different.
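The accuracy measures compared above can be computed as in the following Python sketch. It assumes that "real" frequencies come from individual genotyping and "estimated" frequencies from the pool, and it reports the standard deviation of the absolute error, which is one possible reading of the standard deviation mentioned here. The function name and dictionary keys are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def accuracy_summary(real, estimated):
    """Summarize agreement between real and estimated allele frequencies."""
    if len(real) != len(estimated) or len(real) < 2:
        raise ValueError("need two equally long lists with at least 2 SNPs")

    abs_errors = [abs(r - e) for r, e in zip(real, estimated)]

    # Pearson correlation, written out to avoid version-specific helpers.
    mean_real, mean_est = mean(real), mean(estimated)
    cov = sum((r - mean_real) * (e - mean_est) for r, e in zip(real, estimated))
    ss_real = sum((r - mean_real) ** 2 for r in real)
    ss_est = sum((e - mean_est) ** 2 for e in estimated)

    return {
        "mean_abs_error": mean(abs_errors),
        "sd_abs_error": stdev(abs_errors),
        "pearson_r": cov / sqrt(ss_real * ss_est),
    }


if __name__ == "__main__":
    # Illustrative values only, not data from this study.
    print(accuracy_summary([0.10, 0.35, 0.50, 0.80], [0.12, 0.33, 0.55, 0.78]))
```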
As shown in Table 2, the accuracy of the allele frequency estimation improved with the number of pool replicates. However, the absolute error decreased by only 0.001 between three and four replicates. Therefore, we assume that adding further technical replicates would not substantially improve the accuracy. In our study, we used pools of identical samples. For a case-control study, however, it might be advantageous to use pools of independent samples to capture the variance between individuals. In this case, an increase in the number of replicates can improve the accuracy. With an increasing number of "AB" references, the error decreased, reaching 0.024 when 35 references were present. In our study, the mean error was smaller when the minor allele frequency was higher. This was also true for the 10 K results using the PPC algorithm, which is in contrast to the results published by Brohede et al. [1], where the best estimates were obtained at minor allele frequencies <0.1. Interestingly, the accuracy for A/T SNPs was found to be significantly worse than that for G/C SNPs on the 250 K array. This is probably due to the stronger hydrogen bonding of G-C pairs compared to A-T pairs. For the stability of the entire hybridization complex, a nonspecific hybridization with "A" or "T" is relatively less important than one with "G" or "C". Here we analyzed only one of the two 250 K arrays from the 500 K set. The only difference between the two arrays is the cleavage site in the first fragmentation step. Therefore, we assume that both arrays, Nsp and Sty, perform equally well.
Pooling of samples has several disadvantages compared to a case-control study analyzing individual genotypes: (a) associations that do not result in a significant change of the allele frequency can be overlooked; (b) measurement errors can lead to false results; (c) stratification of the population by age, sex, disease subtype, etc. has to be done before the analysis; (d) haplotype analysis is only possible under certain conditions [10, 11]; and (e) analysis of gene-gene interactions cannot be performed. However, with advancing technologies and algorithms, the mean measurement error can probably be reduced to values < 0.03 [1, 4]. The use of linkage information should improve the likelihood of finding "real" associations and help to detect false-positive SNPs. Taking the HapMap information (Build 35) for the 10 K array, we found ~30% of the SNPs to be linked to their downstream SNP (LOD >3); with the 500 K array set it was ~50%. With this high linkage, the allele frequency of one SNP can be partly explained by the allele frequency of a linked SNP. To take advantage of this fact, two recent publications propose to use p-value combinations in a sliding-window concept [9, 12]. With an increasing number of analyzed SNPs and better linkage information, most haplotypes can be explained by individual SNPs [13].
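The sliding-window idea mentioned above can be sketched in Python as follows. The combination rule shown is Fisher's method, which is an assumption made for illustration; the cited publications may combine p-values differently. Fisher's method also assumes independent tests, so for linked SNPs the combined p-values are too optimistic and should be treated as a ranking score rather than as calibrated significance levels. The function names and the window size are hypothetical.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine p-values with Fisher's method (assumes independent tests).

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose upper tail has a closed form for even df.
    All p-values must be greater than 0.
    """
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, exp(-half_stat) * total)


def sliding_window_p(p_values, window=3):
    """Return one combined p-value per window of consecutive SNPs."""
    return [
        fisher_combined_p(p_values[start:start + window])
        for start in range(len(p_values) - window + 1)
    ]


if __name__ == "__main__":
    # Illustrative single-SNP p-values ordered by chromosomal position.
    print(sliding_window_p([0.20, 0.01, 0.03, 0.02, 0.60], window=3))
```

A window in which several neighboring SNPs show moderately small p-values receives a smaller combined value than an isolated single-SNP signal, which is the behavior that makes the approach attractive for filtering false positives in pooling studies.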