Evaluating the performance of commercial whole-genome marker sets for capturing common genetic variation
BMC Genomics volume 8, Article number: 159 (2007)
New technologies have enabled genome-wide association studies to be conducted with hundreds of thousands of genotyped SNPs. Several different first-generation genome-wide panels of SNPs have been commercialized. The total amount of common genetic variation is still unknown; however, the coverage of commercial panels can be evaluated against reference population samples genotyped by the International HapMap project. Less information is available about coverage in samples from other populations.
In this study we compare four commercial panels: the HumanHap 300 and HumanHap 550 Array Sets from the Illumina Infinium series and the Mapping 100 K and Mapping 500 K Array Sets from the Affymetrix GeneChip series. Tagging performance is compared among HapMap CEPH (CEU), Asian (JPT, CHB) and Yoruba (YRI) population samples. It is also evaluated in an Estonian population sample with more than 1000 individuals genotyped in two 500-kbp ENCODE regions of chromosome 2: ENr112 on 2p16.3 and ENr131 on 2q37.1.
We found that in a non-reference Caucasian population, commercial SNP panels provide levels of coverage similar to those in the HapMap CEPH population sample. We present the proportions of universal and population-specific SNPs in all the commercial platforms studied.
Reduced genotyping costs and the availability of the International HapMap Project data have made genome-wide association studies possible [2, 3]. Multiple commercial SNP panels have been made available for large-scale studies. As the SNP selection strategies of these panels are different, it is important to know how well they can capture common variations in the human genome. Several studies have evaluated the "completeness" of these commercial panels on the HapMap population data [4–6]. The results of these studies indicate that most common SNPs are well captured, and despite substantial differences in marker selection strategies, the first-generation high-throughput platforms all offer similar levels of genome coverage [4, 5].
The completeness with which variation is captured must also be evaluated for different populations. Unfortunately, the ethnicities of many patients sampled for complex disease gene identification projects will not be sufficiently reflected in the reference populations (CEU, YRI, CHB and JPT) selected by the International HapMap project. In addition, the number of genotyped individuals in HapMap populations is quite small, leading to under-representation of SNPs with lower allele frequencies. Some commercial panels have been designed using the limited data from HapMap. In this study, we have evaluated the performance of these commercial panels on HapMap populations and on one non-HapMap sample containing a large number of Estonian individuals. Estonia is a Northern European country that has been influenced by many waves of migration from Europe and Russia.
Several studies have already been performed to evaluate how well other Caucasian population samples can be described by tagSNPs calculated from HapMap CEPH data [7–10]. The authors of one study found that in three out of four selected gene regions, the tagSNPs of the CEPH population worked well on other European populations (> 70% of markers had an r2 ≥ 0.8 with one of the CEPH tagSNPs). Another study found that 90–95% of Estonian SNPs with MAF > 5% have an r2 of at least 0.8 with one of the CEPH tagSNPs. In a third study, the authors suggest that CEPH samples provide an adequate basis for tagSNP selection in Finnish individuals. The study by Gonzalez-Neira et al. indicates that tagSNPs defined in Europeans are also efficient for describing Middle Eastern and Central/South Asian populations. Algorithms for tagging SNPs in multiple populations have been proposed by Howie et al.
In view of this information, the aim of our study is to determine how well the recent commercial genome-wide genotyping arrays capture genetic variation in reference HapMap populations and in one non-HapMap population.
The number of SNPs in the regions studied
One of our main aims was to compare the tagging performances of different commercial platforms on a non-HapMap population, specifically an Estonian population. As the Estonian individuals were genotyped in only two genomic regions, we had to limit the analysis to these regions. The Estonian genotypes in our study originated from one gene-poor and one gene-rich ENCODE region of chromosome 2 (ENr112 on 2p16.3 and ENr131 on 2q37.1, respectively). In these regions, the Yoruban, Asian and CEPH population samples contained 4540, 4495 and 4670 genotyped SNP assays, respectively (Table 1), in the final HapMap version 21. The number of genotyped SNPs in the Estonian population sample was 1420 (Table 2). These SNPs were randomly selected from the HapMap Phase I dataset. Among the CEPH, Asian and Estonian population samples the percentage of markers passing validation criteria was similar (49%, 48% and 54% for MAF ≥ 1%), but it was higher in the Yoruban population sample (68%), possibly because of the higher allelic diversity in African populations. Most of the SNPs that failed validation did so because of the low frequency of the minor allele (MAF < 1%).
Evaluating the performance of commercial marker sets in capturing the genetic variation of HapMap population samples
After selecting and validating SNPs, we compared the performances of commercial panels in two selected regions with those shown in other publications [4–6]. The comparison also gave us information about the performance of HumanHap 550 on HapMap populations that has not previously been published.
To evaluate the performance of the commercial panels, we identified, for each marker present in the HapMap data, the best tagging SNP from each commercial panel. We then calculated (a) the percentage of SNPs covered with r2 ≥ 0.8 and (b) the mean r2 between each marker and its best tagging SNP in the investigated population. This was done for all population samples with two minor allele frequency cut-offs (1% and 5%). As shown in Figure 1 A–B, all commercial whole-genome SNP sets have poor coverage on the Yoruban population, whereas coverage of the CEPH and Asian populations can reach 80–90% on HumanHap 550. In addition to coverage in the two ENCODE regions, the whole-genome coverage of the commercial SNP panels was evaluated as in the study by Barrett et al. 2006. The previously unpublished HumanHap 550 had the following whole-genome coverage estimates: CEU 86%, JPT + CHB 83%, YRI 48%. Among the technologies analyzed in this paper, HumanHap 550 had the best performance in all populations (Table 2). Its advantage over HumanHap 300 is increased coverage in non-European populations. For the other platforms, we observed coverage values nearly identical to previously published results (Table 3) despite some differences in data (HapMap ver.20 combined with Affymetrix genotypes on the HapMap samples vs. HapMap ver.21). The mean r2 for the whole genome is shown in Table 3; the mean r2 for the two ENCODE regions is shown in Figure 1 C–D. In Table 3, the r2 value expresses the mean r2 of all SNPs studied and, additionally, the r2 of "covered" SNPs, as in some previous studies. Here again, HumanHap 550 shows higher values than other platforms, although the increase over HumanHap 300 is not large on the CEPH population.
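The two summary statistics described above (coverage at r2 ≥ 0.8 and mean best r2) can be illustrated with a minimal sketch. It is not the Haploview computation used in this study: as a simplifying assumption, r2 is approximated here by the squared Pearson correlation of genotype dosages (0/1/2 minor-allele counts), panel SNPs are counted among the evaluated markers, and all marker names and genotypes are hypothetical.

```python
from statistics import mean


def r2(a, b):
    """Squared Pearson correlation of two genotype vectors (0/1/2 dosages).

    Assumption: used as a stand-in for the haplotype-based r2 reported by Haploview.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic marker: r2 is undefined, treated as 0 here
    return cov * cov / (var_a * var_b)


def best_tag_r2(genotypes, panel):
    """For every marker, the highest r2 with any SNP on the panel."""
    return {
        marker: max(r2(g, genotypes[tag]) for tag in panel)
        for marker, g in genotypes.items()
    }


def coverage_and_mean_r2(genotypes, panel, threshold=0.8):
    """Return (fraction of markers with best r2 >= threshold, mean best r2)."""
    best = best_tag_r2(genotypes, panel)
    covered = sum(1 for value in best.values() if value >= threshold)
    return covered / len(best), mean(best.values())


# Hypothetical toy data: four markers genotyped in six individuals.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1
    "snp3": [2, 1, 0, 2, 1, 0],  # alleles flipped relative to snp1
    "snp4": [0, 0, 1, 1, 2, 2],  # only weakly correlated with snp1
}
panel = ["snp1"]  # hypothetical one-SNP "commercial panel"

cov, mean_r2 = coverage_and_mean_r2(genotypes, panel)
print(f"coverage = {cov:.2f}, mean r2 = {mean_r2:.4f}")
```

In this toy example three of the four markers are captured by the single panel SNP, while the fourth lowers the mean r2 without counting towards coverage, which is why the two statistics are reported separately.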
Evaluating the performance of commercial marker sets in capturing the genetic variation in Estonian population samples
Since fewer SNPs were genotyped in the Estonian sample than in the HapMap populations, the mean r2 and coverage of the CEPH, Asian and Yoruban population samples could not be compared directly with the Estonian one. Many tagSNPs from the commercial panels were not genotyped in the Estonian sample so their pairwise LD could not be calculated for the Estonian markers. Our solution was to reduce the marker counts in the CEPH, Asian and Yoruban samples so that only the markers present in the Estonian dataset were used for pairwise LD calculation. By this means we could calculate the relative performances of the commercial platforms on the reduced SNP set (validated markers out of a total of 1420 genotyped in the Estonian population sample, see Table 2). The calculation was carried out for the CEPH, Asian, Yoruban and Estonian population samples and the results were expressed as fractions of the coverage of the CEPH sample (Figure 2A–D). The results show that the commercial products cover the SNPs investigated with the same efficiency in the Estonian, Asian and CEPH samples, but tagging performance was lower in the Yoruban sample.
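The reduction and normalization steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, marker identifiers and coverage values are hypothetical, and the coverage values themselves would come from a separate LD calculation on the reduced marker set.

```python
def reduce_to_shared(genotypes, shared_markers):
    """Keep only markers that were also genotyped in the smaller (Estonian) dataset."""
    return {m: g for m, g in genotypes.items() if m in shared_markers}


def relative_to_reference(coverage_by_population, reference="CEU"):
    """Express each population's coverage as a fraction of the reference coverage."""
    ref = coverage_by_population[reference]
    if ref == 0:
        raise ValueError("reference coverage is zero; ratio is undefined")
    return {pop: cov / ref for pop, cov in coverage_by_population.items()}


# Hypothetical marker sets: only rs1 and rs3 were typed in the smaller sample.
hapmap_genotypes = {"rs1": [0, 1, 2], "rs2": [1, 1, 0], "rs3": [2, 0, 1]}
reduced = reduce_to_shared(hapmap_genotypes, shared_markers={"rs1", "rs3"})

# Hypothetical coverage values computed on the reduced marker set.
coverage_reduced = {"CEU": 0.80, "EST": 0.80, "ASI": 0.76, "YRI": 0.40}
relative = relative_to_reference(coverage_reduced)
for population, ratio in relative.items():
    print(population, round(ratio, 2))
```

Expressing the results as ratios makes the populations comparable even though the absolute coverage on the reduced set is lower than on the full HapMap marker set.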
The fractions of universal and population-specific SNPs in commercial panels
It would be interesting to know how universal the commercial panels are for studying different populations. We counted the tagSNPs used for describing only one population and those that could identify SNPs from multiple populations (Figure 3 A–B). For each SNP in each population sample, the best-describing tagSNP from each of the commercial panels was identified. We then determined whether each commercial SNP was the best describer of SNPs in one, two or all three populations.
Thus we were able to compare the universality of coverage of the different commercial platforms in different populations. We observed a strong bias towards CEPH-specific markers in the HumanHap 300 panel. This can easily be explained in terms of the SNP selection strategy used: markers were picked according to the CEPH HapMap population data using the r2-based method, ensuring that the CEPH population has the best coverage, so the panel contains more CEPH-specific SNPs. In contrast, GeneChip 100 K and GeneChip 500 K describe population-specific markers from all three populations fairly equally.
Our results show that universal markers constitute 63–82% of all SNPs and these numbers are similar in all the commercial platforms studied. Approximately 10% of the SNPs in commercial panels describe SNPs from only a single population sample.
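The counting of universal and population-specific tagSNPs can be illustrated with a short sketch. It assumes one possible reading of the procedure: a panel SNP is assigned to a population if it is the best tag of at least one marker in that population. The input structure and all identifiers are hypothetical.

```python
from collections import Counter


def populations_per_tag(best_tag_by_population):
    """Map each panel SNP to the set of populations in which it is the best tag
    of at least one marker.

    best_tag_by_population: {population: {marker: best_tag_snp}}
    """
    result = {}
    for population, best_tags in best_tag_by_population.items():
        for tag in set(best_tags.values()):
            result.setdefault(tag, set()).add(population)
    return result


def universality_counts(pops_per_tag):
    """Count panel SNPs by the number of populations they describe (1, 2 or 3)."""
    return Counter(len(pops) for pops in pops_per_tag.values())


# Hypothetical best-tag assignments for three populations.
best_tag_by_population = {
    "CEU": {"m1": "t1", "m2": "t2"},
    "ASI": {"m1": "t1", "m2": "t1"},
    "YRI": {"m1": "t1", "m3": "t3"},
}

per_tag = populations_per_tag(best_tag_by_population)
counts = universality_counts(per_tag)
total = sum(counts.values())
print("universal fraction:", counts[3] / total)
print("single-population fraction:", counts[1] / total)
```

Here t1 is universal, whereas t2 and t3 each describe markers in a single population only.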
In this study, two 500 kb ENCODE regions (approximately 0.03% of the genome) were used to determine how efficiently a non-reference Caucasian population can be tagged by commercial SNP panels. As the whole-genome SNP coverage and the coverage of these two ENCODE regions are similar, we presume that these ENCODE regions are representative samples of the human genome. The Estonian genotype data contain fewer commercial panel SNPs than the HapMap data. Several commercial panel SNPs were therefore not genotyped, and the LD between them and the SNPs in the Estonian genotype data could not be calculated. The lower density of commercial panel SNPs might reduce both coverage and mean r2 values. To overcome this problem, similarly reduced HapMap datasets were created, and the Estonian results are expressed as a ratio to the CEU population results in Figure 2.
The results of our analysis show that the non-reference Caucasian population is tagged with the same efficiency as the CEPH population from HapMap. All non-African populations show similar levels of coverage in all commercial panels, irrespective of the SNP selection method for each platform. This is consistent with previous studies, which have shown that the CEPH population data from HapMap samples can successfully be used to tag other European population samples [7–10]. Other studies indicate that most of the common SNPs are captured by first-generation whole genome SNP panels [4, 5]. Combining these results, our study supports a further conclusion: commercial SNP panels can capture most of the common SNPs from non-reference European population samples. The new Illumina HumanHap 550 describes common markers slightly better than the smaller HumanHap 300 platform and reaches 86% coverage. Unfortunately, the remaining 14% of markers, which are covered only with r2 < 0.8, can be quite numerous. If we assume that we would like to cover circa 7.5 million markers overall, 14% corresponds to approximately one million poorly covered markers (0.14 × 7.5 million ≈ 1.05 million). Any of these could be the disease-causing SNP that we are looking for in whole-genome association studies. Our hope is that upcoming commercial platforms will be able to cover most of these currently uncovered SNPs with additional tagSNPs.
In contrast to the results of previous studies [4, 5], we observed equal or slightly smaller coverage in Asian and YRI population samples for Affymetrix 500 k than for Illumina HumanHap 300. However, this lower coverage may be due to the random variation of genomic regions; we used two 500 kb regions from the whole human genome. Some commercial panel SNPs can be used to tag markers from different populations. Other markers, however, are only useful for describing markers from a single population. The information about the universality of tagSNPs is important for planning association studies in non-HapMap populations. The markers that are able to tag different populations are expected to be useful in many populations. The fraction of universal markers (MAF > 1%) was found to be 72–82%.
We found that in a non-reference Caucasian population, commercial SNP panels offered similar levels of coverage to the HapMap CEPH population sample. Although the coverage of commercial SNP panels has been evaluated for the HapMap CEPH population sample in previous papers, our results indicate that it is also possible to use that information for other European populations. We present the performance calculations for HumanHap 550, which have not previously been published. The coverage of HumanHap 550 reaches 90% of CEPH markers and 45% of Yoruban markers. We also present an analysis of the fraction of markers on commercial platforms that is universal and the fraction that is population-specific.
Two previously resequenced 500-kb ENCODE regions on chromosome 2 (ENCODE 1: ENr112, NCBI Build 34 positions 51633239–52133238 on 2p16.3 and ENCODE 2: ENr131, NCBI Build 34 positions 234778639–235278638 on 2q37.1) were used in this study. These regions differ in their average recombination rates (0.8 cM/Mbp for ENCODE 1 and 2.1 cM/Mbp for ENCODE 2) and content of known genes (ENCODE 1 is a gene-poor region, whereas ENCODE 2 is a gene-rich region).
Overall, there are 2,431 and 2,067 SNPs in ENCODE 1 and ENCODE 2, respectively. These have been successfully genotyped in the HapMap project. From the two 500-kb ENCODE regions, 1420 SNPs were randomly selected and genotyped in 1090 samples from the Estonian Genome Project Foundation at McGill University and the Genome Quebec Innovation Centre, as part of the HapMap project, using the Illumina GoldenGate® Assay. The total number of monomorphic SNPs was set at 100 for each region in all four HapMap populations included in the selection process. The same genotype data have previously been used in a study by Montpetit et al.
For population comparisons, additional genotype data from the CEPH (CEU, Utah residents with northern and western European ancestry), Asian (ASI, mixed dataset of Japanese from the Tokyo area and Chinese from Beijing) and Yoruban (YRI, Yoruba people in Ibadan, Nigeria) populations of HapMap v. 21 were used, containing 4670, 4495 and 4540 SNPs, respectively, in these ENCODE regions.
The markers for all three populations were validated using the Haploview program. To pass validation, a marker had to have a genotyping success rate ≥ 95% and a Hardy-Weinberg equilibrium p-value ≥ 0.001. Two minor allele frequency cut-off levels (1% and 5%) were used to study how the results change when markers with low allele frequency are included.
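The validation criteria above can be expressed as a simple per-marker filter. The sketch below is not the Haploview implementation: as a labeled simplification, Hardy-Weinberg equilibrium is tested with a one-degree-of-freedom chi-square test, genotypes are assumed to be coded as 0/1/2 allele counts with `None` for failed calls, and all function names are hypothetical.

```python
import math


def call_rate(genos):
    """Fraction of individuals with a successful genotype call."""
    return sum(g is not None for g in genos) / len(genos)


def minor_allele_freq(genos):
    """Minor allele frequency among called genotypes (0/1/2 allele counts)."""
    called = [g for g in genos if g is not None]
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)


def hwe_pvalue(genos):
    """Chi-square (1 df) Hardy-Weinberg p-value; an assumed stand-in for Haploview's test."""
    called = [g for g in genos if g is not None]
    n = len(called)
    p = sum(called) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic: no deviation can be tested
    observed = [called.count(0), called.count(1), called.count(2)]
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df


def passes_validation(genos, min_maf=0.01, min_call_rate=0.95, min_hwe_p=0.001):
    """Apply the three thresholds: call rate, MAF cut-off and HWE p-value."""
    if all(g is None for g in genos):
        return False
    return (
        call_rate(genos) >= min_call_rate
        and minor_allele_freq(genos) >= min_maf
        and hwe_pvalue(genos) >= min_hwe_p
    )


# Hypothetical markers typed in 100 individuals.
in_equilibrium = [0] * 25 + [1] * 50 + [2] * 25
monomorphic = [0] * 100
all_heterozygous = [1] * 100
low_call_rate = [0] * 25 + [1] * 50 + [2] * 15 + [None] * 10

for name, genos in [("in_equilibrium", in_equilibrium), ("monomorphic", monomorphic),
                    ("all_heterozygous", all_heterozygous), ("low_call_rate", low_call_rate)]:
    print(name, passes_validation(genos))
```

Running the filter once with `min_maf=0.01` and once with `min_maf=0.05` corresponds to the two cut-off levels used in the analysis.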
TagSNP sets and evaluation of coverage
Information about the four evaluated commercial genome-wide genotyping arrays was retrieved from the manufacturers' websites: for the Infinium HumanHap 300 and HumanHap 550 Array Sets from Illumina, Inc., and for the Affymetrix GeneChip Mapping 100 K and the Mapping 500 K Array Sets from Affymetrix, Inc. For analyzing the two ENCODE regions in HapMap populations (Figures 1 and 3) the following numbers of commercial panel SNPs were used: HumanHap 300, 296 SNPs; HumanHap 550, 413 SNPs; GeneChip 100 k, 61 SNPs; GeneChip 500 k, 225 SNPs. For analyzing the Estonian dataset together with the reduced HapMap dataset (Figure 2) the following numbers of commercial panel SNPs were used: HumanHap 300, 118 SNPs; HumanHap 550, 161 SNPs; GeneChip 100 k, 22 SNPs; GeneChip 500 k, 86 SNPs. Marker validation and LD calculations were performed using the Haploview program.
Coverage numbers shown in Figure 1 and Table 3 were measured as the fraction of markers that had pairwise r2 ≥ 0.8 with their best tagSNP from the given commercial panel and its captured SNPs. To correct for the overestimate of coverage, we used the same correction as described by Barrett et al. 2006.
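The correction is not spelled out in this text. As a hedged sketch, it is assumed here to have the form given by Barrett and Cardon, corrected coverage = (L / (R − T) × (G − T) + T) / G, where L is the number of non-panel reference SNPs captured at r2 ≥ 0.8, R the number of common SNPs in the reference panel, T the number of panel SNPs among them and G the assumed total number of common SNPs in the genome. The function name and the numbers in the example are hypothetical.

```python
def corrected_coverage(captured_non_panel, reference_total, panel_in_reference, genome_total):
    """Extrapolate coverage from the reference panel to all common SNPs.

    Assumed form: (L / (R - T) * (G - T) + T) / G, with
      L = captured_non_panel, R = reference_total,
      T = panel_in_reference, G = genome_total.
    """
    L, R, T, G = captured_non_panel, reference_total, panel_in_reference, genome_total
    if R <= T:
        raise ValueError("reference panel must contain SNPs that are not on the array")
    if L > R - T:
        raise ValueError("cannot capture more non-panel SNPs than exist in the reference")
    return (L / (R - T) * (G - T) + T) / G


# Hypothetical counts: 1000 reference SNPs, 200 of them on the array,
# 600 of the remaining 800 captured, 10000 common SNPs assumed in total.
naive = (600 + 200) / 1000
corrected = corrected_coverage(600, 1000, 200, 10000)
print(f"naive = {naive:.3f}, corrected = {corrected:.3f}")
```

The naive estimate is inflated because panel SNPs, which capture themselves perfectly, are over-represented among the reference SNPs; the correction applies the capture rate of non-panel SNPs to the much larger set of untyped common SNPs.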
To analyze how effectively the markers of different tag sets have been put to use, we determined the counts of tagSNPs used to describe each population and tagSNPs that could tag SNPs from multiple populations.
The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.
Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95-108. 10.1038/nrg1521.
Wang WY, Barratt BJ, Clayton DG, Todd JA: Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet. 2005, 6: 109-118. 10.1038/nrg1522.
Barrett JC, Cardon LR: Evaluating coverage of genome-wide association studies. Nat Genet. 2006, 38: 659-662. 10.1038/ng1801.
Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D, Daly MJ: Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet. 2006, 38: 663-667. 10.1038/ng1816.
Nicolae DL, Wen X, Voight BF, Cox NJ: Coverage and characteristics of the Affymetrix GeneChip Human Mapping 100K SNP set. PLoS Genet. 2006, 2: e67. 10.1371/journal.pgen.0020067.
Montpetit A, Nelis M, Laflamme P, Magi R, Ke X, Remm M, Cardon L, Hudson TJ, Metspalu A: An evaluation of the performance of tag SNPs derived from HapMap in a Caucasian population. PLoS Genet. 2006, 2: e27. 10.1371/journal.pgen.0020027.
Mueller JC, Lohmussaar E, Magi R, Remm M, Bettecken T, Lichtner P, Biskup S, Illig T, Pfeufer A, Luedemann J, Schreiber S, Pramstaller P, Pichler I, Romeo G, Gaddi A, Testa A, Wichmann HE, Metspalu A, Meitinger T: Linkage disequilibrium patterns and tagSNP transferability among European populations. Am J Hum Genet. 2005, 76: 387-398. 10.1086/427925.
Willer CJ, Scott LJ, Bonnycastle LL, Jackson AU, Chines P, Pruim R, Bark CW, Tsai YY, Pugh EW, Doheny KF, Kinnunen L, Mohlke KL, Valle TT, Bergman RN, Tuomilehto J, Collins FS, Boehnke M: Tag SNP selection for Finnish individuals based on the CEPH Utah HapMap database. Genet Epidemiol. 2006, 30: 180-190. 10.1002/gepi.20131.
Gonzalez-Neira A, Ke X, Lao O, Calafell F, Navarro A, Comas D, Cann H, Bumpstead S, Ghori J, Hunt S, Deloukas P, Dunham I, Cardon LR, Bertranpetit J: The portability of tagSNPs across populations: a worldwide survey. Genome Res. 2006, 16: 323-330. 10.1101/gr.4138406.
Howie BN, Carlson CS, Rieder MJ, Nickerson DA: Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum Genet. 2006, 120: 58-68. 10.1007/s00439-006-0182-5.
Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74: 106-120. 10.1086/381000.
Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2004
Illumina Inc. [http://www.illumina.com]
Affymetrix Inc. [http://www.affymetrix.com]
We thank Elin Org for valuable comments on the manuscript and Jody Novakoski for valuable help with English grammar. This work was supported by the Estonian Ministry of Education and Research grants 0182649s04 and 0182582s03, Enterprise Estonian RD project EU19955 and Biospinno II to the Estonian Biocentre. The genotyping of the Estonian samples was made possible by a grant from Genome Canada and Genome Quebec to Prof. T. Hudson.
RM performed the statistical analysis, created the figures and drafted the manuscript. AP initiated and helped to design the study, provided SNP data and was involved in drafting the manuscript. MN and AMo carried out the genotyping of the Estonian population samples under the supervision of AMe. MR participated in the design of the study and wrote the final version of the results and discussion. All authors read and approved the final manuscript.
Mägi, R., Pfeufer, A., Nelis, M. et al. Evaluating the performance of commercial whole-genome marker sets for capturing common genetic variation. BMC Genomics 8, 159 (2007). https://doi.org/10.1186/1471-2164-8-159