Development of admixture mapping panels for African Americans from commercial high-density SNP arrays

Background: Admixture mapping is a powerful approach for identifying genetic variants involved in human disease that exploits the unique genomic structure in recently admixed populations. To use existing published panels of ancestry-informative markers (AIMs) for admixture mapping, markers have to be genotyped de novo for each admixed study sample and samples representing the ancestral parental populations. The increased availability of dense marker data on commercial chips has made it feasible to develop panels wherein the markers need not be predetermined.

Results: We developed two panels of AIMs (~2,000 markers each) based on the Affymetrix Genome-Wide Human SNP Array 6.0 for admixture mapping with African American samples. These two AIM panels had good map power that was higher than that of a denser panel of ~20,000 random markers as well as other published panels of AIMs. As a test case, we applied the panels in an admixture mapping study of hypertension in African Americans in the Washington, D.C. metropolitan area.

Conclusions: Developing marker panels for admixture mapping from existing genome-wide genotype data offers two major advantages: (1) no de novo genotyping needs to be done, thereby saving costs, and (2) markers can be filtered for various quality measures and replacement markers (to minimize gaps) can be selected at no additional cost. Panels of carefully selected AIMs have two major advantages over panels of random markers: (1) the map power from sparser panels of AIMs is higher than that of ~10-fold denser panels of random markers, and (2) clusters can be labeled based on information from the parental populations. With current technology, chip-based genome-wide genotyping is less expensive than genotyping ~20,000 random markers. The major advantage of using random markers is the absence of ascertainment effects resulting from the process of selecting markers. The ability to develop marker panels informative for ancestry from SNP chip genotype data provides a fresh opportunity to conduct admixture mapping for disease genes in admixed populations when genome-wide association data exist or are planned.


Background
Admixture mapping is an approach for localizing disease susceptibility loci that attempts to capitalize on the long-range linkage disequilibrium occurring in populations formed by recent mixing of ancestral populations [1][2][3][4][5][6]. The approach uses samples from recently admixed populations to detect susceptibility loci at which the risk alleles have different frequencies in the ancestral parental populations. Admixture mapping is an economical and theoretically powerful approach. Compared to linkage, admixture mapping does not require families and has more power. Compared to association, admixture mapping requires ~200-500-fold fewer markers, is not susceptible to allelic heterogeneity, and can be used with either case-only or case-control study designs. Admixture mapping can also be performed with generalized linear models to accommodate quantitative traits [1]. Admixture mapping has been performed for many complex traits that exhibit strong differences in prevalence across ethnicities, such as end-stage renal disease [7,8], hypertension [9][10][11], multiple sclerosis [12], obesity [13][14][15], peripheral arterial disease [16], prostate cancer [17,18], rheumatoid arthritis [19], serum inflammatory markers [20], systemic lupus erythematosus [21], type 2 diabetes [22], and white blood cell count [23].
Several groups have built panels of ancestry-informative markers (AIMs) based on multiple databases of human genetic variation [24][25][26][27]. Previously, admixture mapping required the construction of panels of AIMs based on screening large reference sets of genetic variation for ancestry-informative markers followed by de novo genotyping at those preselected markers in the admixed study sample and the samples representing the (putative) ancestral parental populations [12,20,23,26,28,29]. However, given commercially available high-density marker arrays, it is now possible to construct customized panels from markers already genotyped in the admixed study sample(s) [27][30][31][32][33][34].
In this study, we constructed marker panels for admixture mapping with African American populations, starting from the Affymetrix Genome-Wide Human SNP Array 6.0, which probes variation at 909,508 single-nucleotide polymorphisms (SNPs). Using genome-wide genotypes in our study sample of African Americans, already experimentally determined for genome-wide association studies, and HapMap data to represent the presumed ancestral parental populations, we constructed one panel consisting of SNPs with large differences in allele frequencies between the ancestral parental populations and a second panel consisting of SNPs with large F ST values between the ancestral parental populations. We also constructed a panel consisting of random markers not selected to be ancestrally informative. Characteristics of these panels, including the number of markers and information content, are presented. As a test case, we apply these panels to a study of hypertension in African Americans.

Study Population
The admixed population under study comprised participants in the Howard University Family Study (HUFS) from the Washington, D.C. metropolitan area [35]. The first phase of recruitment involved enrolling and examining a randomly ascertained cohort of African American families with members in multiple generations. To facilitate nested case-control study designs, additional unrelated individuals from the same geographic area were enrolled in a second phase of recruitment. Participants were not ascertained based on any phenotypes.
Participants were interviewed and measured for various anthropometric and clinical variables. Blood pressure was measured in the sitting position using an oscillometric device (Omron Healthcare, Kyoto, Japan). Three readings were taken with a ten-minute interval between readings. The reported systolic and diastolic blood pressure readings were the average of the second and third readings. Hypertension case status was defined as systolic blood pressure ≥ 140 mmHg, or diastolic blood pressure ≥ 90 mmHg, or treatment with antihypertensive medication. We identified a subset of 1,017 unrelated individuals, including 509 hypertensive cases and 508 controls, for use in admixture mapping.
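The case definition above can be written as a small rule. The following is a minimal sketch only; the function name `is_hypertensive` and the input layout (three `(systolic, diastolic)` tuples in the order taken) are assumptions made for illustration and are not taken from the study's own software.

```python
def is_hypertensive(readings, on_medication):
    """Apply the case definition described in the text.

    readings: three (systolic, diastolic) tuples in mmHg, in the order taken
              (hypothetical input layout).
    on_medication: True if treated with antihypertensive medication.
    """
    # The reported values are the average of the second and third readings.
    sbp = (readings[1][0] + readings[2][0]) / 2
    dbp = (readings[1][1] + readings[2][1]) / 2
    # Case if SBP >= 140 mmHg, or DBP >= 90 mmHg, or on treatment.
    return bool(sbp >= 140 or dbp >= 90 or on_medication)
```

Note that the first reading is ignored, so an elevated first measurement alone does not make a participant a case under this rule.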
Genome-wide genotyping in the HUFS was performed using the Affymetrix Genome-Wide Human SNP Array 6.0. DNA samples were prepared and hybridized following the manufacturer's instructions [35]. Genotype calls were made using the Birdseed algorithm, version 2 [36]. We had four inclusion criteria: the individual sample call rate had to be ≥ 95% (no samples excluded), the SNP call rate had to be ≥ 95% (41,885 SNPs excluded), the minor allele frequency had to be ≥ 0.01 (19,154 SNPs excluded), and the p-value for the test of Hardy-Weinberg equilibrium (HWE) had to be ≥ 1.0 × 10⁻³ (6,317 SNPs excluded). After filtering, 842,074 autosomal and X chromosomal SNPs remained.
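The SNP-level filters can be illustrated with a short sketch. The thresholds come from the text; everything else is an assumption: the function name `snp_passes_qc`, the genotype coding (0, 1, 2 copies of one allele, `None` for a missing call), and in particular the use of a 1-degree-of-freedom chi-square test for HWE, since the text does not say which HWE test was applied.

```python
import math

def snp_passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-3):
    """Return True if one SNP passes call-rate, MAF, and HWE filters.

    genotypes: list of 0/1/2 allele counts, None for missing (assumed coding).
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) < min_call_rate:
        return False
    n = len(called)
    n_aa, n_ab, n_bb = called.count(0), called.count(1), called.count(2)
    p = (2 * n_bb + n_ab) / (2 * n)          # frequency of the counted allele
    q = 1 - p
    if min(p, q) < min_maf:
        return False
    # Assumed HWE test: Pearson chi-square with 1 degree of freedom.
    observed = (n_aa, n_ab, n_bb)
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return p_value >= min_hwe_p
```

An exact test would be a reasonable alternative for SNPs with low minor allele counts, where the chi-square approximation is poor.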

δ and F ST Calculations
For a given SNP, δ was calculated as the absolute difference in allele frequencies in the CEU and YRI data, $\delta = |p_{CEU} - p_{YRI}|$. Wright [37]
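As a minimal sketch of these two quantities: δ follows the definition given in the text, whereas the F ST formula below is an assumption (Wright's F ST for two equally weighted populations, with no sample-size correction), because the estimator actually used in the study is not shown in this section. The function names `delta` and `fst_two_pop` are hypothetical.

```python
def delta(p_ceu, p_yri):
    """Absolute allele-frequency difference between the two populations."""
    return abs(p_ceu - p_yri)

def fst_two_pop(p_ceu, p_yri):
    """Assumed estimator: variance of allele frequency across the two
    populations divided by p_bar * (1 - p_bar), which simplifies to
    delta^2 / (4 * p_bar * (1 - p_bar))."""
    p_bar = (p_ceu + p_yri) / 2
    if p_bar == 0.0 or p_bar == 1.0:
        return 0.0  # monomorphic in both populations: no differentiation
    return (p_ceu - p_yri) ** 2 / (4 * p_bar * (1 - p_bar))
```

Under this assumed formula, a fixed δ = 0.6 gives F ST = 0.36 when the frequencies are symmetric around 0.5 (0.8 and 0.2) and about 0.429 when one allele is absent from one population (0.6 and 0.0).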

Genetic Map of SNPs
The Rutgers Combined Linkage-Physical Map of the Human Genome was used to locate markers on the genetic map (in cM) given positions on the physical map (in bp). The positions of SNPs on the genetic map were obtained using a web-based application (http://integrin.ucd.ie/cgi-bin/rs2cm.cgi).
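Converting a physical position to a genetic position is commonly done by interpolating between flanking markers of known map position. The sketch below assumes simple linear interpolation; this is an illustration of the general idea, not a description of how the web application cited above works. The function name `interpolate_cm` and its arguments are hypothetical.

```python
from bisect import bisect_right

def interpolate_cm(bp, map_bp, map_cm):
    """Linearly interpolate a genetic position (cM) for a physical position.

    map_bp: strictly increasing physical positions of mapped markers (bp).
    map_cm: genetic positions (cM) of the same markers.
    Positions outside the mapped range are clamped to the nearest end.
    """
    if bp <= map_bp[0]:
        return map_cm[0]
    if bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, bp)            # first mapped marker right of bp
    x0, x1 = map_bp[i - 1], map_bp[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)
```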

Selection of Ancestry-Informative Markers from HapMap Data
We followed a six-step process to select AIMs. First, we selected SNPs for which the minor allele frequency was ≥ 0.01 in both ancestry populations (CEU and YRI). Second, we filtered for SNPs for which δ ≥ 0.6 between CEU and YRI. Third, we divided each chromosome into consecutive, non-overlapping bins of size 1 Mb and sorted the SNPs within each bin in descending order according to the δ values. Fourth, for each chromosome, we estimated pairwise correlations between the top-ranked SNPs across the bins. Fifth, for each pair of SNPs, if r² ≥ 0.4 in either the CEU or YRI sample, we discarded the SNP with the smaller δ value from its bin and promoted all remaining SNPs in that bin. If δ values were equal (to the fourth decimal place), we discarded the distal SNP. We iterated steps 4-5 until r² < 0.4 in both the CEU and YRI samples for all pairs of top-ranked SNPs per bin. The resulting panel comprised 2,076 AIMs. We repeated this entire process based on F ST ≥ 0.4, yielding a second panel consisting of 1,923 AIMs. Given δ = 0.6, the allowable values of $F_{ST}$ range from $\delta^2 = 0.36$ to $\delta/(2 - \delta) = 0.429$ [39]. Similarly, given $F_{ST} = 0.4$, the allowable values of δ range from $2F_{ST}/(1 + F_{ST}) = 0.571$ to $\sqrt{F_{ST}} = 0.632$ [39]. These calculations show the comparability of the two thresholds.
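The selection procedure can be sketched roughly as follows. This is an illustrative reading of the steps described above, not the authors' code. The names `select_aims` and `worse`, the dictionary keys, and the `r2` callable (assumed to return the larger of the CEU and YRI r² values for a pair of SNPs) are all hypothetical. "Distal" is interpreted here as the SNP with the larger base-pair position, which is an assumption.

```python
from collections import defaultdict
from itertools import combinations

BIN_SIZE = 1_000_000  # 1 Mb bins, as in the text

def worse(a, b):
    """Pick the SNP to discard: smaller delta, or the assumed-distal one on ties."""
    da, db = round(a["delta"], 4), round(b["delta"], 4)
    if da != db:
        return a if da < db else b
    return a if a["pos"] > b["pos"] else b

def select_aims(snps, r2, min_maf=0.01, min_delta=0.6, max_r2=0.4):
    # Steps 1-2: MAF filter in both parental samples, then delta filter.
    kept = [s for s in snps
            if s["maf_ceu"] >= min_maf and s["maf_yri"] >= min_maf
            and s["delta"] >= min_delta]
    # Step 3: 1 Mb bins per chromosome, sorted by descending delta.
    bins = defaultdict(list)
    for s in kept:
        bins[(s["chrom"], s["pos"] // BIN_SIZE)].append(s)
    for members in bins.values():
        members.sort(key=lambda s: (-round(s["delta"], 4), s["pos"]))
    # Steps 4-5: drop the worse SNP of any correlated top-ranked pair,
    # promote the next SNP in that bin, and repeat until no pair conflicts.
    changed = True
    while changed:
        changed = False
        tops = defaultdict(list)
        for (chrom, _), members in bins.items():
            if members:
                tops[chrom].append(members[0])
        for chrom_tops in tops.values():
            for a, b in combinations(chrom_tops, 2):
                if r2(a, b) >= max_r2:
                    loser = worse(a, b)
                    bins[(loser["chrom"], loser["pos"] // BIN_SIZE)].pop(0)
                    changed = True
                    break
            if changed:
                break
    # The panel is the surviving top-ranked SNP of each non-empty bin.
    return sorted((m[0] for m in bins.values() if m),
                  key=lambda s: (s["chrom"], s["pos"]))
```

The sketch restarts the pairwise scan after every removal, which is simple but quadratic per pass; a production implementation would likely restrict the comparisons to nearby bins.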

Information Content and Map Power
We calculated the Shannon information content (SIC). For a locus i and individual j, X ij was defined as the entropy of the locus-specific ancestry estimate and G j was defined as the entropy of the genome-wide ancestry estimate. The relative power at locus i, r i , was defined from these entropies. If X ij = G j for all j, then r i = 0 and there is no additional information about local ancestry beyond information about genome-wide ancestry. If X ij = 0 for all j, then r i = 1 and there is perfect information for local ancestry [31]. The statistic r i and the average of r i across loci, r avg , were estimated using ANCESTRYMAP [3]. Relative to a study with perfect information about local ancestry (r avg = 1), 1/r avg times as many samples must be genotyped to achieve comparable power [31].
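One form of the relative power that is consistent with both boundary cases described above is r i = 1 − (Σ j X ij) / (Σ j G j). The sketch below uses that form as an explicit assumption; the displayed definition is not reproduced in this section, and ANCESTRYMAP's internal calculation may differ. The function names and the representation of an ancestry estimate as a probability vector are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def relative_power(local_posteriors, genome_wide_priors):
    """Assumed form: r_i = 1 - sum_j X_ij / sum_j G_j.

    local_posteriors:   per-individual probability vectors for ancestry
                        at locus i (their entropies are X_ij).
    genome_wide_priors: per-individual probability vectors implied by
                        genome-wide ancestry (their entropies are G_j).
    Requires at least one individual with non-zero genome-wide entropy.
    """
    x_total = sum(entropy(p) for p in local_posteriors)
    g_total = sum(entropy(p) for p in genome_wide_priors)
    return 1 - x_total / g_total
```

When the local estimate is no sharper than the genome-wide one, the result is 0; when local ancestry is known with certainty for every individual, the result is 1.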

Estimation of Individual Admixture and Population Structure
We used the variance inflation factor (VIF) to prune markers in linkage disequilibrium (LD). The VIF is equal to $1/(1 - R^2)$, where $R^2$ is the multiple correlation coefficient between the index SNP and all other SNPs in the window. A VIF of 1 implies that the index SNP is completely independent of all other SNPs. Starting from a common set of SNPs passing quality control among the HapMap CEU, HapMap YRI, and HUFS data sets, we used LD-based pruning (VIF threshold of 1.1, window size of 50 SNPs, window slide of 5 SNPs) to generate a set of 74,546 SNPs with minimal LD between the markers. We then randomly selected one-third of the SNPs to obtain a random marker panel (21 k random panel) that had 10-fold greater marker density than the AIMs panels. We also generated an additional panel (2 k random panel) by randomly sub-sampling 10% of the 21 k random panel to match the marker density of the AIMs panels. We examined clustering using a parametric approach implemented in STRUCTURE [40] and a nonparametric approach implemented in AWclust [41]. Analysis was performed in STRUCTURE without any prior population assignment and was performed ten times for each number of clusters (K), with 10,000 burn-in steps and a run length of 10,000 steps under the admixture model. We recorded the log likelihood of each analysis conditional on K estimated by STRUCTURE. Compared with this parametric approach, the nonparametric approach in AWclust [41] uses allele-sharing distance (ASD) and Ward's minimum variance algorithm to cluster the individuals in the ASD matrix. AWclust does not assume Hardy-Weinberg equilibrium or linkage equilibrium and does not require allele frequency estimates. We varied K from one to six in both programs.
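To make the pruning threshold concrete, the standard definition VIF = 1/(1 − R²) can be inverted to find the largest multiple R² that a retained SNP may have. The helper names below are hypothetical; only the threshold value of 1.1 comes from the text.

```python
def vif(r_squared):
    """Variance inflation factor for a given multiple R^2 (0 <= R^2 < 1)."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1 / (1 - r_squared)

def max_r_squared(vif_threshold):
    """Largest R^2 compatible with a VIF threshold: R^2 = 1 - 1/VIF."""
    if vif_threshold < 1:
        raise ValueError("VIF is never below 1")
    return 1 - 1 / vif_threshold
```

With a threshold of 1.1, `max_r_squared(1.1)` is about 0.091, so a SNP is removed once the other SNPs in its window jointly explain more than roughly 9% of its variance. This is a strict setting, which fits the stated goal of minimal LD between the retained markers.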

Application of the panels to a study of hypertension
Two statistics were used to test for the presence of disease loci using ANCESTRYMAP [3]. One was the locus-genome statistic, which compared the admixture proportion at one locus with the genome-wide average among cases only. The locus-genome statistic was tested via a likelihood-ratio statistic, i.e., the ratio of the likelihood of a locus being a disease locus to the likelihood of the locus not being a disease locus. The LOD score was defined as the likelihood-ratio test statistic divided by 2 ln(10). The genome-wide significance threshold of the LOD score was set at 2 [3]. The other statistic was the case-control statistic, which compared cases with controls at every point in the genome, testing for differences in ancestry estimates. A deviation from the genome-wide average of one parental population ancestry seen in cases but not in controls provided evidence of a disease locus. The case-control statistic followed the standard normal distribution under the null hypothesis that a locus was not a disease locus. The genome-wide significance threshold of the z-statistic was set at ± 4.2 for the two panels of AIMs and ± 4.7 for the panel based on random markers. We specified in the disease model that the relative risk for hypertensive heart disease among African Americans was 2.80 compared to European Americans [26].
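The two decision rules above reduce to simple arithmetic. The sketch below only restates the conversions and thresholds given in the text; the function names are hypothetical and nothing here reproduces ANCESTRYMAP's internal calculations.

```python
import math

def lod_from_lrt(lrt):
    """LOD score = likelihood-ratio test statistic / (2 ln 10)."""
    return lrt / (2 * math.log(10))

def locus_genome_significant(lrt, lod_threshold=2.0):
    """Genome-wide significance for the locus-genome statistic (LOD >= 2)."""
    return lod_from_lrt(lrt) >= lod_threshold

def case_control_significant(z, aim_panel=True):
    """Genome-wide significance for the case-control z-statistic:
    |z| >= 4.2 for the AIM panels, |z| >= 4.7 for the random-marker panel."""
    threshold = 4.2 if aim_panel else 4.7
    return abs(z) >= threshold
```

A LOD threshold of 2 corresponds to a likelihood-ratio test statistic of 4 ln(10), about 9.21.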

Marker Panels for Admixture Mapping in African Americans
The distribution of SNPs across the two AIMs panels (one based on δ containing 2,076 AIMs (Additional file 1), the other based on F ST containing 1,923 AIMs (Additional file 2)) and the two random marker panels (21 k random marker panel and 2 k random marker panel, Additional file 3) is shown in Table 1. The panels covered all 22 autosomes and the X chromosome (Table 1). All marker panels showed lower heterozygosities in the parental samples than in the admixed sample, with the two panels of AIMs showing ascertainment effects of lower heterozygosities in the parental samples and higher heterozygosity in the admixed sample (Table 2). Scatter plots of allele frequencies for AIMs showed clear differentiation of the two parental populations (Figure 1), as did the STRUCTURE plot assuming K = 2 populations (Figure 2) and the AWclust plot (Additional file 4). Excluding centromeres, the average inter-marker distance was 1.33 cM for the panel based on δ, 1.43 cM for the panel based on F ST , 0.124 cM for the panel based on

The two panels of AIMs shared 1,745 markers. The remaining markers (331 in the panel based on δ, 178 in the panel based on F ST ) showed no significant difference in Shannon information content (SIC) (t-test, p = 0.10). The δ and F ST values in the two panels were highly positively correlated (r = 0.92, p < 0.0001). The δ in the panel based on δ was significantly higher than

Sample Characteristics
The genome-wide average F ST between HUFS and YRI was 0.0295, indicating little population differentiation. The genome-wide average F ST was 0.0656 between HUFS and CEU and 0.0753 between CEU and YRI, both indicating moderate population differentiation. As expected, these results indicated that our admixed HUFS sample was more similar to YRI than CEU, i.e., the proportion of African ancestry exceeded the proportion of European ancestry. Similarly, principal coordinate analysis showed that the HUFS sample was intermediate between the two ancestral parental populations and on average closer to YRI than CEU (Additional file 4). The estimated proportions of African ancestry in the HUFS sample using ANCESTRYMAP were 0.81 ± 0.11 and 0.84 ± 0.08 for the autosomes and the X chromosome, respectively.

Admixture Information Content
We evaluated the informativeness of the two panels of random markers compared to the informativeness of the two panels of AIMs (Figures 3 and 4). The proportions of markers in the panel based on 2 k random markers for which r i ≥ 0.50, r i ≥ 0.75, and r i ≥ 0.80 were 0.19%, 0%, and 0%, respectively, and the panel had a map power of r avg = 0.13 (Figures 3 and 4). These estimates indicate that the two panels of AIMs extracted more ancestry information than a 10-fold denser panel of random markers and much more than the 2 k random marker panel.
Using the r avg statistic, one would need to study

We constructed panels conditional on approximate linkage equilibrium over 1 Mb bins. Our iterative pruning procedure was designed to avoid gaps in coverage and to eliminate background linkage disequilibrium. To compare our panels with previously published panels, we obtained two panels of AIMs developed for African Americans by Tian et al. [28]. From their panel of 4,222 AIMs, 682 AIMs were in common with the CEU, YRI, and HUFS data sets and all 682 AIMs passed quality control. Similarly, 321 AIMs from their panel of 2,000 AIMs were in common with the CEU, YRI, and HUFS data sets and all 321 AIMs passed quality control. As a result of the substantial reduction in marker density, the map power was reduced for both panels of Tian et al. using our HUFS data set (Table 3). The substantial reduction in marker density occurred because the panels of Tian et al. were developed independently of the Affymetrix chip we used for genotyping our sample and there was little overlap between the SNPs in their panels and those on the chip. To investigate whether this limitation also applied to another African American data set, we obtained the HapMap phase III ASW data. In the ASW data set, 50% of the AIMs in either panel of Tian et al. were present, compared to > 98% of the AIMs from our panels, whereas almost every AIM present in the data passed quality control (Table 4). These comparisons highlight the advantage of being able to customize a panel using preexisting GWAS genotypes, especially for filling in gaps to improve coverage.

Application of the Admixture Panels
As an example of applying our newly developed panels, we investigated hypertension in the HUFS. The relative risk for hypertensive heart disease among African Americans was 2.80 compared to European Americans [26]. Averaged genome-wide, the individual proportion of European ancestry was 0.192 ± 0.098, 0.193 ± 0.098, and 0.264 ± 0.106 among normotensive subjects and 0.196 ± 0.119, 0.196 ± 0.119, and 0.268 ± 0.109 among hypertensive subjects, for the panels based on δ, F ST , and 21 k random markers, respectively. Although this result suggests that most of the differential risk in hypertension is probably not explainable by genetics, it does not preclude specific loci from significantly contributing to differential risk. Assuming the hybrid isolation model, i.e., a single generation of admixture with no subsequent gene flow, the estimated number of generations since the original admixture event was 7.44 ± 3.35, 7.33 ± 3.01, and 8.65 ± 5.31 for the panels based on δ, F ST , and 21 k random markers, respectively.
We performed admixture mapping using both the locus-genome and case-control statistics for hypertension in the HUFS data. No marker reached genome-wide significance for hypertension case/control status using ANCESTRYMAP (Figure 5). Using a pairwise score test for markers shared between the two AIM panels, no significant difference was found between the panels (p = 0.8616 for the locus-genome statistics, p = 0.3087 for the case-control statistics). Similarly, using a t-test for AIMs not shared between the two panels, no significant difference was found between the panels (p = 0.6099 for the locus-genome statistics, p = 0.5607 for the case-control statistics).

Discussion
In this study, we constructed panels of markers with variable informativeness for ancestry in admixed African Americans. We had previously genotyped our sample using the Affymetrix Genome-Wide Human SNP Array 6.0 for genome-wide association studies. Repurposing markers for admixture mapping eliminates the need for de novo genotyping. After linkage disequilibrium-based pruning, we constructed a set of 2,076 uncorrelated markers with large differences in allele frequencies and another set of 1,923 uncorrelated markers with large F ST values. Using these ancestry-informative markers, we estimated that the proportion of European ancestry in our sample of 1,017 unrelated African Americans from Washington, D.C. was 0.19 ± 0.11 for both panels, comparable to an estimated proportion of 0.21 ± 0.11 in a sample of 442 African Americans with multiple sclerosis and 276 controls [3]. Using a set of 21 k random markers (i.e., not ascertained to be informative for ancestry) in our study yielded a slightly higher estimate of admixture proportions (0.266 ± 0.108). Although it is possible to perform genome-wide admixture mapping using panels of markers not preselected to be informative for ancestry [30], our results confirm that a few thousand AIMs can be used to estimate admixture proportions as efficiently as 10-fold more random markers. Admixed populations most commonly used in admixture mapping to date involve those formed by recent admixture between groups originating from different continents as a result of European maritime expansion during the past few hundred years [4].

Table 3 (fragment). Tian 4222 panel (4,222 AIMs): 682 AIMs (100%); map power 0.56 [28]. Footnote 1: We compared the panels using all autosomal AIMs with quality control criteria locus call rate ≥ 95%, minor allele frequency > 0.01, and HWE p ≥ 1.0 × 10⁻³. Footnote 2: Map power (r avg ) based on 1,017 individuals in the HUFS data set.
The number of generations since the original admixture event based on our sample of African Americans was estimated at 7.44 ± 3.35 and 7.33 ± 3.01 generations for the panels based on δ and F ST , respectively. This estimate is similar to previous estimates of 6.0 ± 1.6 [3], 6.3 ± 1.1 [26], and 7 [42]. Thus, these estimates are stable across different marker panels and different samples of African Americans.
The power of admixture mapping is affected by the information content of the marker map, the sample size, and admixture proportions. We estimated that both AIM panels had an average map power of 0.73 ± 0.08, which is similar to 0.71 ± 0.09 for a previously constructed panel of 2,154 AIMs in African Americans [26]. The two panels had higher map power than the panel of 21 k random markers, which had an average map power of 0.65 ± 0.08. For the locus-genome statistic, a sample size of 500 cases provides 70% power to detect a locus conferring 1.7-fold increased risk due to ancestry [3]. Our study sample size of 509 cases and 508 controls was underpowered for loci conferring 1.5-fold or less risk due to ancestry. Although the power of admixture mapping decreases in populations with a much larger contribution from only one parental population [26], the map power is fairly constant for values of admixture proportion from 10% to 90% [3]. Our estimated values of 19% European ancestry and 81% African ancestry both fall within this range.
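Because 1/r avg times as many samples are needed relative to a map with perfect ancestry information, the map power values reported in this paper translate directly into sample-size multipliers. The sketch below is plain arithmetic on those reported values; the function name `sample_inflation` is hypothetical, and the resulting multipliers are computed here for illustration rather than quoted from the study.

```python
def sample_inflation(r_avg):
    """Factor by which the sample must grow to match a map with r_avg = 1."""
    if not 0 < r_avg <= 1:
        raise ValueError("r_avg must lie in (0, 1]")
    return 1 / r_avg

# Map power values as reported in the text.
reported = {
    "AIM panels": 0.73,
    "21 k random markers": 0.65,
    "2 k random markers": 0.13,
}

for panel, r_avg in reported.items():
    print(f"{panel}: {sample_inflation(r_avg):.2f}x the samples")
```

On these figures the AIM panels need about 1.4 times the samples of a perfectly informative map, the 21 k random panel about 1.5 times, and the 2 k random panel nearly 8 times.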

Conclusions
We constructed two panels of AIMs for admixture mapping in African Americans from experimentally determined genotypes using the Affymetrix Genome-Wide Human SNP Array 6.0. We constructed the panels conditional on linkage equilibrium over 1 Mb bins. Our iterative pruning procedure was designed to avoid gaps in coverage and to eliminate background linkage disequilibrium. Given the mathematical relationship between δ and F ST , we recommend both panels of AIMs equally.
Developing marker panels for admixture mapping from existing genotype data derived from commercial high-density SNP chips offers two major advantages. (1) No de novo genotyping needs to be done, thereby saving costs. (2) Markers can be filtered for various quality measures and replacement markers (to minimize gaps) can be selected at no additional cost. For our African American sample, we took advantage of preexisting HapMap genotypes for the CEU and YRI samples, but appropriate parental populations may not have already been sampled for some admixed populations. We found that the map power for sparser panels of AIMs is higher than for denser panels of 21 k random markers. Historically, the number of AIMs in an admixture panel reflected the trade-off between maximizing genomic coverage and minimizing genotyping costs. Currently, custom genotyping a panel of ~2,000 AIMs is less expensive than chip-based genome-wide genotyping. However, chip-based genome-wide genotyping is currently less expensive than custom genotyping a panel of 20,000 random markers. Presumed parental populations are necessary to characterize AIMs. In contrast, parental populations are not needed to characterize random markers prior to estimating admixture proportions. Apart from needing many more random markers compared to AIMs, the major disadvantage of using a panel of random markers without parental populations or external reference samples is the inability to label clusters. Taken together, the ability to develop dense panels of markers from commercial chips provides a fresh opportunity to conduct admixture mapping for disease genes in admixed populations.