Selecting SNPs informative for African, American Indian and European Ancestry: application to the Family Investigation of Nephropathy and Diabetes (FIND)
- Robert C. Williams1,
- Robert C. Elston2,
- Pankaj Kumar1,
- William C. Knowler1,
- Hanna E. Abboud3,
- Sharon Adler4,
- Donald W. Bowden5,
- Jasmin Divers5,
- Barry I. Freedman5,
- Robert P. Igo Jr.2,
- Eli Ipp4,
- Sudha K. Iyengar2,
- Paul L. Kimmel6,
- Michael J. Klag7,
- Orly Kohn8,
- Carl D. Langefeld5,
- David J. Leehey9,
- Robert G. Nelson1,
- Susanne B. Nicholas10,
- Madeleine V. Pahl11,
- Rulan S. Parekh12,
- Jerome I. Rotter13,
- Jeffrey R. Schelling14,
- John R. Sedor14,
- Vallabh O. Shah15,
- Michael W. Smith16,
- Kent D. Taylor13,
- Farook Thameem3, 17,
- Denyse Thornley-Brown18,
- Cheryl A. Winkler19,
- Xiuqing Guo13,
- Phillip Zager15,
- Robert L. Hanson1 and
- the FIND Research Group
© Williams et al. 2016
Received: 18 December 2015
Accepted: 22 April 2016
Published: 4 May 2016
The presence of population structure in a sample may confound the search for important genetic loci associated with disease. Our four samples in the Family Investigation of Nephropathy and Diabetes (FIND), European Americans, Mexican Americans, African Americans, and American Indians, are part of a genome-wide association study in which population structure might be particularly important. We therefore decided to study in detail one component of this, individual genetic ancestry (IGA). From SNPs present on the Affymetrix 6.0 Human SNP array, we identified 3 sets of ancestry informative markers (AIMs), each maximized for the information in one of the three contrasts among ancestral populations: Europeans (HAPMAP, CEU), Africans (HAPMAP, YRI and LWK), and Native Americans (full heritage Pima Indians). We estimate IGA and present an algorithm for their standard errors, compare IGA to principal components, emphasize the importance of balancing information in the AIMs, and test the association of IGA with diabetic nephropathy in the combined sample.
A fixed parental allele maximum likelihood algorithm was applied to the FIND to estimate IGA in four samples: 869 American Indians; 1385 African Americans; 1451 Mexican Americans; and 826 European Americans. When the information in the AIMs is unbalanced, the estimates are incorrect with large error. Individual genetic admixture is highly correlated with principal components for capturing population structure. It takes ~700 SNPs to reduce the average standard error of individual admixture below 0.01. When the samples are combined, the resulting population structure creates associations between IGA and diabetic nephropathy.
The identified set of AIMs, which include American Indian parental allele frequencies, may be particularly useful for estimating genetic admixture in populations from the Americas. Failure to balance information in maximum likelihood, poly-ancestry models creates biased estimates of individual admixture with large error. This also occurs when estimating IGA using the Bayesian clustering method as implemented in the program STRUCTURE. Odds ratios for the associations of IGA with disease are consistent with what is known about the incidence and prevalence of diabetic nephropathy in these populations.
Keywords: Individual genetic ancestry, Population structure, SNP, Diabetic nephropathy
The Family Investigation of Nephropathy and Diabetes (FIND) is a multicenter study that is designed to find genes that contribute to the onset of diabetic nephropathy in four target, self-reported, heritage groups: European Americans, Mexican Americans, American Indians, and African Americans [1–4]. Two strategies were employed to ascertain the role of specific genes: a family-based linkage study and a case–control genome-wide association study (GWAS). In the GWAS each person in the four groups was typed for 1 M single nucleotide polymorphisms (SNPs) on a common platform, after which the genotype distributions in the cases and controls for each SNP were compared to identify risk alleles with genome-wide significance. A common practice in GWAS such as the FIND is to control for population stratification by adding principal components (PCs) or individual genetic ancestry (IGA) estimates as covariates to the statistical models.
While the assessment of IGA is potentially important for GWAS and for other genetic analyses, the evaluation of an American Indian heritage has been difficult because there has been little information on ancestry informative markers (AIMs) from a large sample of American Indians typed on a commercially available platform. The Pima Indians of the Gila River Indian Community in Arizona, who have a very high prevalence of type 2 diabetes, are one of the most intensively studied American Indian groups in the United States; genetic and heritage analyses have been performed in this native group for many years, involving research that includes GWAS with 100 K and 1 M SNP arrays [6–10]. Pima Indians also constituted a large proportion of the American Indian sample in the FIND. Therefore, data from the Pima Indian GWAS, conducted with the Affymetrix Genome-Wide Human 6.0 SNP array, were used to isolate informative markers for IGA in American Indians, which were then combined with 3 populations from HapMap to create a panel of AIMs.
Study participants and phenotypes
The criteria for diabetes, nephropathy, and the overall study design for the FIND have been previously described [1–4]. The FIND is a multi-ethnic family study of severe kidney disease, where the index case had diabetic nephropathy and at least one sibling reported a diagnosis of either diabetic nephropathy or long-standing diabetes without nephropathy. Samples from four different ethnic FIND groups were collected: African American, American Indian, European American, and Mexican American. For the discovery GWAS unrelated cases and controls were genotyped, yielding one individual per pedigree, except that in American Indians and Mexican Americans, because the total available sample was small, some family members were also genotyped. Patients with severe DN based upon diabetes duration > 5 years and urine albumin/creatinine ratio (UACR) ≥ 0.3 mg/g or with severe kidney disease (ESRD) were defined as cases. Controls had DM durations ≥ 9 years, UACR < 30 mg/g, and serum creatinine < 1.6 mg/dl (males) or < 1.4 mg/dl (females) without first-degree relatives having kidney disease. Additional cases and controls that were not part of the original FIND study were included to increase the statistical power.
A total of 5156 discovery DNA samples, plus 244 blind duplicates, was submitted to Affymetrix, Inc. (Santa Clara, CA) for genotyping. Genotypes were generated with the Affymetrix Genome-Wide Human 6.0 SNP array using the Affymetrix Commercial Service (Santa Clara, California), via a contract to Translational Genomics Research Institute (TGEN, Phoenix, AZ). Samples were submitted at a concentration of 100 ng/μl in Tris-EDTA buffer, then plated according to ethnic membership that included HapMap controls and blind duplicates on each plate. Samples were tested for DNA quality and quantity using PicoGreen prior to genotyping. All ethnic groups were genotyped with the Affymetrix 6.0 chip during the GWAS phase. Genotypes were called using the Birdseed version 2 algorithm implemented in the Genotyping Console software (Affymetrix). Alleles 1 and 2 for each SNP are assigned in the order that they are found in the HapMap data set.
was used, where p_ij is the frequency of allele 1 at the jth SNP in the ith ancestral population, and p̄_j is the overall average frequency of allele 1 at SNP j. In addition, F-statistics were calculated by the method of Weir and Cockerham to determine the utility of F_ST for balancing information in the contrasts.
Estimates of individual admixture were also calculated for the 4 parental samples with the STRUCTURE program to compare this Bayesian clustering method with the fixed parental allele algorithm and to determine whether either or both were vulnerable to the unbalanced information in the choice of human ancestry SNPs.
Principal components (PCs) were computed using SNPs that passed quality control and were not in genomic regions with extended linkage disequilibrium (LD). Specifically, markers in the following regions were excluded: chromosomes 5 (44–51.5 Mb), 6 (25–33.5 Mb), 8 (8–12 Mb), 11 (45–57 Mb), and 17 (40–43 Mb). The PC analysis was computed on the combined ethnic samples for the GWAS. The first two principal components were determined to account for a large proportion of the genetic variation in the multi-ethnic PC analysis and appropriately reduce the inflation factor in the ethnic-specific logistic regression models. Outlying individuals based on the first two PCs were excluded from the GWAS and, thus, are not included in the present analysis. A total of 33 individuals were omitted based on outlying PCA values.
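The region-based exclusion described above can be sketched as a simple position filter. This is a minimal illustration, not the pipeline used in the FIND: the function names and the example SNP records are hypothetical, the interval bounds are treated as inclusive by assumption, and the genome build for the coordinates is not stated in the text.

```python
# Extended-LD regions excluded before the PC analysis, as listed in the text:
# chromosome -> (start, end) in megabases. Inclusive bounds are an assumption.
EXCLUDED_REGIONS_MB = {
    5: (44.0, 51.5),
    6: (25.0, 33.5),
    8: (8.0, 12.0),
    11: (45.0, 57.0),
    17: (40.0, 43.0),
}


def in_excluded_region(chrom, pos_bp):
    """Return True if a SNP at base-pair position pos_bp lies in an excluded region."""
    region = EXCLUDED_REGIONS_MB.get(chrom)
    if region is None:
        return False
    start_mb, end_mb = region
    return start_mb <= pos_bp / 1e6 <= end_mb


def filter_snps(snps):
    """Keep (snp_id, chrom, pos_bp) records that fall outside every excluded region."""
    return [s for s in snps if not in_excluded_region(s[1], s[2])]


# Hypothetical SNP records, for illustration only.
example = [("snp_a", 6, 30_000_000), ("snp_b", 6, 40_000_000), ("snp_c", 1, 45_000_000)]
kept = filter_snps(example)
```

Here `snp_a` falls inside the chromosome 6 interval and is dropped, while the other two records are retained.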
Logistic regression was performed by standard methods with the disease, diabetic nephropathy, as the dependent variable and enrolment age, sex (women), enrolment center, and the respective heritage estimates as explanatory variables.
Table 1 Descriptive statistics for 1300 ancestry informative SNP loci
Column headings: maximized contrasts, δ ≥ 0.5; mean distance (bp)
Contrasts: |PEU-PAI|, N = 450; |PEU-PAF|, N = 450; |PAI-PAF|, N = 400
Table 2 Measures for balancing information (standard deviation) in the three information contrasts
Number of SNPs by contrast: N = 450; N = 450; N = 400
Measure (standard deviation) by contrast: 0.529 (0.022), N = 450; 0.528 (0.022), N = 450; 0.542 (0.027), N = 400
All SNPs: N = 1300
Table 3 Mean F_ST (standard deviation) in individual and combined contrasts
F_ST by contrast: 0.516 (0.023), N = 450; 0.381 (0.015), N = 450; 0.437 (0.035), N = 400
F_ST over all contrasts: N = 1300
Table 4 Mean (standard deviation) of source samples for AIMs typed with the 3 sets of informative markers
Source samples for AIMs: HapMap CEU, N = 165; HapMap LWK, N = 110; HapMap YRI, N = 193; Pima, N = 964
SNPs in estimates: |PEU-PAI|, N = 450; |PEU-PAF|, N = 450; |PAI-PAF|, N = 400; all SNPs, N = 1300
The above analysis was repeated for the 4 ancestral samples using the STRUCTURE Bayesian cluster method with 3 ancestral components, EU, AI, and AF (K = 3) and gave very similar results to those presented in Table 4 and Fig. 2 (Additional file 4: Table S1, Additional file 5: Figures S1–S4). In two instances for the Pima Indians, for maximized contrasts |PAI-PAF| and |PEU-PAI|, the Bayesian method did not return the expected value of AI even when the information in the contrast was maximized for this ancestral component. When the 1300 SNPs with balanced information were incorporated into the STRUCTURE program, it returned the expected mean values and proportions of ancestry in the four ancestral samples (Additional file 5: Figures S1 and S5).
Table 5 Mean (standard deviation) and range for individual heritage and standard error estimates for FIND populations
An alternate method for controlling for population structure in GWAS is to calculate the PCs from the samples. To compare the PC and heritage estimates, a Pearson correlation coefficient was calculated for the 3 admixture components and the first 2 PCs for the combined sample (N = 4391). The EU heritage component was highly correlated with PC2 [0.9954 (95 % C.I. 0.9951, 0.9957)], while the AF heritage had a more modest correlation [0.9111 (0.9060, 0.9160)] with PC1. American Indian heritage was negatively correlated with both PC1 and PC2 [−0.8600 (−0.8675, −0.8521) and −0.5067 (−0.5283, −0.4843)]. When the three admixture components were each used as a dependent variable in a linear regression with PC1 and PC2 as explanatory variables, the R-square values were close to 1.0: EU (0.995), AI (0.996), and AF (0.996).
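The correlations and intervals quoted above can be illustrated with a short sketch. The text does not say how the 95 % confidence intervals were computed; the Fisher z-transformation used here is an assumption, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95 % confidence interval for r via the Fisher z-transformation."""
    z = math.atanh(r)               # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Values quoted in the text for the EU component and PC2.
lo, hi = fisher_ci(0.9954, 4391)
```

With r = 0.9954 and N = 4391 this gives an interval of roughly (0.9951, 0.9957), in line with the reported one, although that agreement does not establish which method the authors used.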
To assess the potential role of ancestry in confounding associations with diabetic nephropathy, each of the three heritage estimates was first tested singly for association, with the covariates, for each of the 4 FIND populations (Additional file 4: Tables S2, S3 and S4). While the variables enrolment age, sex, and enrolment center were consistently associated with the disease, there was no significant association with any individual heritage variable when tested within each sample. However, combining the samples (N = 4126) introduced population structure, and each heritage variable had a significant odds ratio when tested singly in the model: EU odds ratio 0.338, p < 0.0001; AI 1.960, p = 0.028; and AF 2.519, p < 0.0001.
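Table 6 Tests for the association of heritage with diabetic nephropathy in the combined FIND populations, N = 4126
Odds ratio (95 % confidence interval), by reference heritage:
AF as reference: EU 0.311 (0.232, 0.418); AI 1.031 (0.547, 1.944)
AI as reference: EU 0.269 (0.143, 0.507); AF 0.748 (0.381, 1.468)
EU as reference: AI 3.762 (1.958, 7.228); AF 2.956 (2.212, 3.947)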
A panel of SNPs informative for African, American Indian and European ancestry
A panel of 1300 SNPs was developed which can serve as informative markers for African, American Indian and European ancestry; these ancestry components are often of interest in genetic epidemiologic studies of populations from the Americas. Although other similar marker panels have been developed, the samples used as the American Indian ancestral group were few and potentially admixed. The present panel was developed using a large sample of American Indians; although the samples derived from a single tribe, the Pima Indians of Arizona, there is minimal European admixture in this population. The SNPs are useful for estimation of global ancestry across the genome. Estimation of local ancestry at specific genomic regions requires more dense genotypic data, which may not be available to all investigators. Although local ancestry estimates can be useful for mapping studies, using them as covariates can result in over-adjustment, whereas adjustment for global ancestry is more useful to reduce confounding in GWAS [19–21]. Further, the association of global ancestry with disease risk may be of interest in itself in some genetic epidemiologic applications. Thus, the present set of SNPs, or a subset of them, may be useful for genetic epidemiologic studies. If a subset of the markers is chosen, it is important to balance the information regarding the contrasts among ancestral populations.
An information contrast will only return reliable estimates for two ancestral populations
When a model for individual ancestry estimates has only two ancestral populations in the AIMs set, then the balance of the model is not in question because there is only one allele frequency contrast for each AIM, |P1-P2|. However, when a poly-ancestry model (>2) is created, then all allele frequency difference contrasts must be considered. For a model with 3 ancestral populations (Fig. 1) the contrasts |P1-P2|, |P1-P3|, and |P2-P3| must be integrated into the estimates. But, as we have shown, each contrast is still only reliably informative for the two ancestral populations in it.
We chose to demonstrate this with the extreme case by choosing 3 sets of AIMs that were each maximized for information in only one contrast (δ ≥ 0.5 in the chosen contrast and δ < 0.3 in the other two) and then using each set to estimate all of the ancestral components (Table 4). When the ancestry is for one of the two ancestral populations in the contrast, then the maximum likelihood model is balanced and provides accurate estimates. When one tries to estimate an ancestral component for which the markers contained in the maximized contrast set are not informative, then the model is unbalanced and the estimates are not correct. Also, the error in the unbalanced design appears to be random and distributed equally in the two ancestral components that are not part of the ancestral sample. In addition, if no standard error of the estimate is computed, there are no internal signals that would indicate that the estimates are incorrect. When the information is unbalanced, the internal signal for incorrect estimates is a large standard error. Even with the unbalanced design, the computer algorithm maximizes the likelihood and provides 3 estimates of ancestry for each person. This fact highlights the need to validate each set of SNPs that is incorporated into a maximum likelihood model for ancestry by testing them with the individuals in the ancestral populations from which the AIMs were chosen: the expected mean value should be 1.0 for the respective ancestral component.
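The selection rule for the single-contrast sets (δ ≥ 0.5 in the chosen contrast and δ < 0.3 in the other two) can be sketched as follows. This is only an illustration of the thresholds stated above, not the authors' selection code; the function names and the example allele frequencies are hypothetical.

```python
def deltas(p_eu, p_ai, p_af):
    """Absolute allele frequency differences for the three pairwise contrasts."""
    return {
        "EU-AI": abs(p_eu - p_ai),
        "EU-AF": abs(p_eu - p_af),
        "AI-AF": abs(p_ai - p_af),
    }


def single_contrast(p_eu, p_ai, p_af, high=0.5, low=0.3):
    """Return the one contrast a SNP is maximized for, or None if it fails the rule.

    A SNP qualifies when delta >= high in exactly the chosen contrast and
    delta < low in both of the other contrasts.
    """
    d = deltas(p_eu, p_ai, p_af)
    for name, value in d.items():
        others = [v for key, v in d.items() if key != name]
        if value >= high and all(v < low for v in others):
            return name
    return None


# Hypothetical allele 1 frequencies (EU, AI, AF), for illustration only.
label_a = single_contrast(0.10, 0.65, 0.375)  # delta: 0.55, 0.275, 0.275
label_b = single_contrast(0.10, 0.90, 0.50)   # informative for more than one contrast
```

One consequence of these thresholds follows from the triangle inequality: because the chosen δ cannot exceed the sum of the other two, a qualifying SNP must have δ between 0.5 and 0.6 in its chosen contrast, with the third population's frequency lying between the other two.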
Accurate ancestry estimates require careful balancing of information between contrasts
To ensure the accuracy of the ancestry estimates, the information in the 3 contrasts of the 3-ancestry model must be balanced (Table 2). There are many possible approaches to this problem. The key is to balance the information over all SNPs for the three contrasts, whether a single AIM is informative for one or two contrasts. This becomes more difficult with two-contrast informative SNPs because, when trying to balance the model, each addition or subtraction affects two information statistics. One strategy, shown in the present work, is to choose three sets of single contrast informative SNPs. A second approach is to choose a set of double informative SNPs, such as ones with |P1-P2| and |P1-P3| informative, and balance these with single informative |P2-P3| loci.
Using a Bayesian clustering method with K = 3 does not obviate the need for balanced information in the ancestry markers
Repeating the individual admixture estimates using the STRUCTURE program (K = 3) gave similar results to the fixed parental allele frequency algorithm but showed, in addition, that it was even more sensitive to imbalances in information. It did not return the expected mean value of AI for Pima Indians even when the contrast, |PAI-PAF| or |PEU-PAI|, was maximized for this component (Additional file 4: Table S1); whereas the fixed parental allele algorithm always returned the correct mean expected values for the components maximized in the contrast when all 3 components were being simultaneously estimated (Table 4, Fig. 2). When the Bayesian cluster algorithm was used with all 1300 SNPs with balanced information, it returned the appropriate mean expected values for all ancestry samples. This further illustrates the need for careful balancing of the ancestral information when selecting markers, irrespective of whether the algorithm uses a classical method such as maximum likelihood or a more recent method such as STRUCTURE.
Previous studies have shown that, given sufficient information, maximum likelihood methods, Bayesian methods such as STRUCTURE and hybrid methods produce similar admixture estimates [22, 23]. For optimal ancestry estimates, all methods require information on allele frequencies in the ancestral populations, either by taking them as known quantities, as in the classic maximum likelihood method used here, or by inclusion of genotypes from representative ancestral reference groups as in STRUCTURE [22, 23]. Raw genotypic data from a suitable American Indian reference ancestry population may not be readily available, however, and in the absence of these data there was a modest overestimation of the Amerindian component in the FIND Mexican Americans when STRUCTURE was used (Fig. 5, Panel b). In the absence of genotypic data from an American Indian reference ancestry group, the maximum likelihood method with specified ancestral allele frequencies is preferable (Fig. 5, Panel c). Given genotypes on some of the AIMs, this method can be readily implemented with the allele frequencies provided in supplementary tables of American Indian (Pima) SNP allele frequencies used in the present study.
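A maximum likelihood estimate with fixed parental allele frequencies can be sketched with a short expectation-maximization (EM) loop. This is an illustrative implementation under stated assumptions, not the algorithm used in the FIND: it assumes unlinked SNPs and Hardy-Weinberg proportions, so that the probability of allele 1 at a SNP is the admixture-weighted average of the parental frequencies. The function name, the clamping constant, and the toy frequencies are hypothetical.

```python
def estimate_admixture(genotypes, freqs, n_iter=1000, tol=1e-10, eps=1e-6):
    """EM estimate of admixture proportions with fixed parental allele frequencies.

    genotypes: counts (0, 1 or 2) of allele 1 at each SNP for one person.
    freqs: per-SNP lists of allele 1 frequencies, one per ancestral population.
    """
    k = len(freqs[0])
    # Clamp frequencies away from 0 and 1 so no genotype has zero probability.
    freqs = [[min(max(p, eps), 1.0 - eps) for p in row] for row in freqs]
    m = [1.0 / k] * k  # start from equal proportions
    n_alleles = 2 * len(genotypes)
    for _ in range(n_iter):
        acc = [0.0] * k
        for g, p in zip(genotypes, freqs):
            q1 = sum(mi * pi for mi, pi in zip(m, p))  # P(allele 1 | m)
            q2 = 1.0 - q1                               # P(allele 2 | m)
            for i in range(k):
                # Expected number of this person's allele copies drawn from population i.
                acc[i] += g * m[i] * p[i] / q1 + (2 - g) * m[i] * (1.0 - p[i]) / q2
        new_m = [a / n_alleles for a in acc]
        converged = max(abs(a - b) for a, b in zip(new_m, m)) < tol
        m = new_m
        if converged:
            break
    return m


# Toy panel: 20 SNPs informative for each population (order: EU, AI, AF).
FREQS = [[0.9, 0.1, 0.1]] * 20 + [[0.1, 0.9, 0.1]] * 20 + [[0.1, 0.1, 0.9]] * 20
unadmixed = estimate_admixture([2] * 20 + [0] * 40, FREQS)  # carries only "EU" alleles
two_way = estimate_admixture([2] * 40 + [0] * 20, FREQS)    # "EU" and "AI" alleles
```

As suggested in the text, running such an estimator on individuals from the ancestral samples themselves is a useful check: the expected estimate for the corresponding component is close to 1.0.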
Balancing information in contrasts minimizes the error in replicate tests
A second set of 975 AIMs (Additional file 6) was chosen to investigate the error when individual heritage is estimated in the same person with two balanced sets of SNPs. It was also applied to the four ancestral populations in the present study and the distribution of the heritage differences was examined. For the HapMap CEU sample the mean difference for EU heritage was −0.003 with a median and mode value of 0.000 with the distribution of the differences being relatively symmetrical on either side of the mean (Additional file 4: Table S5). Very similar results were obtained for the distributions of AF heritage in the HapMap LWK and YRI samples and for AI heritage in the Pima. Therefore, balancing information in the contrasts of the AIMs creates “correct” estimates of individual heritage by minimizing error inherent in the algorithm and the vector of AIMs, and emphasizes the importance of including the standard error or 95 % confidence intervals with any point estimate of individual genetic heritage.
The FIND samples
The distribution of mean IGA in the FIND samples represents the creation of new American populations from immigrants from historically separated parental groups. African Americans in the FIND have 83.3 % of their genome derived from Africa and about 15.1 % from Europe, while there is only a small component from American Indians (Table 5). The genetic composition of African American populations can vary greatly by geographical location, whether urban or rural, north or south. Parra et al. estimated EU by weighted least squares (WLS) in 10 urban African American samples and reported proportions from 0.116 (Charleston, S.C.) to 0.225 (New Orleans). An isolated population, the Gullah Sea Islanders off the coast of South Carolina, had an EU contribution of only 0.035. A more recent estimate of IGA in 228 African Americans recruited by the University of Connecticut Health Center reported: EU, 0.17; AF, 0.75; and AI, 0.08. Therefore, the proportion of EU-derived genes in the FIND AA sample accords well with reports for urban African Americans in the United States.
Persons who self-identify as Mexican Americans in the southwest United States have reported admixture that is consistent from California to Texas. Long et al., in 730 unrelated persons from paternity tests in Arizona, reported WLS proportions EU 0.68, AI 0.29, and AF 0.03 and that these proportions are within one standard error of the mean from proportions reported from San Antonio, Texas, and Los Angeles, California. The Arizona sample was later enlarged to 2249 persons with revised WLS proportions EU 0.616, AI 0.314, and AF 0.071 and correspondingly smaller standard errors. Additional Mexican American admixture proportions (EU, AI and AF, respectively) have been reported from the San Antonio Diabetes Study (0.502, 0.464, 0.031) and the San Antonio center for Biomarkers of Risk of Prostate Cancer (0.589, 0.382, 0.029). In two case–control studies of breast cancer in Latinas in the San Francisco Bay area, genetic admixture was measured; Fejerman et al. reported proportions EU 0.53, AI 0.40, AF 0.07 in 597 controls and 0.58, 0.35, 0.07 in 440 cases in women born in the U.S.; Ziv et al. stratified their sample by 175 women born in Mexico, EU 0.520, AI 0.443, AF 0.037, and 100 persons born in the U.S. whose grandparents were Mexican-born, 0.473, 0.478, 0.048. The FIND Mexican American proportions (Table 5) fit well within these and other data reported in the literature, that the European American component is the largest in the range of 0.45–0.65 followed by a smaller American Indian component and 0.03–0.07 African admixture. As the sample size increases, and the number of American Indian informative SNPs becomes larger in the estimate, the fraction of European admixture appears to decrease while that of American Indians increases.
While variation across studies appears to be the norm, the variation within the FIND Mexican American sample is relatively consistent when stratified by sex and enrolment center. The 554 males (EU 0.482, AI 0.446, AF 0.072) and the 846 females (EU 0.467, AI 0.456, AF 0.077) are well within one standard deviation for all three proportions. When the 4 enrolment centers that have sample sizes greater than 25 are considered (center 2, N = 634; 3, 114; 4, 308; and 5, 318), the range of proportions is small: EU 0.456–0.486, AI 0.443–0.482, and AF 0.071–0.076. Centers 2, 3, and 5 are in California, while center 4 is in Texas. Therefore, the FIND Mexican Americans, when IGA is estimated with the 1300 informative markers, exhibit a relatively uniform distribution of admixture across a large geographical area.
In contrast with FIND African American and Mexican American samples, the European American and American Indian samples exhibit small amounts of genetic admixture (Table 5). Persons who self-identify as of European heritage have only 1.5 % AI and 2.6 % AF mean heritage. Full Heritage Pima Indians make up a large proportion of the 869 American Indians who were recruited for the FIND; the amount and origin of their genetic admixture has been reported [10, 13, 31]. Pima Indians lie on the western end of a cline of European admixture that has its highest values in the northeastern United States, falls into intermediate levels in the Midwestern states, and reaches its lowest level in the desert southwest. This cline generally comports with the settlement of the country by persons of European origin from east to west. European IGA in the Pima Indians can be traced primarily to their genetic and cultural relations with the people of Mexico since the Spanish first entered the new world. The IGA estimates derived by the present method, and most other commonly used methods, assume Hardy-Weinberg equilibrium, and this assumption may not hold in some situations, such as a case–control study when markers are associated with disease; however, simulation studies have shown that admixture estimates are generally robust to deviations from Hardy-Weinberg equilibrium.
Standard error of the estimate
An advantage of the maximum likelihood method for individual ancestry estimation is the ability to calculate the information matrix and invert it for estimates of the variances, because point estimates of population parameters have little meaning without a measure of error accompanying them. Figure 4 illustrates that the standard error of individual ancestry has its largest improvement, decrease, within the first 100 informative SNPs in the estimates. After this there is steady improvement in the precision of the numbers, though the average effect of each additional AIM becomes progressively less. However, increasing the number of SNPs can have a significant effect on the confidence intervals of the individual heritage estimates. Gaining this additional precision could be important when the magnitude of estimated ancestry is small. At approximately 700 SNPs the mean standard errors are below 0.01, while with 1300 AIMs in the estimate the average standard error is in the range of 0.006–0.008.
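The step of forming and inverting the information matrix can be sketched for a three-ancestry model. This is a minimal illustration, assuming the same binomial likelihood under Hardy-Weinberg proportions as in a fixed parental allele model, with the third proportion written as one minus the other two; it is not the authors' algorithm, and the function name and toy inputs are hypothetical. The approximation is only meaningful when evaluated at an interior maximum likelihood estimate, not at a boundary value of 0 or 1.

```python
import math


def admixture_standard_errors(genotypes, freqs, m):
    """Standard errors of a 3-way admixture estimate from the observed information.

    genotypes: counts (0, 1 or 2) of allele 1 per SNP.
    freqs: per-SNP [p1, p2, p3] parental allele 1 frequencies.
    m: admixture proportions [m1, m2, m3], with m3 = 1 - m1 - m2.
    """
    i11 = i12 = i22 = 0.0
    for g, p in zip(genotypes, freqs):
        q = sum(mi * pi for mi, pi in zip(m, p))  # P(allele 1 | m)
        d1 = p[0] - p[2]                          # dq/dm1 with m3 eliminated
        d2 = p[1] - p[2]                          # dq/dm2 with m3 eliminated
        # Negative second derivative of g*ln(q) + (2 - g)*ln(1 - q) with respect to q.
        w = g / q ** 2 + (2 - g) / (1.0 - q) ** 2
        i11 += w * d1 * d1
        i12 += w * d1 * d2
        i22 += w * d2 * d2
    det = i11 * i22 - i12 ** 2
    # Invert the 2x2 information matrix to obtain variances and the covariance.
    v11, v22, v12 = i22 / det, i11 / det, -i12 / det
    v33 = v11 + v22 + 2.0 * v12                   # variance of m3 = 1 - m1 - m2
    return [math.sqrt(v11), math.sqrt(v22), math.sqrt(v33)]


# Toy inputs, for illustration only.
FREQS = [[0.9, 0.1, 0.1]] * 20 + [[0.1, 0.9, 0.1]] * 20 + [[0.1, 0.1, 0.9]] * 20
GENOS = [1] * 60
M_HAT = [0.4, 0.4, 0.2]
se_60 = admixture_standard_errors(GENOS, FREQS, M_HAT)
se_120 = admixture_standard_errors(GENOS * 2, FREQS * 2, M_HAT)
```

In this sketch, doubling the number of equally informative SNPs doubles the information and shrinks each standard error by a factor of √2, which is consistent with the diminishing per-SNP gains described above.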
Maximum likelihood ancestry estimates versus principal components for measuring population structure
In the FIND samples, the principal components derived from the GWAS SNPs and the ancestry estimates derived from the AIMs capture largely the same information, but, as they represent somewhat different functions of the data, the interpretation of the variables may differ. The relative advantages of PCs over heritage estimates in accounting for population structure in association studies are their ease of calculation and that they do not require an a priori specification of ancestral populations. However, their primary disadvantage is the ambiguity of their biological meaning. Maximum likelihood individual ancestry estimates, with standard errors, have the advantage of a clear biological meaning. Each proportion represents the fraction of alleles in the individual’s genome from an historical ancestral population. The disadvantage of the maximum likelihood method as currently implemented is the need for a large set of parental frequencies that are unlinked, balanced in their information, and with low replicate error rates in the SNP genotyping. The computational burden of maximum likelihood is also higher than for PCs. If these conditions can be met, however, heritage estimates can have great utility for tests of admixture equilibrium, monitoring information, and computing odds ratios as a function of individual heritage, as well as being used as covariates in tests of association in GWAS.
Population structure from combining samples leads to the association of ancestry and diabetic nephropathy
Tests of the association of diabetic nephropathy and IGA were computed separately for each of the 4 FIND samples in a logistic regression with enrolment age, sex, and enrolment center as covariates (Additional file 4: Tables S2, S3 and S4); no IGA component had a statistically significant association with disease in the individual samples. When the three tests were performed in the combined sample, all IGA components were associated with diabetic nephropathy. To further parse the associations, a second set of logistic regressions was performed on the combined sample while assessing two heritage components at a time and using the third heritage as a reference with sex, enrolment age, and enrolment center again as covariates (Table 6). With AF heritage as reference, persons of European heritage are protected from the disease, while persons with AI heritage do not have an odds ratio statistically different from 1.0, which suggests that their odds ratio is similar to those with African heritage. A symmetrical result occurs when AI heritage is the reference; EU heritage is again protective while the odds ratio for AF is not statistically different from 1.0. This is confirmed further by the model that tests AI and AF heritage with EU as reference, in which the odds ratios for both AI heritage and AF heritage are significantly greater than 1.0 while their 95 % confidence intervals overlap. While these estimates cannot necessarily be interpreted as reflective of population risk because of the way that patients are recruited in FIND, the odds ratios resulting from the population structure of the combined sample do generally reflect what is known about the relative occurrence and risk of diabetic nephropathy in the 4 heritage groups.
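How combining samples can create an association that is absent within each sample can be shown with a small worked example. The counts below are entirely hypothetical and are not FIND data, and ancestry is reduced to a binary "high versus low" exposure for simplicity, whereas the analyses above use continuous heritage estimates in a logistic regression with covariates.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: exposed cases, b: exposed controls, c: unexposed cases, d: unexposed controls.
    """
    return (a * d) / (b * c)


# Hypothetical counts (exposed cases, exposed controls, unexposed cases, unexposed controls).
# Sample 1: mostly high-ancestry individuals, 80 % cases regardless of exposure.
sample_1 = (80, 20, 8, 2)
# Sample 2: mostly low-ancestry individuals, 20 % cases regardless of exposure.
sample_2 = (2, 8, 20, 80)

or_1 = odds_ratio(*sample_1)        # no association within sample 1
or_2 = odds_ratio(*sample_2)        # no association within sample 2

# Pooling the samples mixes their different ancestry and case proportions.
pooled = tuple(x + y for x, y in zip(sample_1, sample_2))
or_pooled = odds_ratio(*pooled)     # association appears only after pooling
```

Within each hypothetical sample the odds ratio is 1.0, yet the pooled table gives an odds ratio well above 1 because ancestry and case frequency both differ between the samples. This is the same mechanism by which heritage becomes associated with disease only in the combined FIND sample.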
Failure to balance AIM information in poly-ancestry models creates biased estimates of individual admixture with large error. This occurs whether one employs the fixed parental allele algorithm for estimating IGA or the Bayesian clustering method as implemented in the program STRUCTURE. It is very important to describe the information contrasts explicitly and then emphasize the attention to them that is needed to compute correct estimates with low error because many researchers who are not trained in the details of the algorithms are downloading code, choosing sets of AIMs, and applying these to their analysis of population structure.
A set of ancestry informative markers is provided for estimating American Indian ancestry that reflects an ancestral tribe from the Paleo-Indian migration across the Bering Strait, the Pima Indians, who are the most completely characterized Indian group in North America. These AIMs will be particularly useful for estimating genetic admixture in populations from the Americas.
A statistic with no measure of error has very limited meaning and utility. Our method provides the researcher with a tool to construct 95 % confidence intervals for IGA and to gauge how many SNPs are necessary to achieve a desired mean error in the sample.
We parse population structure by estimating both IGA and PCs and show that the two methods are highly correlated and useful for adjusting for structure in association studies, and suggest that IGA has the further advantage of being a number that is more easily understood in the context of the sample than are PCs.
We test the association of IGA with diabetic nephropathy in the FIND in both the individual and combined samples and demonstrate how combining samples to increase power in a genome wide association study can create associations between ancestry and the disease. We then find that the odds ratios for the associations of IGA with disease in the combined sample are consistent with what is known about the incidence and prevalence of diabetic nephropathy in these populations. Therefore we exploit population structure to provide us with useful information about the relative occurrence of the disease among the groups.
All FIND phenotype and genotype files, except those for the American Indian subjects, are available from the dbGaP database (accession number phs000333.v1.p1). Data for the American Indian subjects are not publicly available for privacy reasons. Interested researchers who meet the criteria for access to the data can contact: Robert Hanson (firstname.lastname@example.org) or Clifton Bogardus (email@example.com).
Ethics approval and consent to participate
The FIND was completed in accordance with the principles of the Declaration of Helsinki. Written informed consent was obtained from all participants. The Institutional Review Board at each participating center (Case Western Reserve, Cleveland, Ohio; Harbor-University of California Los Angeles Medical Center; Johns Hopkins University, Baltimore; National Institute of Diabetes and Digestive and Kidney Diseases; University of California, Los Angeles, CA; University of New Mexico, Albuquerque, NM; University of Texas Health Science Center at San Antonio, San Antonio, TX; Wake Forest School of Medicine, Winston-Salem, NC) approved all procedures. A certificate of confidentiality was filed at the National Institutes of Health.
Consent for publication
Publication of the results of the analyses was part of the informed consent. No individual-level clinical data were published.
A list of the members of the FIND Research Group follows (key: *Principal Investigator; **Co-investigator; #Program Coordinator; §University of California, Davis; †University of California, Irvine; ‡Study Chair). Genetic Analysis and Data Coordinating Center, Case Western Reserve University: *S.K. Iyengar, **R.C. Elston, **K.A.B. Goddard, **J.M. Olson, S. Ialacci, #J. Fondran, A. Horvath, R. Igo Jr, G. Jun, K. Kramp, J. Molineros, S.R.E. Quade; Case Western Reserve University: *J.R. Sedor, **J. Schelling, #A. Pickens, L. Humbert, L. Getz-Fradley; Harbor-University of California Los Angeles Medical Center: *S. Adler, **E. Ipp, **†M. Pahl, **§M.F. Seldin, **S. Snyder, **J. Tayek, #E. Hernandez, #J. LaPage, C. Garcia, J. Gonzalez, M. Aguilar; Johns Hopkins University: *M. Klag, *R. Parekh, **L. Kao, **L. Meoni, T. Whitehead, #J. Chester; NIDDK, Phoenix, AZ: *W.C. Knowler, **R.L. Hanson, **R.G. Nelson, **J. Wolford, #L. Jones, R. Juan, R. Lovelace, C. Luethe, L.M. Phillips, J. Sewemaenewa, I. Sili, B. Waseta; University of California, Los Angeles: *M.F. Saad, *S.B. Nicholas, **Y.-D.I. Chen, **X. Guo, **J. Rotter, **K. Taylor, M. Budgett, #F. Hariri; University of New Mexico, Albuquerque: *P. Zager, *V. Shah, **M. Scavini, #A. Bobelu; University of Texas Health Science Center at San Antonio: *H. Abboud, **N. Arar, **R. Duggirala, **B.S. Kasinath, **F. Thameem, **M. Stern; Wake Forest University: *‡B.I. Freedman, **D.W. Bowden, **C.D. Langefeld, **S.C. Satko, **S.S. Rich, #S. Warren, S. Viverette, G. Brooks, R. Young, M. Spainhour; Laboratory of Genomic Diversity, National Cancer Institute, Frederick, MD: *C. Winkler, **M.W. Smith, M. Thompson, #R. Hanson, B. Kessing; Minority Recruitment Centers: Loyola University: *D.J. Leehey, #G. Barone; University of Alabama at Birmingham: *D. Thornley-Brown, #C. Jefferson; University of Chicago: *O.F. Kohn, #C.S. Brown; NIDDK program office: J.P. Briggs, P.L. Kimmel, R. Rasooly; External Advisory Committee: D. Warnock (chair), L. Cardon, R. Chakraborty, G.M. Dunston, T. Hostetter, S.J. O’Brien (ad hoc), J. Rioux, R. Spielman.
We acknowledge the contributions of the Wake Forest participants and coordinators Joyce Byers, Carrie Smith, Mitzie Spainhour, Cassandra Bethea, and Sharon Warren and the contributions of FIND participants and physicians and CHOICE patients, staff, laboratory, and physicians at Dialysis Clinic Inc. and Johns Hopkins University.
This study was supported in part by National Institutes of Health (NIH) grants R01 DK 070941 and R01 DK 084149 (Dr. Freedman) and R01 DK53591 (Dr. Bowden). Dr. Bostrom was supported by F32 DK080617 from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). Computing resources were provided by the Wake Forest University Health Sciences Center for Public Health Genomics. This study was also supported by FIND grants U01DK57292, U01DK57329, U01DK057300, U01DK057298, U01DK057249, U01DK57295, U01DK070657, U01DK057303, U01DK070657, U01DK57304 and CHOICE study DK07024 from the NIDDK and in part by the Intramural Research Program of the NIDDK. This project has been funded in whole or in part with federal funds from the NIH National Cancer Institute (NCI) under contract HHSN26120080001E and the Intramural Research Program of the NIH-NCI Center for Cancer Research. This work also was supported by the National Center for Research Resources for the General Clinical Research Center grants: Case Western Reserve University, M01-RR-000080; Wake Forest University, M01-RR-07122; Harbor–University of California, Los Angeles Medical Center, M01-RR-00425; College of Medicine, University of California, Irvine, M01-RR-00827-29; University of New Mexico, HSC M01-RR-00997; and Frederic C. Bartter, M01-RR-01346. The CHOICE Study was supported in part by HS08365 from the Agency for Healthcare Research and Quality, Rockville, MD, and HL62985 from the National Heart, Lung, and Blood Institute, Bethesda, MD. Genotyping was performed by the Center for Inherited Disease Research, which is fully funded through a federal contract from the NIH to Johns Hopkins University (N01-HG-65403).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.