Five measures of ancestry informativeness
Notation
Consider populations $i = 1, 2, \ldots, K$ with $K \ge 2$ and a locus with $N \ge 2$ alleles. Let $p_{ij}$ denote the frequency of allele $j$, $j = 1, 2, \ldots, N$, in population $i$. Let $p_j$ denote the average frequency of allele $j$ over the $K$ populations, i.e., $p_j = \frac{1}{K}\sum_{i=1}^{K} p_{ij}$. Consider an admixed population with two parental populations. The frequency of allele $j$ at a locus in the admixed population, $p_{Aj}$, is a linear combination of the allele frequencies in the ancestral populations and can be written as $p_{Aj} = m_1 p_{1j} + m_2 p_{2j}$, where $m_i$ is the proportional contribution of the $i$th ancestral population and $m_1 + m_2 = 1$.
Absolute allele frequency difference (delta, δ)
Delta is the most commonly used measure of SNP marker informativeness for ancestry between two parental populations. Delta is defined as the absolute difference in the frequencies of a particular allele observed in two ancestral populations. For a biallelic locus, if allele one is taken as the reference allele, then
$$\delta = |p_{11} - p_{21}|.$$
A marker with δ = 1 provides perfect information regarding ancestry whereas a marker with δ = 0 carries no information. It has been shown that δ by itself only provides limited information regarding a marker's informativeness for ancestry [10]. The sum of the allele frequencies in the two parental populations, or equivalently the value of the smaller of the two frequencies, can provide additional information independent of δ.
F statistics (FST)
FST is the proportion of the total genetic variance (the T subscript) [39] contained in a subpopulation (the S subscript). When only two parental populations and markers with only two alleles are considered, the informativeness for ancestry includes both the difference and the sum of the reference allele frequencies in the two parental populations. In other words,
$$F_{ST} = \frac{(p_{1j} - p_{2j})^2}{(p_{1j} + p_{2j})\,[2 - (p_{1j} + p_{2j})]}.$$
Here, j = 1 or 2 is the reference allele. Values of FST can range from 0 to 1. A high FST value implies a considerable degree of differentiation between populations. FST is a pair-wise population measure of differentiation or relatedness (genetic distance measure between the two populations) based on genetic polymorphism data such as SNPs and was recently utilized as a criterion for selecting markers for ancestry estimation [40].
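As a minimal sketch of the first two measures, the snippet below computes δ and a two-population, biallelic FST from reference allele frequencies. The FST expression used, δ² / [(p1 + p2)(2 − (p1 + p2))], is the usual equal-weight two-population form and is an assumption here; the function names are illustrative only.

```python
def delta(p1, p2):
    """Absolute difference in reference allele frequency between two populations."""
    return abs(p1 - p2)


def fst_two_pop(p1, p2):
    """Two-population, biallelic F_ST with equal population weights (assumed form)."""
    s = p1 + p2
    if s == 0.0 or s == 2.0:
        # The allele is absent or fixed in both populations: no differentiation.
        return 0.0
    return (p1 - p2) ** 2 / (s * (2.0 - s))


# Two hypothetical markers with the same delta but different frequency sums.
for p1, p2 in [(0.7, 0.3), (0.4, 0.0)]:
    print(p1, p2, delta(p1, p2), fst_two_pop(p1, p2))
```

Both hypothetical markers have δ = 0.4, yet the second, whose frequencies lie closer to fixation, receives the larger FST. This illustrates why the sum of the frequencies carries information beyond δ.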
Fisher Information Content (FIC)
Pfaff [41] showed how FIC can be used to determine the informativeness of a specific marker. The determinant of the Fisher information matrix provides a measure of the amount of information contained in the data. The genetic contributions of the ancestral populations can be estimated by the maximum likelihood method from a sample of genotypes from the admixed population [42, 43]. For a biallelic locus of an admixed population from two parental populations,
Here, $\hat{p}_{Aj}$ is the expected frequency of the $j$th allele in the admixed population or individual, $\delta_j = p_{1j} - p_{2j}$ is the allele frequency difference, and $\hat{m}_1$ is the maximum likelihood estimate of the contribution from ancestral population one. The FIC measure allows selection of markers that are particularly informative in an admixed population in which the contribution of one parental population is substantially greater than that of the other. It favors markers that are closer to fixation in the parental population with the greater contribution.
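The behaviour described above can be illustrated with a small sketch. It assumes that alleles are sampled independently from the admixed population, so that the Fisher information about the contribution m1 carried by one sampled allele is δ² / [pA(1 − pA)], with pA = m1·p1 + (1 − m1)·p2. This per-allele form is an assumption for illustration and may differ from the FIC expression used in the study by a constant factor; the function name is hypothetical.

```python
def fisher_information_per_allele(p1, p2, m1):
    """Fisher information about m1 from one sampled allele (assumed binomial model)."""
    if not 0.0 < m1 < 1.0:
        raise ValueError("m1 must lie strictly between 0 and 1")
    p_admixed = m1 * p1 + (1.0 - m1) * p2
    d = p1 - p2
    if p_admixed <= 0.0 or p_admixed >= 1.0:
        # Only possible when p1 == p2 (both 0 or both 1), so the marker is uninformative.
        return 0.0
    return d * d / (p_admixed * (1.0 - p_admixed))


# Same delta (0.4), population one contributing 80%.
fixed_in_major = fisher_information_per_allele(1.0, 0.6, 0.8)
fixed_in_minor = fisher_information_per_allele(0.6, 1.0, 0.8)
print(fixed_in_major, fixed_in_minor)
```

With the same δ, the marker fixed in the population contributing 80% scores higher than the marker fixed in the other population, which is the preference the text describes.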
Shannon Information Content (SIC)
Rosenberg et al. (2003) [10] used the concept of entropy to develop a measure of marker informativeness. Entropy is a measure of the uncertainty associated with a random variable and quantifies the expected information content contained in the data. If the sampled population is an admixture of two parental populations (with ancestral proportion m1 and m2), the SIC for a biallelic locus can be written as:
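One way to make an entropy-based measure of this kind concrete is to compute the mutual information between ancestral origin and allele, with the ancestral proportions m1 and m2 as prior weights: the entropy of the admixed allele frequencies minus the weighted average entropy within each parental population. The sketch below uses that form with natural logarithms; treating SIC as exactly this quantity is an assumption, and the function names are illustrative.

```python
from math import log


def xlogx(p):
    """Return p * ln(p), using the convention 0 * ln(0) = 0."""
    return p * log(p) if p > 0.0 else 0.0


def sic(p1, p2, m1):
    """Mutual information between ancestry and allele at a biallelic locus (assumed form)."""
    m2 = 1.0 - m1
    total = 0.0
    # Loop over the reference allele and the alternative allele.
    for a1, a2 in ((p1, p2), (1.0 - p1, 1.0 - p2)):
        p_admixed = m1 * a1 + m2 * a2
        total += -xlogx(p_admixed) + m1 * xlogx(a1) + m2 * xlogx(a2)
    return total


print(sic(1.0, 0.0, 0.5))  # fully diagnostic marker, equal contributions
print(sic(1.0, 0.0, 0.8))  # fully diagnostic marker, unequal contributions
print(sic(0.5, 0.5, 0.8))  # identical frequencies in both parental populations
```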
Informativeness for assignment (In) measure
$I_n$ is a mutual information-based statistic that takes into account self-reported ancestry information from the sampled individuals [10]. The informativeness for assignment of a SNP is defined as:
$$I_n = \sum_{j=1}^{N}\left(-p_j \log p_j + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}\right).$$
This formula is a generalization to more than two populations. From a likelihood perspective, it gives the expected logarithm of the likelihood ratio that an allele is assigned to one of the populations compared with a hypothetical 'average' population whose allele frequencies equal the mean allele frequency across the K populations. Its value is smaller when all alleles have similar frequencies in all populations.
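A minimal sketch of this statistic for K populations and N alleles follows, using the form given by Rosenberg et al. [10] with natural logarithms and the convention 0·log 0 = 0. The input layout (one list of allele frequencies per population) and the function names are assumptions made for illustration.

```python
from math import log


def xlogx(p):
    """Return p * ln(p), using the convention 0 * ln(0) = 0."""
    return p * log(p) if p > 0.0 else 0.0


def informativeness_for_assignment(freqs):
    """I_n for one locus; freqs[i][j] is the frequency of allele j in population i."""
    k = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        p_j = sum(pop[j] for pop in freqs) / k  # mean frequency across populations
        total += -xlogx(p_j) + sum(xlogx(pop[j]) for pop in freqs) / k
    return total


print(informativeness_for_assignment([[1.0, 0.0], [0.0, 1.0]]))  # fully diagnostic SNP
print(informativeness_for_assignment([[0.5, 0.5], [0.5, 0.5]]))  # uninformative SNP
```

The value is largest, log K, when each population is fixed for a different allele, and it is zero when all populations share the same allele frequencies.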
Data
HapMap phase III dataset
We downloaded the complete HapMap phase III genotype data (http://www.hapmap.org, release #3, May 2010) available for Yoruban population in Ibadan, Nigeria [YRI], Caucasian population from the United States with northern and western European ancestry [CEU], and the African American population from Southwest USA [ASW]. HapMap is a public resource created by the International HapMap Project to catalogue genetic variants (SNPs) that are common in human populations. The HapMap phase III release #3 contains genotypes from 147 unrelated individuals (parents) from YRI population, 113 unrelated parents from CEU population, and 87 individuals from ASW population. For the purpose of the present study, YRI and CEU populations are assumed to be the ancestors of ASW population. Two criteria were used to filter the SNPs included in the final analysis: 1) the SNP should be shared by both YRI and CEU populations, i.e., SNPs for which allele frequencies were available in both YRI and CEU populations, and 2) SNPs with missing frequency for over 10% of the samples were excluded. Furthermore, to avoid the possibility of choosing two redundant SNPs that are in strong LD (linkage disequilibrium), for each measure, we calculated the informativeness on all shared SNPs, then filtered them for the most informative ones such that the physical distance between consecutive selected SNPs must be at least 100 kb.
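The distance-based filtering step can be sketched as a greedy pass over SNPs sorted by informativeness. The exact selection procedure is not spelled out in the text, so the greedy rule, the tuple layout, and the SNP identifiers below are assumptions for illustration.

```python
def thin_by_distance(snps, min_gap=100_000):
    """Greedily keep the most informative SNPs that are at least min_gap bp apart.

    snps: iterable of (snp_id, chromosome, position_bp, score) tuples.
    Returns the kept SNP ids in decreasing order of score.
    """
    kept = []
    kept_positions = {}  # chromosome -> positions already selected
    for snp_id, chrom, pos, _score in sorted(snps, key=lambda s: s[3], reverse=True):
        positions = kept_positions.setdefault(chrom, [])
        if all(abs(pos - q) >= min_gap for q in positions):
            positions.append(pos)
            kept.append(snp_id)
    return kept


# Hypothetical SNPs: rs2 lies within 100 kb of the more informative rs1.
example = [
    ("rs1", "1", 1_000, 0.9),
    ("rs2", "1", 50_000, 0.8),
    ("rs3", "1", 200_000, 0.7),
    ("rs4", "2", 1_500, 0.6),
]
print(thin_by_distance(example))
```

The linear scan over selected positions keeps the sketch short; for genome-wide data a sorted structure per chromosome would be a more efficient choice.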
Simulated dataset
To compare marker informativeness measures in the estimation of ancestry population contribution, we simulated two artificially admixed datasets from the phased HapMap III dataset (with known allele frequencies). The first one is an admixed population from two parental populations with relatively high divergence: 113 unrelated individuals in CEU population and 113 unrelated individuals in YRI population. The second one is an admixed population from less differentiated ancestral populations: 84 unrelated individuals of Han Chinese in Beijing, China [CHB] and 86 unrelated individuals of Japanese in Tokyo, Japan [JPT]. The simulations were run using simuPOP [32, 33] for 10 generations. During the simulation, we tracked the true ancestry contributions for each individual and calculated average ancestry contributions for each of the two admixed populations.
Statistical analysis
Python (http://www.python.org) scripts were written to retrieve and pre-process SNP and frequency data. Five measures of marker ancestry informativeness were calculated for shared SNPs between CEU and YRI population and for shared SNPs between CHB and JPT population, with YRI or JPT contribution fixed as 80% for the calculation of FIC and SIC. Sensitivity analysis of different YRI contributions on the selection of AIMs was performed for FIC and SIC. For each data set, the number of alleles per locus (SNP) was coded to a string of numbers to obtain a full design matrix of alleles where the cells give the number of copies of each major allele for each individual (zero, one, or two). R (R Foundation for Statistical Computing, 2010), SAS software (SAS 9.1.3, SAS Institute Inc.), and JMP Genomics (JMP® Genomics, v.5, SAS Institute Inc.) programs were used to analyze the various measures of informativeness for ancestry.
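The genotype coding described above can be sketched as follows. The input format (two-character genotype strings with "N" marking a missing call) is a hypothetical choice for illustration and is not taken from the study's scripts.

```python
from collections import Counter


def code_snp(genotypes, missing="N"):
    """Code one SNP as the number of copies of its major allele (0, 1, or 2).

    genotypes: list of two-character strings such as "AG", one per individual.
    Genotypes containing the missing symbol are coded as None.
    """
    allele_counts = Counter(a for g in genotypes for a in g if a != missing)
    major = allele_counts.most_common(1)[0][0]  # most frequent allele in the sample
    return [None if missing in g else g.count(major) for g in genotypes]


# One column of the design matrix for five hypothetical individuals.
print(code_snp(["AA", "AG", "GG", "AA", "NN"]))
```

Applying the function to every SNP and stacking the results column by column gives the individuals-by-SNPs design matrix.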
Correlation, concordance, and overlapping analysis
To assess the level of similarity of the estimates of genetic information contained in each SNP marker across the five measures of marker informativeness, we used three statistical procedures: the Spearman correlation coefficient, Cohen's Kappa statistic, and overlapping frequency analysis of the top n ranked AIMs by different measures. Although the three approaches share some common information, each provides unique and complementary views of the behavior of the five measures. The Spearman correlation coefficient is a global measure of statistical dependence and provides a general sense of the pair-wise monotonic relationship of the five measures of marker informativeness. The Cohen's Kappa coefficient based on deciles quantifies agreement between two measures, and the corresponding mosaic plot exhibits overlapping structures, i.e., the distribution of markers according to one measure of informativeness relative to another. Finally, the overlapping frequency analysis demonstrates how often the same set of SNPs is selected by two or more different measures, or which measures tend to select the same set of SNPs.
The Spearman correlation coefficient [44] is a measure of correlation based on the ranks of the data values. It is a nonparametric alternative to Pearson's correlation coefficient and does not require knowledge of the distribution of the data. The formula for the Spearman correlation coefficient is
$$\rho = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}},$$
where $X_i$ and $Y_i$ are the ranks of the observed data values, $\bar{X}$ is the mean of the $X_i$, and $\bar{Y}$ is the mean of the $Y_i$. In case of ties, the averaged ranks are used. The Spearman correlation coefficient takes values between -1 and +1. A value of +1 or -1 indicates that the two measures are in a perfectly monotonically increasing or decreasing relationship, respectively, and a value of 0 indicates no monotonic relationship.
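The rank-based calculation can be sketched in a few lines of plain Python: convert each variable to ranks, averaging the ranks of tied values, and then compute the Pearson correlation of the ranks. The function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den


print(average_ranks([10, 20, 20, 30]))
print(spearman([1, 2, 3, 4], [1, 8, 27, 64]))  # monotone but nonlinear relationship
```

The sketch does not guard against a constant input, for which the denominator is zero and the coefficient is undefined.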
To show the distribution of markers according to one measure of informativeness relative to another, we further analyzed the data by grouping and rating SNP markers using deciles, producing mosaic plots and calculating Cohen's Kappa coefficients. Deciles are the nine values of a variable dividing its distribution into ten groups with equal frequencies. For each measure, based on its deciles we created a new categorical variable with values 1, 2, ..., 10, indicating to which group a SNP belongs. We then used the new categorical variables to build mosaic plots and to examine the relationship between measures of marker informativeness. The mosaic plots show, for example, how the top 10% of SNPs from one measure of informativeness are distributed relative to another measure of informativeness. To assess the concordance of decile-based ratings of the informativeness of AIMs between measures, we computed Cohen's kappa coefficient, a commonly used index to quantify agreement between two measurements [45]. It takes into account the concordance expected by chance and is calculated as κ = [Pr(a) - Pr(e)]/[1 - Pr(e)], where Pr(a) is the observed proportion of agreement and Pr(e) is the proportion of agreement expected by chance. The larger the kappa coefficient, the better the concordance between two measurements. Kappa cannot exceed 1: κ = 1 indicates perfect agreement, κ = 0 indicates no agreement other than what would be expected by chance, and negative values indicate less agreement than expected by chance.
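The decile grouping and the kappa calculation can be sketched as below. Assigning groups by rank position is one simple way to obtain ten groups of equal size and is an assumption here, as is the handling of ties (broken by input order); the function names are illustrative.

```python
from collections import Counter


def decile_groups(values):
    """Assign each value to a group 1..10 by rank position (ties broken by input order)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    groups = [0] * n
    for position, idx in enumerate(order):
        groups[idx] = position * 10 // n + 1
    return groups


def cohen_kappa(a, b):
    """Cohen's kappa for two categorical ratings of the same items."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


scores = list(range(20))            # hypothetical informativeness values
same = decile_groups(scores)
opposite = decile_groups(scores[::-1])
print(cohen_kappa(same, same))      # identical decile ratings
print(cohen_kappa(same, opposite))  # completely reversed decile ratings
```

In the reversed toy case the two groupings never agree, so the observed agreement falls below the chance level and kappa is negative.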
To answer the question of how often the same set of SNPs is selected by the different methods or which methods tend to select the same set of SNPs, we studied the overlap pattern of the top n AIMs selected by different measures of informativeness. Each SNP was assigned a 5-digit binary vector, where each digit represents a measure. From the first to the last these correspond to δ, FST, FIC, SIC, and In, respectively. A 1 in the digit indicates that the SNP is selected by the corresponding measure as one of the top n AIMs. For example, a binary vector 11001 represents the SNP is selected by δ, FST, and In as one of the top n AIMs, but not by FIC and SIC. For a specific n, the frequency of the different combinations of the 5-digit numbers (such as 11001 and 00110) shows how often the different methods select the same set of SNPs. The higher the frequencies, the higher the chance that the same set of SNPs are selected by the methods corresponding to the 1's in the 5-digit vector.
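The binary-vector bookkeeping can be sketched as follows. The dictionary layout, the measure keys, and the SNP names are hypothetical; only SNPs selected by at least one measure are counted, so the pattern 00000 never appears.

```python
from collections import Counter

# Order of the digits in the binary vector: delta, FST, FIC, SIC, In.
MEASURES = ("delta", "fst", "fic", "sic", "in")


def top_n(scores, n):
    """Return the n SNPs with the highest score for one measure."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def overlap_patterns(scores_by_measure, n):
    """Count how many SNPs share each 5-digit selection pattern among the top n."""
    tops = [top_n(scores_by_measure[m], n) for m in MEASURES]
    selected = set().union(*tops)
    patterns = Counter()
    for snp in selected:
        patterns["".join("1" if snp in t else "0" for t in tops)] += 1
    return patterns


# Hypothetical scores for three SNPs under the five measures.
scores_by_measure = {
    "delta": {"snpA": 0.9, "snpB": 0.5, "snpC": 0.1},
    "fst": {"snpA": 0.8, "snpB": 0.2, "snpC": 0.1},
    "fic": {"snpA": 0.3, "snpB": 0.7, "snpC": 0.1},
    "sic": {"snpA": 0.2, "snpB": 0.6, "snpC": 0.1},
    "in": {"snpA": 0.5, "snpB": 0.4, "snpC": 0.1},
}
print(overlap_patterns(scores_by_measure, n=1))
```

With n = 1, snpA is selected by δ, FST, and In (pattern 11001) and snpB by FIC and SIC (pattern 00110).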
Discriminant analysis
To compare the discrimination power of the five measures of informativeness and assess how many markers are needed for accurate ancestral CEU vs. YRI population and CHB vs. JPT population membership assignment, discriminant analysis was performed using the top 1, 2, ..., and up to 150 ranked AIMs. Discriminant analysis [46] is a method of projecting high-dimensional data onto a lower-dimensional space in such a way that data points from different classes are well separated. The projection is given by $y = w^{T}x = \sum_{i=1}^{p} w_i x_i$, where $x = (x_1, x_2, \ldots, x_p)$ is a $p$-dimensional data point (an individual with SNP information), and $y$ is the projection of $x$ onto $w = (w_1, w_2, \ldots, w_p)$, i.e., a linear combination of the $x_i$ with weights $w_i$. The weights are chosen such that the projections of the data points (individuals) in the same class (CEU or YRI population) are close to each other while those of the data points from different classes are far from each other. The linear discriminant can be derived using a measure of generalized squared distance. An optimal linear classifier can then be found by minimizing the classification error (probability of misclassification). The classifier can take into account the prior probabilities of the classes, which, in our analysis, were specified as proportional to the sample sizes in each class. A data point is classified into the class for which the posterior probability of the observation belonging to that class is the largest among all classes. Cross-validation was used to obtain prediction accuracy. The analysis was carried out using PROC DISCRIM in SAS (SAS 9.1.3, SAS Institute Inc.). We also examined the number of AIMs needed to achieve 90% or 95% classification accuracy.
Estimation of ancestry contribution in admixed ASW population
We estimated ancestry contribution for the admixed ASW population using up to 200 top ranked AIMs by different measures. We also estimated ancestry contribution using 100 sets of randomly selected 20 SNPs from the top 1%, 2%, 5%, and 10% ranked AIM panels. The analysis was performed using the software PSMIX [47]. This analysis allowed us to compare the consistency in the estimation of ancestry contribution when the number of informative markers in the pool increases or decreases. The rationale for conducting this analysis is that AIM panels generated by different measures are more similar when only the top AIMs are considered, which makes it difficult to compare their performances using only the top AIMs. By using random subsets of AIMs from the top AIM panels of various sizes, we expect to select less informative markers and the estimate of the ancestry contribution is expected to become less accurate (or more biased) with more variability. More importantly, we will be able to determine if there is clear separation in the performance of different measures based on the information content taken from similar (1% to 10%) pools of markers as determined by each method.
We constructed two new methods of ranking marker informativeness for ancestry by combining the information from all five measures. For each marker, we assigned a ranking or score based on either the average ranking (AVE) or the minimum ranking (MIN) of the five measures. We did not use the raw values from the five measures because they have different scales; thus, any score computed as a weighted average of the raw values would need to be preceded by standardization of the raw values, which is beyond the scope of this paper.
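The two combined rankings can be sketched as below, with rank 1 denoting the most informative marker under each measure. The data layout, the tie handling (ties broken by input order), and the names are assumptions for illustration; two hypothetical measures are used instead of five to keep the example short.

```python
def rank_descending(scores):
    """Map each SNP to its rank under one measure (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {snp: rank for rank, snp in enumerate(ordered, start=1)}


def combined_ranks(scores_by_measure):
    """Return (AVE, MIN): the average and minimum rank of each SNP across measures."""
    per_measure = [rank_descending(s) for s in scores_by_measure.values()]
    snps = list(per_measure[0])
    ave = {s: sum(r[s] for r in per_measure) / len(per_measure) for s in snps}
    low = {s: min(r[s] for r in per_measure) for s in snps}
    return ave, low


# Hypothetical scores on very different scales; only the ranks are combined.
toy_scores = {
    "measure_1": {"snpA": 0.9, "snpB": 0.5, "snpC": 0.1},
    "measure_2": {"snpA": 10.0, "snpB": 50.0, "snpC": 90.0},
}
ave, low = combined_ranks(toy_scores)
print(ave)
print(low)
```

In this toy case all three SNPs tie under AVE, while MIN favours snpA and snpC, each of which is ranked first by one measure.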
Estimation of ancestry contribution in simulated admixed population
To validate the ancestral estimates of the five measures, the same set of analyses described in the previous section was conducted for the two simulated admixed populations. In the simulated admixed populations, the ancestry proportion of each individual is known, as is the mean ancestry proportion across individuals in the same population. Estimation accuracy by different measures was compared at two different levels. At the population level, the estimate of the mean ancestry contribution across individuals was compared with the true value, and bias was calculated for the five measures. At the individual level, individual true and estimated admixture values were compared, and root mean square error (RMSE) was used as a summary measure of precision in the estimation of individual ancestry proportion. RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}(\hat{q}_i - q_i)^2},$$
where $M$ is the number of individuals in the sample, and $q_i$ and $\hat{q}_i$ represent the true and estimated individual ancestries, respectively. We also plotted individual estimated contributions vs. true contributions.
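The two accuracy summaries can be sketched as follows, taking bias as the estimated mean ancestry minus the true mean ancestry (the sign convention is an assumption) and RMSE as the root of the mean squared individual error. The ancestry values are made up for illustration.

```python
from math import sqrt


def bias(true_q, est_q):
    """Population-level bias: estimated mean ancestry minus true mean ancestry."""
    return sum(est_q) / len(est_q) - sum(true_q) / len(true_q)


def rmse(true_q, est_q):
    """Individual-level root mean square error of the ancestry estimates."""
    squared = sum((e - t) ** 2 for t, e in zip(true_q, est_q))
    return sqrt(squared / len(true_q))


# Hypothetical true and estimated ancestry proportions for three individuals.
true_q = [0.8, 0.7, 0.9]
est_q = [0.9, 0.6, 0.9]
print(bias(true_q, est_q), rmse(true_q, est_q))
```

The example shows why both levels are reported: the individual errors cancel in the mean, so the bias is essentially zero while the RMSE is not.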