The 1000 Genomes project chromosome-specific VCFs for the GRCh38 assembly contain between 1.1 M (chr22) and 7.07 M (chr2) variants across all 2504 individuals. After restricting to phased, biallelic SNPs with a PASS filter status (which also removes indels), we are left with between 1.05 M (chr22) and 6.78 M (chr2) variants. The experimentally phased data from the 10X Genomics platform have a different number of called variants for each sequenced individual. For chromosome 1, the number of called variants varies from 414 K to 494 K across the 28 individuals, while for chromosome 22 the number of called SNPs varies from 104 K to 120 K. After similar filtering of the experimental data, the number of biallelic PASS phased SNPs ranges from 298 K to 357 K for chromosome 1 and from 64 K to 75 K for chromosome 22.
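The filtering criteria described above can be sketched as a per-record predicate. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it assumes tab-separated VCF data lines with diploid genotypes, and the function name `keep_record` is hypothetical.

```python
def keep_record(line: str) -> bool:
    """Return True if a VCF data line is a biallelic, PASS, fully phased SNP."""
    fields = line.rstrip("\n").split("\t")
    ref, alt, filt = fields[3], fields[4], fields[6]

    # Biallelic SNP: one-base REF and exactly one one-base ALT allele.
    # Multi-allelic sites ("G,T"), indels ("AG") and symbolic alleles fail here.
    if len(ref) != 1 or len(alt) != 1:
        return False
    if ref not in "ACGT" or alt not in "ACGT":
        return False

    # Keep only records that passed all site-level filters.
    if filt != "PASS":
        return False

    # Find the GT entry in FORMAT and require a phased ("|") call in every sample.
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return False
    gt_idx = fmt.index("GT")
    for sample in fields[9:]:
        if "|" not in sample.split(":")[gt_idx]:
            return False
    return True
```

Applied to a single-sample experimental VCF, the same predicate keeps only that individual's phased calls; for a multi-sample panel VCF it requires every sample to be phased at the site, which is an assumption of this sketch.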
The SNPs from the experimentally phased VCFs (Fig. 1a), averaged over continent groups, show that the vast majority of SNPs in this selection have high continent-specific MAF values (> 5%). However, when all the SNPs in the 1000 Genomes data (filtered for biallelic PASS phased SNPs) are counted as a function of continent-specific MAF, the distribution follows a very different trend. The very low continent-specific MAF SNPs (< 0.1%) are strongly over-represented, at ∼ 5 × 10⁷, compared with every higher MAF bin, each of which contains < 1 × 10⁷ SNPs.
These discrepancies between the SNP counts in the 1000 Genomes data and in the experimentally phased data, as well as the differing trends as a function of MAF, arise because the 1000 Genomes data include a SNP if even one of the 2504 individuals carries a variant (heterozygous or homozygous-alternate) at that position, whereas the experimental data for a given individual include a SNP only if that particular individual carries a variant at that position. As a result, the 1000 Genomes data contain a much larger number of SNPs overall than the experimental data, and the majority of the 1000 Genomes SNPs have extremely low MAF, since they occur in only one or a few individuals.
Genotyping error
Genotyping error is computed by comparing the 1000 Genomes genotypes with the experimental genotypes. For each individual, the experimental genotype of any SNP not present in that individual’s experimental VCF is assumed to be homozygous reference. Mismatched genotypes are counted as errors. Figure 2a shows the error (the fraction of genotypes which are incorrect) at the experimental VCF positions as a function of the continent-specific minor allele frequency. The error at the population-invariant sites (MAF = 0.0%) is higher in the African and American populations than in the European, East Asian and South Asian populations. This correlates with a lower total number of population-invariant SNPs in those continents (Fig. 1a). For non-invariant SNPs we observe, as expected, a decreasing error rate with increasing minor allele frequency, down to a genotyping error rate of < 2% for SNPs with minor allele frequencies > 1%.
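The error-versus-MAF summary described above amounts to binning sites by MAF and computing the fraction of mismatched genotypes per bin. The sketch below is illustrative only: the bin edges, the alternate-allele-count encoding (0, 1, 2) and the function name `error_by_maf_bin` are assumptions, not the settings used for Fig. 2a.

```python
from bisect import bisect_right
from collections import defaultdict

def error_by_maf_bin(records, edges=(0.002, 0.01, 0.05)):
    """Genotype error rate per MAF bin.

    records: iterable of (maf, panel_genotype, experimental_genotype), with
             genotypes encoded as alternate-allele counts (0, 1 or 2).
    edges:   hypothetical bin edges. Bin 0 holds invariant sites (MAF == 0);
             bin i >= 1 holds sites with edges[i-2] <= MAF < edges[i-1],
             the first and last of these bins being open-ended.
    Returns {bin_index: fraction of mismatched genotypes} for non-empty bins.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for maf, g_panel, g_exp in records:
        idx = 0 if maf == 0 else 1 + bisect_right(edges, maf)
        totals[idx] += 1
        if g_panel != g_exp:
            errors[idx] += 1
    return {idx: errors[idx] / totals[idx] for idx in sorted(totals)}
```

Keeping the invariant sites in their own bin mirrors the way the MAF = 0.0% class is reported separately in the text.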
Within these errors in the experimental SNPs, we observe significantly different rates for SNPs which are heterozygous vs homozygous alternate in the experimental data (Fig. 2b). The error rate for SNPs which are homozygous alternate in the experimental data is 1.5 times the error rate for SNPs which are heterozygous in the experimental data.
We next compare false positive error rates (sites that are non-homozygous reference in the 1000 Genomes data and homozygous reference in the experimental data) with false negative error rates (sites that are homozygous reference in the 1000 Genomes data and non-homozygous reference in the experimental data) over all 1000 Genomes sites (Fig. 2c). The East Asian and South Asian populations both have mostly low false positive rates, which vary by only ~ 15% for most individuals, but show a wide range (a factor of 2) of false negative rates. In contrast, the African individuals mostly have relatively low false negative rates, but have among the highest false positive rates. This indicates that the sequencing in the 1000 Genomes project has over-called non-homozygous reference variants in African individuals compared to the rest, and over-called SNPs as homozygous reference in some of the East and South Asian individuals.
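The false positive and false negative definitions above can be made concrete with a short sketch. The source does not state the denominator of each rate; dividing by the total number of 1000 Genomes sites is an assumption made here, as are the dictionary-based inputs and the name `fp_fn_rates`.

```python
def fp_fn_rates(panel_gt: dict, exp_gt: dict) -> tuple:
    """False positive and false negative rates over all panel sites.

    panel_gt: {position: alt-allele count} for every 1000 Genomes site.
    exp_gt:   {position: alt-allele count} from one individual's experimental
              VCF; positions absent from it are taken as homozygous reference.
    Both rates are divided by the number of panel sites (an assumption).
    """
    n_sites = len(panel_gt)
    if n_sites == 0:
        raise ValueError("panel_gt is empty")
    false_pos = false_neg = 0
    for pos, g_panel in panel_gt.items():
        g_exp = exp_gt.get(pos, 0)  # missing call -> homozygous reference
        if g_panel != 0 and g_exp == 0:
            false_pos += 1  # variant in the panel, reference in the experiment
        elif g_panel == 0 and g_exp != 0:
            false_neg += 1  # reference in the panel, variant in the experiment
    return false_pos / n_sites, false_neg / n_sites
```

Heterozygous versus homozygous-alternate disagreements count toward neither rate in this sketch, since both genotypes are non-homozygous reference.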
Phasing
All phasing errors are analyzed as a function of the overall 1000 Genomes minor allele frequency, not the continent-specific MAF. Comparing the switch error across individual chromosomes (Fig. 3), we observe that the switch error ranges between 20 and 30% for the rare SNPs (MAF < 0.1%), falling to < 5% for SNPs with MAFs of 1–5%. The majority of SNPs, which fall in the MAF > 5% category, have an error < 2.5%. However, a comparatively higher switch error at larger MAF values (> 5%) is observed for chromosome 21. Figure 3 shows only a subset of chromosomes for a single individual (GM18552), but this trend is observed for all other chromosomes and individuals studied.
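The switch error is not defined in this section. A common definition, assumed in the sketch below, is the fraction of consecutive heterozygous sites at which the phase relative to the previous heterozygous site differs between the test haplotypes and the truth haplotypes. The tuple encoding and the name `switch_error_rate` are hypothetical.

```python
def switch_error_rate(truth: list, test: list) -> float:
    """Switch error between two phased call sets at the same ordered sites.

    truth, test: lists of (hap1_allele, hap2_allele) tuples, e.g. (0, 1).
    Only sites heterozygous in both call sets carry phase information.
    """
    # For each informative site, record whether the test haplotype order
    # matches the truth (True) or is flipped (False).
    orientation = []
    for (t1, t2), (s1, s2) in zip(truth, test):
        if t1 != t2 and s1 != s2:
            orientation.append(t1 == s1)
    if len(orientation) < 2:
        raise ValueError("need at least two sites heterozygous in both sets")
    # A switch occurs whenever the orientation changes between neighbours.
    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / (len(orientation) - 1)
```

Because only changes in orientation are counted, a test call set whose two haplotypes are globally swapped relative to the truth has a switch error of zero.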
Figure 4a shows the total switch error for each of the individuals. The total switch errors for all the individuals studied go up to ∼ 2.5%. The switch errors for the East Asian individuals are grouped together, while those for the South Asian individuals show greater variability. This is in line with the general observation that South Asian populations have an overall greater heterogeneity than do East Asian populations, which some of the authors have observed in ongoing studies with hundreds of individuals [J. Wall, Unpublished data].
Analyzing the switch error as a function of minor allele frequency (MAF), averaged over all chromosomes of all individuals of a population (Fig. 4b), we observe a low switch error, < 5%, for low-frequency SNPs (MAF 1–5%). For rare SNPs (MAF 0.2–1%), the switch error is ∼ 5–10%. For extremely rare SNPs, i.e. MAF < 0.2%, the error is much higher, at 15–35%. For all higher MAF values (> 5%), the error is < 2.5%. The average error rate for the individuals from the African populations is almost the same over the range of MAF values > 0.1%.
As observed in Fig. 4c, the differences in the error rates between individuals decrease with increasing minor allele frequency. Individuals from South Asia show a larger variation in error as a function of MAF than individuals from East Asia. The individuals from the African populations have the lowest switch error over the range of MAF values. Individual NA20900, from the Gujarati Indians in Houston (GIH) population, has the lowest switch error as a function of minor allele frequency for the low MAF SNPs. This individual is not part of a trio in the 1000GP data, and further analysis is required to ascertain why it shows much lower switch error than the other individuals studied. One possible explanation is that the current limited sampling of only 11 individuals from the South Asian population does not capture the full spread of error rate variation, and that including more individuals might reveal others with comparably low error rates.
We also analyzed phasing error as a function of the distances between SNPs (Fig. 5). The phasing error increases as a function of the inter-SNP distance, i.e. SNPs which are further apart are more likely to be out of phase with each other. The within population trends are the same as for switch error vs MAF, where the individuals from South Asia show a larger spread as compared to the individuals from East Asia. Individual NA20900 shows the lowest error rate, same as for the comparison of error vs MAF (Fig. 4c).
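One way to picture the distance-dependent phasing error is to ask, for every pair of heterozygous SNPs, whether the two sites are in the same relative phase in the test haplotypes as in the truth, and then to bin pairs by their separation. The sketch below is an assumption about how such a curve could be computed, not the authors' procedure; the bin size and the name `pair_phase_error_by_distance` are hypothetical, and the all-pairs loop is quadratic, so it is only suitable for small illustrations.

```python
from collections import defaultdict
from itertools import combinations

def pair_phase_error_by_distance(positions, orientation, bin_size):
    """Fraction of out-of-phase SNP pairs per inter-SNP distance bin.

    positions:   base-pair coordinates of sites heterozygous in both call sets.
    orientation: one bool per site, True if the test haplotype order matches
                 the truth at that site and False if it is flipped.
    bin_size:    width of each distance bin in base pairs (hypothetical).
    Returns {bin_index: error fraction}; bin k covers distances
    [k * bin_size, (k + 1) * bin_size).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    sites = zip(positions, orientation)
    for (p1, o1), (p2, o2) in combinations(sites, 2):
        k = abs(p2 - p1) // bin_size
        totals[k] += 1
        if o1 != o2:  # the pair is out of phase relative to the truth
            errors[k] += 1
    return {k: errors[k] / totals[k] for k in sorted(totals)}
```

Under this definition a single switch between two distant SNPs puts every pair that spans it out of phase, which is consistent with the error growing with inter-SNP distance.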
Comparing the switch error as a function of MAF with the switch error as a function of inter-SNP distance, we see that the individuals from the African populations show distinctly opposite trends. For low MAF SNPs, the error averaged over the African individuals is the lowest, while across the range of inter-SNP distances, the error averaged over the African individuals is the highest. This can be understood from the fact that there is a higher number of low MAF SNPs in the African individuals in the experimental data (Fig. 1a), as well as a higher number of SNPs overall, leading to a higher SNP density for these individuals. In addition, there is less linkage disequilibrium (LD) in the individuals from the African populations, which makes them harder to phase accurately [23, 24]. Hence, pairs of SNPs are more likely to be out of phase with each other, leading to higher switch error as a function of inter-SNP distance.
Imputation
Imputation error is computed as the fraction of SNPs with incorrectly imputed genotypes (genotype discordance). However, depending on the subset of SNPs under consideration, the error can be computed in two different ways: (1) as the fraction of experimental SNPs incorrectly imputed, or (2) as the fraction of all 1000GP SNPs incorrectly imputed. Under the second definition, the experimental calls for all positions not in the experimental VCFs are assumed to be homozygous-reference.
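The two definitions differ only in the set of sites that forms the denominator. The following sketch computes both from the same inputs; the dictionary encoding (position to alternate-allele count) and the name `imputation_errors` are assumptions made for illustration.

```python
def imputation_errors(imputed_gt: dict, exp_gt: dict) -> tuple:
    """Genotype discordance under both definitions.

    imputed_gt: {position: alt-allele count} imputed at every 1000GP site.
    exp_gt:     {position: alt-allele count} from the experimental VCF.
    Returns (error over experimental SNPs, error over all 1000GP SNPs).
    """
    # Definition 1: only positions present in the experimental VCF.
    exp_sites = [pos for pos in exp_gt if pos in imputed_gt]
    if not exp_sites or not imputed_gt:
        raise ValueError("no overlapping sites to compare")
    wrong_exp = sum(1 for pos in exp_sites if imputed_gt[pos] != exp_gt[pos])

    # Definition 2: every 1000GP position; positions missing from the
    # experimental VCF are treated as homozygous reference (0).
    wrong_all = sum(
        1 for pos, g in imputed_gt.items() if g != exp_gt.get(pos, 0)
    )
    return wrong_exp / len(exp_sites), wrong_all / len(imputed_gt)
```

In the toy case of one wrong call among two experimental SNPs and eight panel sites, the first definition gives 0.5 and the second 0.125, because the many correctly imputed homozygous-reference sites enlarge the denominator.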
Figure 6a shows the total imputation error in the experimental SNPs, while Fig. 6b shows the total imputation error in the 1000GP SNPs, for each of the individuals. The total imputation errors in the experimental SNPs for all the individuals studied go up to ∼ 4%. For this subset of SNPs, the two American individuals have among the highest imputation errors. The imputation errors for the East Asian individuals are grouped together, while those for the South Asian individuals show greater variability. This agrees with our observations for the switch error (Fig. 4a). In the 1000GP SNPs, on the other hand, since we are looking at a much larger set of SNPs, most of which are homozygous-reference in any given individual, we see a much smaller error, < ∼ 1%.
Imputation error in experimental SNPs
Figure 7a shows the imputation error rates as a function of the continent-specific minor allele frequency. The continent-invariant positions (MAF = 0.0%) are imputed almost as accurately as the high MAF (> 5% in 3 populations, and > 1% in two populations) SNPs. At these positions, we make the same observation as we did for the original genotyping in the 1000 Genomes reference data (Fig. 2a), i.e. the errors in the European, East Asian and South Asian individuals are lower than those in the American and African individuals. For the very rare SNPs, i.e. MAF < 0.2%, the error is as high as ∼ 60%. These extremely high error rates are only observed in the American individuals and a few of the South Asian individuals; for the rest of the individuals, the error rates are < 50%. While these error rates seem high, a likely explanation is that the imputation method infers each allele by finding the most likely haplotype match in the reference database for the individual being imputed [11]. For a SNP with a rare variant, the best matching haplotypes are likely to contain the reference allele, leading to a prediction of a homozygous reference genotype at that position. However, the experimental VCFs only include positions at which that particular individual has a non-homozygous reference genotype. As a result, any prediction of a homozygous reference genotype is counted as an error, leading to comparatively high error rates at these very low MAF values. In the mid-range of MAF values, i.e. 0.2 to 1%, the errors range between 10 and 20%. Genotypes at SNPs with higher MAF values are imputed fairly accurately, with errors < 2% for common SNPs (MAF > 5%). This can also be seen by looking at all the individuals separately (Fig. 7b). The South Asian (Gujarati in Houston, Texas) individual NA20900 still shows the lowest error rate as a function of MAF for imputation, just as it does for the switch error (Fig. 4c).
Out of the imputed experimental SNPs, a very small fraction have low imputation INFO scores (Additional file 1: Figure S1a). However, most of those are SNPs which are imputed incorrectly, hence filtering out low INFO score SNPs gives much smaller error rates throughout the range of MAF values (Additional file 1: Figure S2b).
Imputation error in all 1000GP SNPs
Computing the error using all the 1000GP SNPs, we see a different trend for the errors as a function of minor allele frequency (Fig. 8a, b). The invariant sites have very low errors, ∼ 10⁻⁴. For the variant sites, the errors increase as a function of minor allele frequency, rather than decreasing as they do for the experimental-only SNPs. The reason becomes clear when the numbers of experimental SNPs (Fig. 1a) are contrasted with the numbers of all 1000GP SNPs (Fig. 1b): in the experimental data, the number of low MAF SNPs is 1–2 orders of magnitude smaller than the number of SNPs with MAF > 5%, whereas in the whole 1000 Genomes data the number of very low MAF SNPs is 2–10 times greater than the number of SNPs with MAF > 5%. The vast majority of the very low MAF SNPs in the whole 1000 Genomes data are homozygous-reference in any given individual, since those SNPs show variation in only one or very few 1000 Genomes individuals. Hence, imputation predicts most of those positions correctly in most of the individuals. As a result, the fraction of those very rare SNPs which are predicted incorrectly is much lower when considering all the 1000 Genomes SNPs than when considering only the experimental SNPs, where most of the SNPs are high MAF SNPs. However, it is important to note that many of the low MAF SNPs have low INFO scores for imputation (Additional file 1: Figure S1b). Hence, filtering out SNPs with low INFO scores yields a decreasing error rate with increasing MAF, as expected (Additional file 1: Figure S3b).
Consistent with the observations for the experimental only SNPs, at very rare SNPs (MAF < 0.2%), the American individuals still have the highest error rate. The individuals from the South Asian populations still show a greater spread than those from the East Asian populations. Individual NA20900 still shows the lowest error rate as with previous observations.
An alternative measure of imputation accuracy is genotype r². Figure 9 shows r² as a function of the alternate allele frequency (AAF), as opposed to the minor allele frequency. This enables comparison with the imputation accuracies reported in the 1000GP phase 3 paper [3]: at very low alternate allele frequencies, we see higher accuracies for EAS individuals and lower accuracies for AMR individuals than the previously reported values. The accuracies for SNPs with AAF > 1% are consistent with the values previously reported in the 1000GP phase 3 paper. Consistent with the observations for genotype discordance, the r² values show the lowest accuracy for the American individuals at low alternate allele frequencies.
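Genotype r² is usually taken to be the squared Pearson correlation between true and imputed genotypes, here encoded as alternate-allele counts. The sketch below assumes that definition and assumes the genotypes being correlated have already been grouped (for example, all sites in one AAF bin); how the grouping was done for Fig. 9 is not stated in this section, and the name `genotype_r2` is hypothetical.

```python
def genotype_r2(true_gt: list, imputed_gt: list) -> float:
    """Squared Pearson correlation between true and imputed genotypes.

    true_gt, imputed_gt: equal-length sequences of alt-allele counts or
    dosages (0..2). Returns NaN when either sequence has zero variance,
    since the correlation is undefined in that case.
    """
    n = len(true_gt)
    if n != len(imputed_gt) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_t = sum(true_gt) / n
    mean_i = sum(imputed_gt) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(true_gt, imputed_gt))
    var_t = sum((t - mean_t) ** 2 for t in true_gt)
    var_i = sum((i - mean_i) ** 2 for i in imputed_gt)
    if var_t == 0 or var_i == 0:
        return float("nan")
    return cov * cov / (var_t * var_i)
```

Unlike genotype discordance, this measure is not inflated by the many sites at which both call sets are homozygous reference, which is one reason it is often preferred at low allele frequencies.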
Comparison of reference panels
Here, we compare the imputation errors resulting from the use of different reference panels. Three panels are considered: a continent-specific reference panel matching the individual of interest, a reference panel which includes all of the 1000 Genomes individuals, and a continent-specific reference panel for a continent other than the individual’s own. The minor allele frequencies used here are the overall 1000 Genomes minor allele frequencies rather than continent-specific ones, since we want to understand the impact of the choice of reference panel, and continent-specific MAFs would not align with the whole reference or with the reference from another continent. We used the South Asian reference panel as the different-continent panel and estimated imputation accuracies for all the other individuals using the reference panel corresponding to that individual’s continent group, the South Asian reference panel, and the whole 1000G reference panel.
The result of comparing reference panels for the AFR, AMR, EUR, and EAS individuals using the experimental-only SNPs (Fig. 10a) is very similar to that obtained using all 1000 Genomes SNPs (Fig. 10b). Using the entire 1000 Genomes data as a reference panel gives almost identical imputation accuracy to using the continent-specific reference panel corresponding to the individuals in 3 of the 4 continent groups. For the AMR individuals, however, the full 1000G reference panel gives a marked improvement over the AMR-specific reference panel. The error when using an incorrect reference panel, in this case the SAS panel, is a factor of 2 or more greater than the error when using the appropriate reference or the whole 1000 Genomes reference panel. In particular, the choice of the SAS panel gives significantly the highest error rate for the AFR individuals. The trend of error as a function of MAF for all 1000G SNPs is, again, the opposite of what was observed when looking at only the experimental SNPs, as discussed previously.