Whole-genome sequencing and variant calling
We obtained an average 566,035,688 read pairs per sample (range 495,617,772 − 735,313,000) for each of the 9 individuals with WGS data. These reads were aligned to MacaM [17], to produce an average 27X coverage across the genome (range 24-33X). From these reads, a total of 10,193,425 high-confidence SNVs were identified across all 9 individuals, with an average of 5,037,341 variants detected per individual. The transition/transversion ratio (Ti/Tv) observed in this study was 2.17, consistent with observations in larger macaque cohorts (unpublished data). This set of sites served as the source of our optimal dense marker set, as described in Methods.
Genotyping-by-sequencing and variant calling
For each of the 16 pedigree members, we prepared and sequenced GBS libraries based on individual digests for both BglII and PstI. Among BglII libraries, we obtained an average 3,754,352 reads per sample, resulting in an average 4,734,354 base-pairs (bp) from 45,597 fragments with at least 20X coverage per sample (equivalent to 0.17 % of the genome). In contrast, among PstI libraries, we obtained an average 5,686,709 reads per sample, resulting in an average 9,772,130 bp from 134,314 fragments with ≥20X coverage per sample (equivalent to 0.35 % of the genome) (Fig. 2a-b). Notably, although the PstI libraries originally had ~1.5X more reads than BglII libraries, they had ~3-fold the number of fragments with high-depth coverage. While the vast majority of GBS fragments were adjacent to the predicted restriction enzyme cleavage site, a small number appeared to be distant from these sites (Fig. 2c-d). While these results may reflect off-target sequencing, it is also possible that they reflect genomic variation altering restriction sites in one or more individuals.
We next determined the number of high-quality SNVs available from each digest that would be suitable for use as framework markers in imputation. We first restricted SNVs to those with >20X coverage, leaving a total of 29,449 high-confidence variants in the BglII samples, and 62,619 variants in PstI samples. We further restricted these SNVs to only those that were concordant between WGS and GBS data, which included 98.0 % of SNVs for BglII (range 97.7–98.4 % among individuals) and 98.7 % of SNVs for PstI (range 98.5–98.9 % among individuals). To maximize the probability of the variant being present in as many animals as possible, we also restricted SNVs to only those with MAF >0.25, which further reduced these numbers to an average of 7,399 variants per sample for BglII (among 10,775 sites across all samples) and 22,455 variants per sample for PstI (among 37,505 sites across all samples), with an average distance between SNVs of 259,690 bp and 78,785 bp, respectively (see Fig. 2e-f). We were not able to call genotypes in all individuals for all SNV sites, due to variation among individuals in sequence quality at each site. From the PstI data, all individuals had sufficient data to call genotypes at an average 15,516 (~69 % of 22,455) of these SNVs, but only an average of 4,418 (~60 % of 7,399) of these sites could be called for all individuals from the BglII data. Based on the significantly greater numbers of high-quality SNVs across the macaque genome available from PstI sequence data, we chose this enzyme for all final imputation analyses.
Imputation accuracy on chromosome 19 across 3 different strategies for selecting WGS individuals
Our initial tests of imputation focused on chromosome 19 as a small chromosome “test case” in which to validate a large-scale genome-wide approach. To characterize the set of framework markers needed to facilitate imputation on this chromosome, we selected 833 variants spaced ~65 kb apart, from the set of high-MAF sites on this chromosome, characterized as described above. To characterize the set of dense markers to be imputed on this chromosome, we selected 5,010 variants, spaced ~10 kb apart, from the total set of 332,260 sites discovered from WGS data on this chromosome. This reduced and optimized set of dense markers on chromosome 19 was used in all imputation analyses on this chromosome.
Using these chromosome-specific framework and dense marker sets, we evaluated the “Bottom of Pedigree”, the “Founders”, and the “GIGI-Pick” strategies for selecting the 3 most informative of the 9 individuals with WGS data, followed by imputation of dense markers into the remaining 13 individuals, based on their framework marker data from GBS. We imputed genotypes using both genotype-calling methods available in the GIGI software, i.e., the “most likely” genotype for each site (Fig. 3A, C), or assigning only those full or partial genotypes that exceeded a hard probability “threshold” of 0.98 for calling both alleles, or 0.99 for calling a single allele (Fig. 3B, D). The GIGI-Pick selection strategy produced higher median accuracy of imputed genotypes than either of the other strategies, at 86.7 % (“most likely” method, ML) or 99.9 % accuracy (“threshold” method, THR), compared to median accuracy of 81.9 % (ML) or 99.3 % (THR) in the “Bottom of Pedigree” strategy, and 81.9 % (ML) or 99.8 % (THR) in the “Founders” strategy. By design, the “most likely” method will always call a genotype at all markers attempted (Fig. 3C). In contrast, while the “threshold” genotype calling method resulted in considerably higher accuracy across all selection strategies than the “most likely” method, the number of genotypes called was also reduced considerably, with the proportion of markers imputed ranging from a median 29.8 % in “Bottom of Pedigree”, to 32.7 % in “Founders”, and 46.4 % in GIGI-Pick. Within the GIGI-Pick strategy, individuals A, D, and E display lower accuracy, and a lower proportion of imputed variants, likely due to being unrelated to all individuals with WGS in this analysis, i.e., B, H, and J. We note that an unavoidable limitation of our analyses is the larger number of dense variants used to estimate accuracy between individuals with both WGS and GBS data, compared to those with only GBS data. This limitation may explain the slightly lower accuracy of individuals G and N, in whom small differences were magnified relative to individuals having WGS data for comparison.
Imputation accuracy on chromosome 19 with increasing numbers of WGS individuals
Using the same chromosome 19 framework and dense marker sets, we used the GIGI algorithm to impute our dense markers from 1 to 9 WGS individuals into the remaining 7–15 pedigree members with GBS data, in the consecutive order B, H, J, F, M, K, P, C, and D as ranked by the GIGI-Pick algorithm. We called genotypes using both the “most likely” method (Fig. 4A, C), and the hard probability “threshold” method (Fig. 4B, D). Under the “most likely” method, median accuracy increased from 82.6 to 86.6 % as 1–3 individuals with WGS were added; however, these increases plateaued quickly with only minor gains after 3 WGS individuals were used (Fig. 4A). Maximum median accuracy of 87.2 % was achieved at 5 WGS individuals. We note that even as more WGS individuals were included, two individuals consistently had lower accuracy than other individuals more closely related to the rest of the pedigree. Under the “most likely” method, individual D displayed 76 % accuracy across 1–5 WGS individuals, which increased to 84 % at 6–8 WGS individuals, due to the inclusion of WGS data from K, the child of D. Individual E had 77 % accuracy across all WGS scenarios. As expected, using the “threshold” method, accuracy was considerably higher, with median accuracy of 99.9 % in all scenarios. Under this method, while there were relatively minor differences in accuracy among all scenarios, the median fraction of markers imputed increased significantly from 30.5 to 46.3 % as the first 3 WGS individuals were included, reaching a plateau at 50.8 % when 4–5 WGS individuals were used. Again, founders D and E had the fewest variants called, with only 32.9 % and 21.2 % of positions called, respectively. Based on the optimal tradeoff between accuracy and number of genotypes imputed that was suggested by these results, we selected a ratio of 4 individuals with WGS per 12 individuals with GBS using the”threshold” method for all subsequent analyses.
Imputation accuracy with varying density of framework markers
As the number of framework markers increased from 325 up to the maximum available of 2,737, there was little variability in median accuracy of imputed genotypes, or in the fraction of markers imputed. Based on the “threshold” genotype calling method, median accuracy was 99.9 % for all marker panels, and the median fraction of markers imputed was 44.7 % for Panel A, 48.0 % for Panel B, 48.2 % for Panel C and 48.0 % for All Sites (Fig. 5A-B). Only individuals E and N showed initial improvement in accuracy with increasing marker density, but these improvements plateaued quickly. Individuals E, N, and G consistently exhibited lower accuracy overall than all other individuals across all marker density panels (again likely due to differences in the number of markers used to compare accuracy between WGS and GBS individuals), while individuals D, E, and M were consistently more poorly imputed than all others. For individuals D and E, this is likely due to being unrelated to any of the 4 WGS individuals used for imputation, i.e., B, H, J, and F. Importantly, both accuracy and the fraction of markers imputed were relatively consistent across a range of framework marker sets, and robust to a wide range of missing genotype data. Gains in accuracy and fraction of markers imputed were negligible when using all GBS data available.
Imputation rate and accuracy across allele frequencies on chromosome 19
Using the threshold genotype calling method, among a total 96 instances on this chromosome, private/rare alleles were imputed a total of 38 times (~40 %), with 100 % accuracy, but were not called 58 times (~60 %; see Additional file 1: Figure S1). Across all other allele frequencies, only slightly lower accuracies were obtained (>99.7 %), while imputation rates increased with allele frequency, as expected. Alleles at frequencies of 0.00 − 0.25 were imputed 20.7 % of the time; at 0.25 − 0.50 frequencies, 30.3 % of the time; at 0.50 − 0.75 frequencies, 30.3 % of the time; and alleles with frequencies >0.75 were imputed 47.2 % of the time (see Additional file 1: Figure S1).
Imputation of dense markers across the genome
To evaluate our imputation strategy across the whole genome, we employed the same criteria outlined above to generate framework marker sets for each of the 20 macaque autosomes. The number of framework markers per chromosome ranged from 435 to 1,450, totaling 17,158 across the genome, with mean spacing between framework markers among all chromosomes of ~159 kb (81 kb-253 kb). At this stage, we modified our dense marker set to a total possible 7,655,491 sites across the macaque genome, by including variants discovered previously from WGS data in an additional 6 unrelated ONPRC Indian rhesus macaques (in preparation), and by removing any variants with insufficient allele frequency information to allow imputation. Based on the strategy identified by our analyses as optimal, we used the first 4 individuals among our 9 with WGS as ranked by the GIGI-Pick algorithm, and imputed variants at this comprehensive set of dense markers into the remaining 12 pedigree members, based on the “threshold” calling method (Fig. 6). Per chromosome, median accuracy ranged from 99.4 to 99.8 % and the median fraction imputed ranged from 47.1 to 48.8 % with an average of 3,680,237 correctly imputed variants per subject. However, several individuals had significantly lower overall numbers of variants imputed at this level of accuracy, including D, E, K, and M (range 1,563,908-2,524,889), and individual E also consistently exhibited the lowest accuracy across all chromosomes at 96.3-98.5 %. These data were imputed using a probability threshold of 0.98 or 0.99 to call both or only one allele in a genotype, respectively; at a more relaxed threshold of 95 % confidence required to call both alleles, and 98 % confidence to call a single allele, we obtained an average 4,458,883 (~58 %) correctly imputed variants with an overall accuracy of > 97 %.
Genome-wide imputation rate and accuracy across allele frequencies
Using the threshold calling method, 17.3 % of 27,761 private/rare alleles were imputed with 99.9 % accuracy. Across all other allele frequencies, only slightly lower accuracies were obtained (> 99.7–99.9 %), while imputation rates increased with allele frequency, as expected. Alleles at frequencies of 0.00 − 0.24 were imputed 15.5 % of the time; at 0.25 − 0.49 frequencies, 23.1 % of the time; at 0.50 − 0.74 frequencies, 27.3 % of the time; and alleles with frequencies ≥0.75 were imputed 44.3 % of the time.
Post hoc comparisons of this approach to low-coverage WGS sequencing
After completing the analyses described above, we compared the genotype accuracy and density described in our WGS/GBS approach, to other low-coverage sequencing approaches, with or without imputation, but limited to a similar total sequencing cost. As of this writing, low-coverage sequencing remains more expensive than GBS per individual, so alternative, equivalent-cost approaches that substitute low-coverage sequencing would need to compensate by reducing or eliminating higher-coverage sequencing elsewhere in the study design. For example, one individual with 16X coverage may be used to impute into four individuals with 4X data. To simulate this 16X/4X imputation situation, we down-sampled the 30X WGS data from our analysis to 16X or 4X coverage and used these data to call genotypes as described above. We then compared the accuracy and number of genotypes obtained from these down-sampled data to the original 30X data. As shown above, only a few hundred framework markers per chromosome are sufficient for imputation. While the simulated 4X data provided genotypes at more than enough sites per sample to support imputation, we note that overall genotype concordance between 4X and 30X data was comparatively low (96.8 %), due to the lower read depth of the 4X data. We did find very high concordance between genotypes called in the 16X and 30X data (>99 %); however, the total number of genotypes that could be called was significantly greater in the 30X data, with an average of 8,549,800 per sample, compared to only 5,030,830 per sample in the 16X data, i.e., a 70 % increase. This result suggests that including animals with higher-coverage sequencing will allow the imputation of significantly more genotypes, more extensively throughout the pedigree, than a moderate/low coverage imputation approach. We note that another possible imputation approach using one 28X individual to impute variants into 4-1X individuals would cost the same as the 16X/4X approach (i.e., both total to 32X coverage), but would require high-confidence information a priori on colony or population allele frequencies in order to call variants with sufficient accuracy from 1X data to produce framework markers. Similarly, sequencing all 5 animals at 6X without imputation would require the same high-confidence information a priori. This information does not currently exist for the rhesus macaque, and many novel model organisms face this same limitation.