Whole-genome characterization in pedigreed non-human primates using Genotyping-By-Sequencing and imputation

Background Rhesus macaques are widely used in biomedical research, but the application of genomic information in this species to better understand human disease is still undeveloped. Whole-genome sequence (WGS) data in pedigreed macaque colonies could provide substantial experimental power, but the collection of WGS data in large cohorts remains a formidable expense. Here, we describe a cost-effective approach that selects the most informative macaques in a pedigree for whole-genome sequencing, and imputes these dense marker data into all remaining individuals having sparse marker data, obtained using Genotyping-By-Sequencing (GBS). Results We developed GBS for the macaque genome using a single digest with PstI, followed by sequencing to 30X coverage. From GBS sequence data collected on all individuals in a 16-member pedigree, we characterized an optimal 22,455 sparse markers spaced ~125 kb apart. To characterize dense markers for imputation, we performed WGS at 30X coverage on 9 of the 16 individuals, yielding ~10.2 million high-confidence variants. Using the approach of “Genotype Imputation Given Inheritance” (GIGI), we imputed alleles at an optimized dense set of 4,920 variants on chromosome 19, using 490 sparse markers from GBS. We assessed changes in accuracy of imputed alleles, 1) across 3 different strategies for selecting individuals for WGS, i.e., a) using “GIGI-Pick” to select informative individuals, b) sequencing the most recent generation, or c) sequencing founders only; and 2) when using from 1-9 WGS individuals for imputation. We found that accuracy of imputed alleles was highest using the GIGI-Pick selection strategy (median 92%), and improved very little when using >4 individuals with WGS for imputation. We used this ratio of 4 WGS to 12 GBS individuals to impute an expanded set of ~14.4 million variants across all 20 macaque autosomes, achieving ~85-88% accuracy per chromosome. Conclusions We conclude that an optimal tradeoff exists at the ratio of 1 individual selected for WGS using the GIGI-Pick algorithm, per 3-5 relatives selected for GBS, a cost savings of ~67-83% over WGS of all individuals. This approach makes feasible the collection of accurate, dense genome-wide sequence data in large pedigreed macaque cohorts without the need for expensive WGS data on all individuals.

selection strategies, below), employing algorithms that assess Mendelian consistent error both 164 pairwise between relatives and within families, as implemented in PedCheck [15] and  software. No significant departures from expected patterns of allele-sharing were 166 noted, confirming the validity of the pedigree configuration depicted in Fig. 1. Nine animals were 167 selected using an ad hoc approach for whole-genome sequencing in this study, based on their 168 position within the pedigree. Genotyping-By-Sequencing (GBS): To determine the optimal restriction enzymes for conducting 176 GBS in rhesus macaques, we first performed in silico prediction of cut sites using the most 177 recent build of the macaque genome [17] that would be expected to produce 60,000-100,000 178 DNA fragments in the 200-500 bp size range, while also minimizing the presence of repeat 179 sequences (e.g. retrotransposons, DNA satellites). We initially tested the enzymes ApeKI, BglII,180 EcoRI,HindIII,PspXI,PstI,and SalI,ultimately selecting BglII and PstI as the two enzymes 181 most likely to meet these criteria. We then generated GBS libraries based on these 2 enzymes 182 using a modified version of the method described by Elshire et al. [18]. Specifically, to create the 183 adaptors, oligonucleotides for the top and bottom strands for each barcoded adaptor and for the 184 two common adaptors (one for BglII and one for PstI) were paired and annealed in 1X 185 Annealing Buffer (20mM NaCl, 10mM Tris-HCl pH 7.5, 2mM MgCl 2 ) using a thermal cycler (3 186 min at 95C, ramp down 1.6C/min for 44 cycles, cool to 4C). All adaptors were quantified with the 187 Qubit Broad Range dsDNA Assay (Invitrogen). Each of the 32 barcoded adaptors was then 188 paired with a common adaptor at a 1:1 ratio. Each of the 16 genomic DNAs was digested with volume) were incubated for 2 hours at 37°C, and digests were ligated (400U T4 DNA Ligase 191 (New England Biolabs) to adaptor mixes (4.5ng BglII, 15ng PstI, in 50ul volume) for 1 hour at 192 22°C. Four (4) ul from each ligation reaction was combined into two separate pools, one per 193 enzyme. Both pools were cleaned with DNA Clean and Concentrator (Zymo Research) and 194 eluted in 50uL. Following amplification parameters in Elshire et al [18], PCR was performed on 195 10 ul of each pool (Q5 High Fidelity 2X MM (New England Biolabs), 25 pmol of each primer, in 196 50 ul volume) using Primers A and B, to  develop an analytical pipeline that could be applied to all the remaining chromosomes. We 234 performed imputation using the method of GIGI ("Genotype Imputation Given Inheritance"; [13]), 235 as this method has been successfully used to impute genotypes with high accuracy in extended 236 human pedigrees. This approach infers inheritance vectors (IVs, representing shared 237 chromosomal segments) at sparse marker locations conditioned on observed sparse marker 238 genotypes, and then infers IVs at dense marker locations conditioned on the sparse marker IVs, 239 together with the genetic map. The probability distribution is then estimated for each missing genotype at a dense marker position, conditioned on observed genotypes at all dense marker 241 positions, corresponding allele frequencies, and IVs corresponding to dense markers. In the 242 last step, genotypes may be called using these estimated probabilities, based on user-defined 243 thresholds. We estimated inheritance vectors according to the algorithm of [29], as 244 implemented using a Markov-Chain Monte-Carlo (MCMC) sampler in the gl_auto function in the 245 software package for genetic epidemiology MORGAN 3; available at [30]. The GIGI approach 246 has been implemented in a software package of the same name, and is available at [31]. 247

248
To characterize the sparse set of markers needed to facilitate imputation of dense marker 249 genotypes on chromosome 19, we identified a set of markers that could be detected reliably by 250 GBS for as many macaques as possible. Accordingly, we selected a set of high-quality SNVs 251 that, 1) were sequenced to at least 20X depth across the majority of GBS libraries, 2) were 252 spaced evenly across the genome, 3) had minor allele frequencies (MAF) >0.25, and 4) were in 253 excess of what was needed to meet the desired goal of ~0.5-1.0 cM average marker spacing. 254 We refer to these as "framework" markers, as discussed in Cheung et al., 2013 [13]. Using this 255 approach, the desired spacing can be maintained in an approximate fashion, even when 256 individuals are missing a substantial amount of genotype data, an outcome characteristic of the 257 GBS method [32]. Second, we selected a non-overlapping set of SNVs from our WGS data that 258 were more densely distributed than the framework markers, designated as our "dense" markers, 259 and which we attempted to impute into animals having only sparse framework marker 260 genotypes from GBS. These dense markers were selected from the set of all high-confidence 261 SNVs identified in our cohort. 262

263
To determine the success of imputing dense marker data into animals having only sparse 264 framework marker data, we evaluated the accuracy of imputed alleles for each recipient based 265 on their framework marker data obtained by GBS. Here, we define accuracy as the proportion of alleles imputed correctly among all attempted allele calls at that position, such that a correctly 267 imputed allele is concordant with the allele call from either WGS or GBS sequence data. We 268 additionally define rare variants as those having only 1 copy among a total of 30 chromosomes 269 (i.e., singletons, present at ~3% frequency) with WGS data that we have studied to date, 270 including the 9 individuals discussed in this paper and an additional 6 unrelated rhesus 271 macaques (unpublished data). We define accuracy of imputation for rare variants as the 272 proportion of rare alleles imputed correctly among all rare variant heterozygotes called from 273 either WGS or GBS sequence data. 274

275
We assessed accuracy of imputed variants over a range of 1-9 animals with dense marker data 276 from WGS, imputed into the remaining pedigree members using only sparse data from GBS. iteratively, until all 9 animals with WGS were used to impute genotypes into the remaining 7 283 animals in the pedigree. We used the GIGI-Pick algorithm [14] to rank our 9 animals with WGS. 284 This algorithm calculates a metric of coverage, defined as the expected percentage of allele 285 copies called for a variant at a random locus, conditional on fixed IVs for a specific choice of 286 individual(s), and then iteratively selects those individuals with the highest coverage, calculated 287 by integrating over all possible genotype configurations within a given pedigree. This algorithm 288 is implemented in the suite of software based on the GIGI approach, and is available at [33]. We 289 evaluated accuracy of imputed genotypes for each of our recipient macaques by masking all 290 non-framework genotypes in recipient animals, and comparing imputed genotypes to masked 291 genotypes obtained from either WGS or GBS data, depending on the data available for each 292 recipient. Specifically, imputed genotypes were compared to genotypes from WGS where 293 available, but for recipients with only GBS data available, imputed genotypes were compared to 294 genotypes at any SNVs with coverage by GBS that were not designated as framework markers. 295 Imputed genotypes were called as the most probable genotype at that position, using allele 296 frequencies established from all Indian-origin rhesus macaques sequenced to date at the 297 ONPRC as a reference (n=15, including the 9 animals from this study and 6 additional unrelated 298 animals). 299 300 To evaluate differences in imputation success associated with different sequencing strategies, 301 we compared the accuracy of genotypes imputed by GIGI on chromosome 19, among 3 302 different sequencing strategies, including GIGI-Pick and two common heuristic methods. These 303 2 heuristic methods include whole-genome sequencing of, 1) pedigree founders only, or 2) the 304 most recent generation (i.e., individuals typically located at the bottom of the pedigree). To 305 compare the different selection strategies, we examined accuracy for the scenario in which 306 dense markers from 3 animals selected for WGS are imputed into the remaining 13 pedigree 307 members with GBS data, based on using individuals B, H, and J (GIGI-Pick selections), B, C, 308 and D ("Founders"), and M, P, and K ("Pedigree bottom") strategies (see Fig. 1). Imputed 309 genotypes were called using 2 different strategies available in the GIGI algorithm, i.e., 310 genotypes were either above the default probability threshold, or simply as the most probable 311 genotype at that position ("Threshold" vs. "Most Likely", respectively), using allele frequencies 312 established from all Indian-origin rhesus macaques sequenced to date at the ONPRC as a 313 reference. To assess the utility of the GIGI imputation approach for imputing low-frequency 314 variants, we further evaluated genotype accuracy within the GIGI-Pick strategy of 3 WGS into 315 13 GBS described above. Notably, although the PstI libraries originally had ~1.5X more reads than BglII libraries, they had 336 >3-fold the number of fragments with high-depth coverage. Virtually all GBS fragments were 337 adjacent to their predicted restriction sites, but a small number appeared to be distant from 338 these sites (Fig. 2C-D). While these results may reflect off-target sequencing, it is also possible 339 that they reflect restriction sites in the genome of one or more individuals not predicted by the 340 current macaque reference genome. 341

342
We next determined the number of high-quality SNVs available from each digest that would be 343 suitable for use as framework markers in imputation. We first restricted SNVs to those with >20X coverage, leaving a total of 52,500 high-confidence variants in the BglII samples, and 345 239,428 variants in PstI samples. We further restricted these SNVs to only those that were 346 concordant between WGS and GBS data, which included 96.3% of SNVs for BglII (range 95.9-347 97.2% among individuals) and 96.6% of SNVs for PstI (range 96.1-97.2% among individuals). 348 To maximize the probability of the variant being present in as many animals as possible, we 349 also restricted SNVs to only those with MAF >0.25, which further reduced these numbers to needed to facilitate imputation on this chromosome, we selected 490 variants spaced ~100 kb 362 apart, from the set of 981 high-MAF variants on this chromosome, as described above (Fig.  363 2G). To characterize the set of dense markers to be imputed on this chromosome, we selected 364 4,920 variants spaced ~10 kb apart and which were not framework markers, from the set of 365 260,000 SNVs located on this chromosome (see above). We additionally removed a set of 578 366 markers that consistently performed poorly in imputation. This reduced and optimized set of 367 dense markers on chromosome 19 was used in all imputation analyses on this chromosome, in 368 order to limit the computational time required. 369 Using these chromosome-specific framework and dense marker sets, we evaluated the "Bottom 371 of Pedigree", the "Founders", and the "GIGI-Pick" strategies for selecting the 3 most informative 372 of the 9 individuals with WGS data, followed by imputation of dense markers into the remaining 373 13 individuals, based on their framework marker data from GBS. For both genotype-calling 374 methods, the GIGI-Pick selection strategy produced slightly higher median accuracy of imputed 375 genotypes than either of the other strategies, at 89.5% ("Most Likely" method, ML) or 90.1% 376 accuracy ("Threshold" method, THR), compared to median accuracy of 88.4% (ML) or 88.8% 377 (THR) in the "Bottom of Pedigree" strategy, and 88.6% (ML) or 87.6% (THR) in the "Founders" 378 strategy ( Fig. 3A-B). However, more individuals in the "GIGI-Pick" strategy displayed greater 379 genotype accuracy than in either of the other 2 strategies (interquartile range (IQR) from 89.5%-380 92.6% for GIGI-Pick, compared to 84.2-90.1% for "Bottom of Pedigree", and 85.9-89.5% for 381 "Founders"). While the difference in median accuracy estimated by both genotype calling 382 methods was <1%, 100% of genotypes were called in the ML method, while only 47.7% were 383 called using the THR method across all strategies. Thus, the ML method called genotypes at an 384 average 2,350 more markers per subject, while maintaining nearly identical accuracy. The GIGI-385 Pick strategy did produce one individual that consistently displayed much lower genotype 386 accuracy than all other individuals; individual D had only 72.9-75.6% of genotypes accurately 387 imputed, depending on calling method. Neither the "Bottom of Pedigree" nor the "Founders" 388 strategy produced any individuals with substantially lower genotype accuracy than other 389 pedigree members. Finally, we used the GIGI-Pick results to examine accuracy of imputed 390 singleton alleles, as defined in Methods. These rare alleles were imputed a total of 405 times, 391 representing 178 distinct alleles, with an accuracy of 93%. 392 393 Imputation accuracy on chromosome 19 across 9 sequencing ratios: Using the same 394 chromosome 19 framework and dense marker sets, we used the GIGI algorithm with the ML 395 genotype-calling method to impute our dense markers from 1-9 individuals with WGS into the 396 remaining 7-15 pedigree members, as described in Methods. Among all imputation scenarios 397 that added consecutively from 1 to 9 individuals with WGS, median accuracy among recipients 398 increased from 86.3% to 93.2%, while variation in accuracy decreased (IQR from 5% at N=1 399 WGS, to 1% at N=9 WGS). Variability in accuracy improved in a stepwise fashion; greater 400 variability tended to occur in parallel with an increase in median accuracy, but would improve 401 during the following scenario in which the greater accuracy was retained or further increased. 402 Individual D was consistently an outlier across 8 out of 9 scenarios, from 1 through 5 at 76% 403 accuracy, then increasing to 84% accuracy in scenarios 6-8, due to the inclusion of WGS data 404 from K, the child of D. Median accuracy surpassed 90% beginning at N=4 individuals with 405 WGS; although variation in accuracy continued to decrease across all remaining scenarios, only 406 slight gains in median accuracy were observed beyond this sequencing ratio (i.e., imputing from 407 4 individuals with WGS into 12 individuals with GBS) (Fig. 4). 408 409 Imputation of dense markers across the genome: To evaluate our imputation strategy across 410 the whole genome, we employed the same criteria outlined above to generate framework 411 marker sets for each of the 20 macaque autosomes. The number of framework markers per 412 chromosome ranged from 350 to 636, with mean spacing between framework markers among 413 all chromosomes of ~273 kb (108 kb-467 kb). At this stage, we expanded our dense marker set 414 to a total 14,384,988 SNVs across the macaque genome, by including SNVs discovered 415 previously from whole-genome sequence data in an additional 6 unrelated ONPRC animals (in 416 prep). Using the strategy identified by our analyses as the one most likely to maximize 417 genotype accuracy while minimizing overall costs, we used the first 4 individuals among our 9 418 with WGS ranked by the GIGI-Pick algorithm, and imputed our expanded dense marker data 419 into the remaining 12 pedigree members, based on the ML call method (Fig 5). Per 420 chromosome, median accuracy ranged from 85-88%; this accuracy is somewhat lower than the 421 ~92% accuracy achieved in our original experiment using chromosome 19. However, unlike our previous analysis that imputed a much smaller set of dense markers on chromosome 19, our 423 final genome-wide imputation included all known variants. As before, D had consistently lower 424 genotype accuracy at 73-77%. Individual E was also an outlier on multiple chromosomes, with 425 accuracy ranging from 79-84%. disease has yet to materialize. Since 2007, next-generation sequencing technology has 438 speeded the collection of genomic data at steadily decreasing cost, but only a relatively small 439 number of additional macaque genomes have been explored for variation, and none have yet 440 been systematically applied to the problem of human disease. This is unfortunate, given that 441 rhesus macaque colonies at many of the national primate research centers constitute a powerful 442 resource for genetic analysis of disease that is on par with many human genetic isolates, due to 443 their maintenance in large outbred and pedigreed colonies within a homogeneous environment. 444 Here, in order to catalyze the application of genomic data in macaques to the study of human 445 disease, we present an approach that will make feasible the collection of accurate, dense 446 genome-wide sequence data in large numbers of pedigreed macaques without the need for 447 expensive whole-genome sequence data on all individuals. 448 449 Our approach is based on using a low-cost reduced representation sequencing method 450 (genotyping-by-sequencing, GBS), to facilitate pedigree-based imputation of dense marker 451 genotypes from selected individuals with whole genome sequence data. In this study, we 452 evaluated the ability of 2 candidate restriction enzymes (BglII and PstI) to produce genomic 453 fragments for GBS, using both in silico and empirical methods. When compared to BglII, we 454 show that PstI produces substantially larger numbers of high-quality SNVs that are supported by 455 greater sequence read depth. Further, we found that PstI libraries provided sufficient coverage 456 over more than twice the number of high-quality variants needed to generate the sparse 457 "framework" markers required to support imputation. This is The selection of individuals for WGS that will maximize the accuracy of imputed genotypes 470 throughout the pedigree is a critical component of this approach. We compared , 471 a pedigree-based statistical approach to prioritizing subjects for WGS, to two other common 472 heuristic methods for selecting individuals for WGS, including sequencing only the most recent 473 generation of the pedigree ("Bottom of Pedigree"), and sequencing only pedigree founders ("Founders"). Due to the small size of our sample pedigree, we used 3 individuals with WGS to 475 impute genotypes at our streamlined set of dense markers, into 13 remaining individuals based 476 on their framework marker genotypes from GBS data. We show that while median accuracy is 477 only somewhat greater for the GIGI-Pick approach compared to the other two approaches, 478 many more individuals overall displayed higher accuracy using the GIGI-Pick selection method, 479 as reflected in the strong upward shift of the interquartile range. Importantly, rare alleles were 480 imputed with exceptionally high accuracy using the GIGI-Pick selection strategy, suggesting that 481 this strategy offers powerful support for downstream analysis of rare variant effects on complex 482 traits. It is possible that these 3 selection methods may perform differently for alternative 483 pedigree configurations, e.g., in a more shallow pedigree, sequencing founders or the most 484 recent pedigree members may provide information equivalent to the more formal strategy 485 implemented in GIGI-Pick. However, we note that the GIGI-Pick approach results in a clear 486 advantage even in our small pedigree that extends to only 2 generations, but which includes 487 many of the most common relationship types found in NHP colonies. Our results are consistent 488 with those of Cheung et al. [14], in that the GIGI-Pick selection approach substantially 489 outperformed both the "Bottom of Pedigree " and "Founders" (i.e.,"PRIMUS" in [14]) approaches 490 in the ability to impute common alleles, although our results indicate more consistent accuracy 491 with the "Founders" approach than with the "Bottom of Pedigree" approach. 492 493 Using the WGS individuals ranked in order of priority by the GIGI-Pick algorithm, we examined 494 the gain in accuracy of imputed genotypes throughout the pedigree achieved by the consecutive 495 addition of from 1-9 whole-genome sequences. Our results demonstrate that there are 496 excellent compromises available that balance sequencing costs and the ability to obtain dense 497 and accurate marker data. While the accuracy of imputed genotypes was greatest when using 498 all 9 individuals with WGS, most of this accuracy was achieved using the first 4 WGS 499 individuals, i.e., at 4 WGS individuals, median accuracy is at ~92.4% but increased only another 0.8% with the addition of the remaining 5 genomes. These results suggest that an optimal 501 tradeoff between the animals selected for WGS and GBS exists at the ratio of 1 individual 502 selected for WGS, per 3-5 relatives selected for GBS, a cost savings of ~67-83% over WGS of 503 all 6 individuals. We note that these estimates of accuracy were based on careful selection of an 504 optimal, and thus reduced, set of dense markers that were available on chromosome 19. While 505 this strategy was used deliberately to reduce the total computational time required for this study, 506 in our subsequent imputation of the full set of dense markers across the genome, median 507 accuracy only decreased by 5-8% for all chromosomes. 508

509
The increase in overall accuracy observed with additional WGS individuals was not shared 510 uniformly among all individuals in the pedigree. For example, while there was a large increase 511 in accuracy between 3 and 4 individuals with WGS, all of this change is due to increased 512 accuracy in A, a founder (see Fig. 1). In this scenario, this improvement is almost certainly due 513 to the inclusion of WGS data from F, a child of A. We note that D remained an outlier in the 514 distribution of genotype accuracy throughout virtually all imputation analyses based on the 515 ranking of WGS individuals by the GIGI-Pick algorithm. This may be due to the limited initial 516 selection of WGS individuals located in the far right lineage, i.e., only when K, P, and C are 517 added to J and used for imputation does accuracy rise for D. This result is consistent with the 518 GIGI-Pick approach, which balances the selection of closely related individuals within the 519 pedigree to facilitate phasing of genotypes, with the selection of more distant relatives to 520 increase the chance of observing unique founder alleles [14]. Because of this compromise, we 521 note that while sequencing only pedigree founders is not the best strategy for maximizing 522 accuracy, founders may remain unselected using the GIGI-Pick approach when the ability to 523 phase genotypes produces more expected allele calls than does the observation of unique 524 founder alleles, for a fixed number of selected individuals. This result also highlights the 525 importance that prior knowledge of phenotypes plays in selecting individuals for WGS. If traits of 526 interest are known to segregate in a particular lineage within the larger pedigree, it may be 527 advisable to manually assign either founders or a close descendant in that lineage for WGS, if 528 neither individual is selected using a more unbiased approach. 529 530 The imputation of dense, genome-wide genotypes with high accuracy will allow the unbiased 531 mapping of genetic variants in the macaque genome to disease traits, using either linkage or 532 association approaches. Both of these approaches are important tools in translational research, 533 and should further advance the understanding of human disease already made possible by 534 research in this species. Large pedigreed colonies of macaques, such as the ~4,500 macaques 535 found at the Oregon National Primate Research Center, provide an almost unequaled resource 536 for translational genetic research, due to their multi-generational pedigree structure and the 537 excess of rare and low-frequency variants expected to segregate within this pedigree. Rare and 538 low-frequency variants are expected to play a significant role in human disease [44-47], and we 539 have shown that we can impute these variants in the macaque genome with high accuracy and 540 at a reasonable cost, using the approach we outline here. Moreover, our findings suggest that 541 this approach can be modified to support specific research goals. For example, it may be 542 beneficial to take advantage of the less accurate but greater amount of information provided by 543 the full set of imputed dense markers during initial discovery of variants either linked to or 544 associated with a disease trait, while fine-mapping or replication of a putative trait locus might 545 employ a reduced, optimal set of dense markers likely to provide greater genotype accuracy 546 over a smaller region of interest. 547

548
In this study, we demonstrated that it is feasible to obtain comprehensive genome-wide variation 549 at a fraction of the cost of whole-genome sequencing using the GBS method and pedigree-550 based imputation, which we describe for the first time in a non-human primate genome. 551 However, imputed genotypes will only be as accurate as the underlying whole-genome 552 sequence data and the reference genome to which it is compared. The macaque genome has 553 undergone multiple revisions; however, it remains in draft form and is less complete than many 554 other common model organisms [17,43]. There are also extremely limited available data on 555 genetic variants in macaques, and no databases with comprehensive information on known 556 variants or population-level allele frequency information are publicly accessible. Both of these 557 factors present obstacles for accurate whole-genome variant calling in macaques, and will thus 558 reduce accuracy for any genotyping approach. Improvements in both of these areas are 559 urgently needed in order to fully realize the value of the macaque as a genetic model of human