Lethal Haplotypes and Candidate Causal Mutations in Angus Cattle

Background If unmanaged, high rates of inbreeding in livestock populations adversely impact their reproductive fitness. In beef cattle, historical selection strategies have increased the frequency of several segregating fatal autosomal recessive polymorphisms. Selective breeding has also decreased the extent of haplotypic diversity genome-wide. By identifying haplotypes for which homozygotes are not observed but would be expected based on their frequency, developmentally lethal recessive loci can be localized. This analysis comes without the need for observation of the loss-associated phenotype (e.g., failure to implant, first trimester abortion, deformity at birth). In this study, haplotypes were estimated for 3,961 registered Angus individuals using 52,545 SNP loci using findhap v2, which exploited the complex pedigree among the individuals in this population. Results Seven loci were detected to possess haplotypes that were not observed in homozygous form despite a sufficiently high frequency and pedigree-based expectation of homozygote occurrence. These haplotypes were identified as candidates for harboring autosomal recessive lethal alleles. Of the genotyped individuals, 109 were resequenced to an average 27X depth of coverage to identify putative loss-of-function alleles genome-wide and had variants called using a custom in-house developed pipeline. For the candidate lethal-harboring haplotypes present in these bulls, sequence-called genotypes were used to identify concordant variants. In addition, whole-genome sequence imputation of variants was performed into the set of 3,961 genotyped animals using the 109 resequenced animals to identify candidate lethal recessive variants at the seven loci. Conclusions Selective breeding programs could utilize the predicted lethal haplotypes associated with SNP genotypes. Sequencing and other methods for identifying the causal variants underlying these haplotypes can allow for more efficient methods of management such as gene editing. These two methods in total will reduce the negative impacts of inbreeding on fertility and maximize overall genetic gains.

developmentally lethal recessive loci can be localized. This analysis comes without 23 the need for observation of the loss-associated phenotype (e.g., failure to implant, 24 first trimester abortion, deformity at birth). In this study, haplotypes were 25 estimated for 3,961 registered Angus individuals using 52,545 SNP loci using 26 findhap v2, which exploited the complex pedigree among the individuals in this 27 population. 28 Results: Seven loci were detected to possess haplotypes that were not observed in 29 homozygous form despite a sufficiently high frequency and pedigree-based 30 expectation of homozygote occurrence. These haplotypes were identified as 31 candidates for harboring autosomal recessive lethal alleles. Of the genotyped 32 individuals, 109 were resequenced to an average 27X depth of coverage to identify 33 putative loss-of-function alleles genome-wide and had variants called using a 34 custom in-house developed pipeline. For the candidate lethal-harboring haplotypes 35 present in these bulls, sequence-called genotypes were used to identify concordant 36 variants. In addition, whole-genome sequence imputation of variants was 37 grandsire. Maternal genotypes were rarely available as DNA was primarily sourced 162 from cryopreserved semen [2]. This family structure allowed the testing of 163 segregation distortion when both the sire and maternal grandsire were 164 heterozygous for the same haplotype. For each sliding window of 20 markers if a 165 haplotype was never observed in the homozygous state and had a sample frequency 166 of greater than 2%, the number of families for which the sire and maternal 167 grandsire were both heterozygous was counted. The probability of observing at 168 least one homozygote was then calculated based on the count of patrios and the 169 assumption of selective neutrality. The probability that no homozygous hh 170 haplotypes are observed in the progeny of C patrios when both the sire and 171 maternal grand-sire are Hh heterozygotes and the h haplotype does not harbor a 172 selected deleterious allele is (0.875 -0.25q) C where q is the frequency of the h 173 haplotype in the population. There were 2,480 patrios represented in the sample. 174 Regions of the genome with an identified deficit of homozygotes for a specific 175 haplotype were examined for underlying genes using pybedtools and the UMD3.1 176 genome assembly annotation run 104 [14,15]. 177

Generation of Sequence Data 178
To further analyze the variation within the genomic regions harboring haplotypes 179 that were deficient for homozygotes, the whole-genome sequences of 109 animals 180 from this population were examined [20]. These animals were selected for 181 sequencing based upon their impact on the breed assessed by the expected numbers 182 of genome equivalents (progeny have 0.5, grandprogeny have 0. 25 was not sufficient to run the population allele frequency or patrio analyses. 208

Imputation of BovineSNP50 Genotypes to Whole Genome Sequence Variation 209
We imputed the BovineSNP50 genotypes for this population to whole-genome 210 sequence level variation in order to identify potential lethal variants, using the 109 211 sequenced animals as the reference population. We selected 24,974,785 SNPs from 212 the full set that had been identified in these animals. We first included SNPs found in 213 the 109 sequenced Angus bulls that were biallelic and located within gene 214 boundaries (777,432), UTRs (33,379) or that were splice site variants (636,492). 215 These variants spanned the allele frequency spectrum but were identified at high 216 sequence coverage. To enable imputation we also included 23,527,482 variants that 217 had been independently identified in run 5 of the 1000 Bull Genomes project [23]. 218 These variants represent filtered, high quality segregating sites identified from the 219 whole genome sequences of 1,578 animals from multiple taurine breeds including 220 Angus. Genotypes called for the 24,974,785 variable sites genome-wide in the 109 221 registered Angus bulls were used as the reference set for whole-genome sequence 222 imputation of the BovineSNP50 data using Fimpute [24]. 223 The imputed genotypes were individually analyzed for the absence of 224 homozygotes for alleles present within each of the candidate haplotyped loci using 225 the frequency and pedigree approaches. We first identified high frequency variants 226 for which no homozygous individuals were predicted. The pedigree analysis was 227 also performed for all candidate variants identified in the frequency analysis. 228 Variants identified by either of these processes were characterized using the variant 229 effect predictor release 79 [25]. 230

Identification of Putative Lethal Haplotypes 232
We identified seven haplotypes with a pattern of inheritance in U.S. registered 233 Angus cattle that suggests that they each harbor an autosomal recessive lethal allele 234 (Table 1). Using a binomial distribution for the number of observed homozygotes in 235 the progeny of C patrios, we calculated the probability of observing no homozygotes 236 when each haplotype was selectively neutral. We selected haplotypes as putatively 237 harboring autosomal recessive lethals when the probability of observing no 238 homozygotes was less than 0.02. This threshold for statistical significance provides 239 considerable confidence that the lack of homozygosity for these haplotypes did not 240 occur by chance alone. 241

Sensitivity to Window Size 242
We evaluated the sensitivity of identification of these marker-based haplotypes to 243 window size and concluded that a window size of 20 contiguous BovineSNP50 244 markers was appropriate for capturing the haplotypic diversity within the 245 population ( Figure 1). This window size appears to discriminate between 246 haplotypes that are identical by descent (IBD) and those that are identical by state 247 (IBS). Our analysis shows that the number of common haplotypes detected rises as 248 window size increases, but begins to plateau at 20 markers. Considering the 249 moderate marker density of the BovineSNP50 (1 SNP per 50 kb), haplotypes that 250 are defined by only two markers are assumed to be IBS for the purposes of analysis 251 but likely actually represent a number of distinct haplotypes at the level of genome 252 sequence. As the window size increases, the likelihood increases that two 253 haplotypes found in different individuals that are IBS are also IBD and are thus 254 concordant at the level of sequence variation. However, with large window sizes, 255 recombination may lead to a lethal variant being present on more than one 256 haplotype, thus decreasing the power of the analysis. Indeed, as window size 257 increases, the overall rate of individuals homozygous for any haplotype declined. 258 The window size selected for this study appears to achieve an appropriate balance 259 of genome-wide homozygosity and rate of occurrence of high frequency haplotypes 260 ( Figure 1). Slight changes in the haplotype window size did not affect the detection 261 of the 7 haplotypes reported in Table 1.  262 We were also able to validate IBD status of these regions by an examination 263 of the sequence data generated for animals that were predicted to be carriers of 264 identical haplotypes. For each of the loci predicted to harbor recessive lethal 265 haplotypes (Table 1), we identified the bulls among the 109 sequenced animals that 266 were predicted to be carriers of each putative lethal haplotype and computed the 267 pairwise rate of opposing homozygous sequenced sites between all pairs of carrier 268 animals. For example, for the haplotype on chromosome 29, with 16 predicted 269 carriers we made 120 pairwise comparisons using all sequence called variant 270 genotypes within the haplotype coordinates ( Figure 2). Amongst individuals that 271 share a common haplotype, the alternate homozygote rate is related to the error 272 rate for sequence-based genotype calls for heterozygotes. The rate observed in our 273 predicted carriers within our 6 testable haplotype regions was similar to the rate for 274 sire-son pairs. One region (Chr 1) had only one sequenced predicted carrier and 275 could not be evaluated. The majority of the carrier pair comparisons with rates of 276 opposing homozygous sequenced sites >0.01 were caused by the inclusion in the 277 analysis of 3 individuals that had been sequenced to averages depths of < 10X. 278 Within the haplotype, opposing homozygous sequenced site rates were markedly 279 lower for predicted carriers than for randomly selected individuals. When predicted 280 carriers for a lethal haplotype were analyzed for a randomly selected 20 marker 281 region elsewhere in the genome, they had opposing homozygote rates that were 282 similar to those of the unrelated sire pairs. This indicates that the haplotypes 283 generated from the BovineSNP50 data successfully identified genomic regions that 284 were IBD at the level of the genome sequence. We also observed that this method is 285 sensitive to the depth of sequence coverage due to the inaccurate identification of 286 heterozygous loci as being homozygous when the alternate allele was never 287 sequenced. In total, 16 animals had depths of sequence coverage of < 10X and most 288 of the remaining animals (70) had > 20X. For a Sire-Son pair, low sequence coverage 289 for at least one member of the pair led to a rate of homozygous inconsistency that 290 was increased by an order of magnitude. 291

Haplotype Carriers 293
To identify the causal variants underlying these putatively lethal haplotypes, we 294 first directly examined resequencing data from bulls that were predicted to be 295 carriers of the lethal haplotypes. Among the 109 sequenced bulls up to 21 animals 296 were predicted to be carriers of each of the BovineSNP50 putatively lethal 297 haplotypes reported in Table 1. Within these seven genomic windows, we identified 298 candidate variants that were never homozygous in the 109 sequenced bulls. 299 However, none were observed to be exclusively heterozygous in the predicted 300 carrier animals. That is, all of the alleles found to be heterozygous in all of the 301 predicted carriers were also observed to be heterozygous in animals that did not 302 carry the predicted lethal haplotype. A recessive lethal variant could be 303 heterozygous in animals that were not carriers of the putatively lethal haplotype if 304 the mutation is sufficiently old that recombination has occurred relative to the 305 haplotype on which the mutation occurred. Table 1 reports, for each genomic region 306 predicted to harbor a lethal haplotype, the number of variants that were 307 heterozygous in all predicted haplotype carriers and that were never found to be 308 homozygous in any of the 109 resequenced bulls. For instance, on chromosome 15, 309 an intronic variant in SSRP1 was found in 9 of the 10 sequenced bulls that were 310 predicted to carry the putatively lethal haplotype, and in 25 of all 109 sequenced 311 Angus bulls, but was never found to be homozygous. Although the variant is 312 intronic, with no expected impact on splice site variation, it resides in an interesting 313 candidate gene as SSRP1 is essential for mouse embryonic development [26]. 314 This analysis was also conducted after excluding the 16 animals sequenced to 315 low average sequence depth of coverage (< 10X). Genotypes for these animals are 316 likely to contain false positive homozygotes at many true heterozygous sites. This 317 resulted in the detection of additional concordant loci but again none were exclusive 318 to the predicted carriers. The chromosome 29 locus had 12 additional variants 319 identified that were located in genes such as CAPN, NRXN2, PACS1 and PP25B. 320 However, none of these mutations are in exons or had other interpretations in 321 Variant Effect Predictor (VEP) and all 12 were heterozygous in 18 animals, including 322 4 that were not predicted by their marker data to be carriers. Thus, the region 323 appears to comprise a large consistent haplotype with no one particular variant 324 being simply implicated as causal for lethality. The locus on chromosome 4 had 118 325 variants identified as being heterozygous in haplotype carriers following the 326 removal of the low sequence coverage animals, but none were predicted to alter 327 protein amino acid sequences. 328 The locus on chromosome 1 for which we predicted a lethal haplotype 329 contains only one gene, GBE1 that encodes a protein that catalyzes the branching of haplotype's effect. In this study, a total of 12,020 haplotypes were found genome-404 wide (including partially overlapping haplotypes) that occurred at a MAF ≥ 2% but 405 were never found as homozygotes. Assuming random mating, a haplotype at this 406 frequency is expected to have, on average, only 1.5 homozygotes in our Angus 407 sample. However, the existence of non-random mating within this population could 408 substantially decrease the likelihood of observing homozygotes, and explain why so 409 many of these regions were observed. The patrio analysis directly incorporates the 410 matings that created the population to detect deviations from selective neutrality in 411 progeny genotypes. However, this approach is limited by the availability of 412 genotyped patrios. In breeds where insemination and pregnancy records are more 413 detailed, these have been crucial for validating putative lethal haplotypes [14]. 414 In the absence of these data, both the MateSel approach and the index 415 adjustment could be adapted to account for the uncertainty of the lethality of the 416 predicted allele or haplotype. In addition to validated haplotypes, and the high 417 confidence haplotypes that we observe here, this approach could enable the 418 incorporation of haplotypes into the selection scheme that appear to be lethal but 419 that are at low frequency in the population. There are now Angus pedigrees in 420 which hundreds of thousands of animals have been genotyped world-wide and an 421 analysis of these data would likely improve the resolution of lethal haplotype 422 detection and could also identify many rarer variants [15]. If these lethal alleles are 423 individually rare but each individual carries many of them, recessive lethals could 424 affect a substantial portion of pregnancies. 425

False Negatives 426
Even with adequate sample sizes to ensure statistical power, there are limitations to 427 the methods that we have employed. The approaches employed will only detect 428 recessive alleles that are perfectly concordant with a BovineSNP50 haplotype. If a 429 recent autosomal recessive lethal mutation has occurred on a common haplotype, 430 the population will comprise haplotypes harboring either the lethal mutation or the 431 wild type allele and homozygotes that include the wild type allele will be observed. deletion. This type of mutation has previously been associated with lethal recessive 453 diseases in cattle [7]. Additionally, lethal alleles with incomplete penetrance will not 454 be captured by our analyses. Other errors in genotyping, phasing or imputation also 455 likely contribute to reductions in power to detect lethal haplotypes. 456 Using sequence data 457 One potential solution to the limitations of array genotype data is to analyze 458 sequence-derived and/or imputed genotypes. These data may help with the 459 management of putative recessive lethal alleles. True causal variants can be tracked 460 more effectively than marker haplotypes. When the causative alleles have been 461 identified, gene editing may also present an efficient means of reducing the genetic 462 load of elite sires in a manner that is complementary to the current breeding system 463 [37]. 464 Sequence data also has potential advantages for the detection of recessive 465 loci. If appropriately processed to capture SNPs, large and small indels and 466 structural variants, they directly represent the pool of all recessive deleterious 467 alleles. However, in practice, sample sizes have been small and the identification of 468 large indels, particularly insertions, and complex structural variants has been 469 challenging. Our analysis of sequence data failed to identify any candidate causative 470 mutations in the marker-based haplotypes that were predicted to be lethal. The 471 sample size for the sequenced animals was not sufficient to conduct the frequency 472 or pedigree analysis with the genome-wide sequence variants. Furthermore, we did 473 not attempt to analyze many types of complex variation, such as large indels or 474 structural variants [38]. Large structural variants are enriched for deleterious 475 variation but can be complex to analyze with short read data [39]. Alternative 476 analyses of the sequence data to identify these variants or the use of methods which 477 generate longer reads may be necessary to capture the causative variants. 478 In this study we did not detect a particular variant within a putative 479 haplotype that was likely to cause a recessive lethal phenotype. However, the gene that an affected animal would be selected as a sire or dam. They would therefore be 495 highly unlikely to be included in our genotyped sample of Angus cattle. However, 496 identifying homozygous calves from those produced by mating carriers and 497 assaying their GBE1 functionality might be possible. 498

Mapping Candidate Variants without Sequencing 499
Rather than generating expensive sequence data for identifying recessive lethals, 500 two strategies might be useful: assay development and imputation. Novel variants of 501 all classes that are detected by sequencing can readily be incorporated onto 502 commercial genotyping platforms. This expedites fine-mapping within a known 503 lethal haplotype in a commercial population. Variants that had predicted deleterious 504 functional impacts based on bioinformatic analysis would also be excellent 505 candidates for inclusion on commercial genotyping platforms. 506 Imputation accuracy, particularly for rare variants, may not be sufficient to 507 identify candidate segregating recessive lethals. We previously analyzed the 508 accuracy of imputation using variants from Run4 of the 1000 bull genomes project 509 as a reference for our Angus BovineSNP50 genotyped population using the same 510 imputation methods that were used in this study [20]. Our 109 sequenced animals 511 had their BovineSNP50 genotypes imputed to the 1000 Bull Genomes Run4 512 sequence reference set as well as variants called directly from their whole-genome 513 sequences. Comparing these two sets of genotypes revealed correlations between 514 genotypes in the range of 80-90% for common variants, and 60-80% for variants 515 with MAF ≤ 10% [20]. We would expect the causal variants underlying these lethal 516 haplotypes to fall in the rare, more inaccurately imputed frequency class. This 517 greatly complicates the utility of imputation for this application, and the results of 518 imputation presented here are not appropriate for application to breeding 519 decisions. Had an interesting candidate locus with a plausible biological mechanism 520 emerged, it would have been a good target for further confirmation and possibly 521 immediate use. More sophisticated sequence imputation methods that provide 522 higher imputation accuracies across the allele frequency spectrum are now 523 becoming available. These have been used in human studies to identify individuals 524 that are homozygous for rare variants with predicted deleterious effects [16]. 525 Applying these methods to livestock populations with extensively described and 526 genotyped pedigrees is feasible, and may prove useful for fine-mapping within 527 candidate haplotypes. 528 these haplotypes have not been directly observed, but may be inconspicuous such as 535 in the event of early embryonic loss or may be unreported defects leading to the loss 536 of the calf. Efforts to identify causal mutations with a clear molecular impact from 537 sequence data were unsuccessful but interesting candidate genes such as GBE1 were 538 identified. Further validation of the impact of these haplotypes on fertility and the 539 direct observation of calves that are homozygous for these haplotypes could reveal 540 interesting biology. Our capacity to detect these loci will continue to improve as 541 increasingly large numbers of animals are genotyped and sequenced. The quality of 542 the reference genome assembly and methods for characterizing and imputing 543 structural variants are also improving and will improve the quality of this type of 544 analysis. Eventually, we will identify dozens of deleterious recessive loci, and can 545 use chip-based genotyping to manage matings, track alleles through lineages and 546 potentially use gene editing to remove them from elite animals. 547 548 Declarations 549 Ethics approval and consent to participate 550 The genotype data described in this manuscript was previously analyzed or 551 collected from commercially generated animal semen; as such no ethics or animal 552 welfare approval was required. 553

Consent for publication 554
Not applicable 555 Availability of data and material 556 Genotypes are available to scientists interested in non-commercial research upon 557 signing a Materials Transfer Agreement (MTA). All sequence data will be deposited 558 under NCBI Bioproject Accession PRJNA343262. 559