Weighted single-step GWAS identified candidate genes associated with semen traits in a Duroc boar population

Background In the pig production industry, artificial insemination (AI) plays an important role in enlarging the beneficial impact of elite boars. Understanding the genetic architecture and detecting genetic markers associated with semen traits can help in improving genetic selection for such traits and accelerate genetic progress. In this study, we utilized a weighted single-step genome-wide association study (wssGWAS) procedure to detect genetic regions and further candidate genes associated with semen traits in a Duroc boar population. Overall, the full pedigree consists of 5284 pigs (12 generations), of which 2693 boars have semen data (143,113 ejaculations) and 1733 pigs were genotyped with 50 K single nucleotide polymorphism (SNP) array. Results Results show that the most significant genetic regions (0.4 Mb windows) explained approximately 2%~ 6% of the total genetic variances for the studied traits. Totally, the identified significant windows (windows explaining more than 1% of total genetic variances) explained 28.29, 35.31, 41.98, and 20.60% of genetic variances (not phenotypic variance) for number of sperm cells, sperm motility, sperm progressive motility, and total morphological abnormalities, respectively. Several genes that have been previously reported to be associated with mammal spermiogenesis, testes functioning, and male fertility were detected and treated as candidate genes for the traits of interest: Number of sperm cells, TDRD5, QSOX1, BLK, TIMP3, THRA, CSF3, and ZPBP1; Sperm motility, PPP2R2B, NEK2, NDRG, ADAM7, SKP2, and RNASET2; Sperm progressive motility, SH2B1, BLK, LAMB1, VPS4A, SPAG9, LCN2, and DNM1; Total morphological abnormalities, GHR, SELENOP, SLC16A5, SLC9A3R1, and DNAI2. Conclusions In conclusion, candidate genes associated with Duroc boars’ semen traits, including the number of sperm cells, sperm motility, sperm progressive motility, and total morphological abnormalities, were identified using wssGWAS. KEGG and GO enrichment analysis indicate that the identified candidate genes were enriched in biological processes and functional terms may be involved into spermiogenesis, testes functioning, and male fertility.


Background
In the pig production industry, artificial insemination (AI) plays an important role in enlarging the beneficial impact of elite boars. Sperm quality, affected by genetic and environmental factors simultaneously and with moderate to low heritability [1,2], is essential for guaranteeing the success of AI. For both academic researchers and AI station managers, understanding the genetic background and detecting genetic markers associated with semen traits can help in improving genetic selection for such traits and accelerate genetic progress. Biologically, the process of spermatogenesis, which mainly contains three steps including mitotic phase, meiotic phase, and spermiogenesis, involves complex coordination among many genes and different cell types and occurs in the seminiferous tubules of the testes [3].
Mutations in related genes and status changes in corresponding cells will potentially affect the sperm quality and male fertility. From the aspect of genetic control and animal breeding, detecting candidate genes responsible to sperm quality and fertility is an essential step.
With the fast development of sequencing technology and commercially availability of dense marker panels, researchers can precisely identify quantitative trait loci (QTL) by searching for association between genetic markers and phenotypic records, which is known as genome-wide association study (GWAS) [4]. Though many challenges, such as the need for efficient study design especially for replication efforts and technologies for capturing genetic variation, the missing heritability problem, reducing the bias introduced into a dataset, and utilizing of new resources available, remained to be addressed [5], GWAS has been successfully implemented in detecting genetic risk factors for human diseases and mapping QTL for economically important traits in both animal and plant breeding populations. Among the established GWAS approaches, the weighted single step GWAS (wssGWAS) [6] is preferable for association study in domesticated animals, for which large numbers of individuals are usually phenotyped but less genotyped. Technically, the wssGWAS calculates genomic estimated breeding values (GEBVs) of individuals by solving mixed model equations using H (a blend of pedigree derived relationship matrix A and weighted genomic relationship matrix G w , which is constructed in the manner of weighting SNPs according to their genetic variances, [7,8]) as relatedness matrix, and converts the GEBVs into marker effects in the genotyped subpopulation based on the underlying equivalency between marker effect models and breeding value models. Genetic variance of certain chromosome window is subsequently calculated as the variation of the genetic values possessed by SNPs located in that window (i.e., Var(Z window g window ), Z window and g window represent genotypes and marker effects of SNPs located in the window, respectively).
In several previous studies related to boar semen traits, associated genetic markers or candidate genes have been reported for both sperm quantity (i.e. number of sperm cells per ejaculation) and quality (i.e. sperm motility and morphological abnormalities) traits, [9][10][11][12][13]. However, the genetic background of breeds or boar populations, number of boars used for association study, and density of genetic marker panels varied a lot among the previous studies. The Duroc boars have been used as terminal sire in most of the modern three-way crossbreeding pig production system. While due to the relatively small numbers of Duroc boars in one herd and difficulty in phenotypic data collecting, seldom association studies have been conducted. In this study, we utilized wssGWAS to detect genetic regions and further candidate genes associated with semen traits in a Duroc boar population. In order to further understand the genetic control of semen traits, post GWAS analyses, KEGG and GO enrichment were performed to find biological processes and functional terms in which the candidate genes were involved.

Results
In this study, we identified genomic regions associated with semen traits in a Duroc boar population via wssGWAS [6]. In order to investigate the genetic background of the semen traits under consideration, we estimated the heritabilities of these traits by fitting model (1) using the pedigree derived numerator relationship matrix. Variance component and heritability estimations are presented in Table 1. The estimated heritabilities were 0.18, 0.16, 0.16, and 0.26 for the number of sperm cells, sperm motility, sperm progressive motility, and total morphological abnormalities, respectively. The heritability ranged from moderate to low, which were at the similar magnitude compared to Marques et al., (2018Marques et al., ( , 2017 while higher than that of Wolf (2010), in which relatively smaller Large White and Landrace boar populations were used for heritability estimation. Figure 1 shows the proportion of variances explained by each 0.4 Mb windows for the traits under study. Table 2 shows the first three most important QTL regions and the candidate genes, which have been previously reported to be associated with mammal spermiogenesis, testes functioning, and male fertility, for the four traits. Overall, the first three QTL regions together explained about 5.12~11.97% of the genetic variances of the semen traits under study ( Table 2). For each trait, the most important windows explained approximately 2%~6% of the total genetic variances ( Table 2). Furthermore, the identified windows (windows that explained more than 1% of total genetic variances) explained totally 28.29, 35.31, 41.98, and 20.60% of genetic variances (not phenotypic variance) for number of sperm cells, sperm motility, sperm progressive motility, and total morphological abnormalities, respectively ( Fig. 1 and Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4). From the aspect of pig breeding, the relatively high proportions of genetic variances explained by the identified QTL regions may imply the possibility of integrating these QTL mapping results into estimating of genetic breeding values of semen traits [14,15]. Table 3 shows the GO terms and KEGG pathways in which the identified genes enriched. Four GO terms and one KEGG pathway were enriched for the number of sperm cells. The enriched GO terms are involved in positive regulation of myeloid cell differentiation (GO: 0045639), steroid hormone mediated signaling pathway (GO:0043401), granulocyte colony-stimulating factor receptor binding (GO:0005130), and enzyme regulator activity (GO:0030234). The enriched KEGG pathway is involved in thyroid hormone signaling (ssc04919). For sperm motility, three GO terms were enriched for the candidate genes, while no KEGG pathway was targeted. The GO terms play roles in cellular calcium ion homeostasis (GO:0006874), protein phosphorylation (GO:0006468), and protein kinase activity (GO: 0004672). For progressive motility, seven GO terms and two KEGG pathways were enriched for the identified candidate genes. These GO terms and KEGG pathways are involved in nucleotide metabolic and biosynthesis processes. Three GO terms were enriched for the total morphological abnormalities, while no KEGG pathway was targeted. These GO terms are related to transition between slow and fast fiber (GO:0014886), sensory perception of sound (GO:0007605), endochondral ossification (GO: 0001958), cullin family protein binding (GO:0097602), RNA polymerase II transcription factor activity, and sequence-specific DNA binding (GO:0000981).

Discussion
In this study, we conducted association study for boar semen traits in a large Duroc population. To our knowledge, this is the first time that genome-wide association study results are reported in such a large Duroc boar population. Based on the literature (cited above), some of the identified candidate genes have been previously reported to be involved in mammal spermiogenesis, testes functioning, and male fertility. Among the candidate genes identified in this study, DNAI2 was previously  reported to be associated with sperm motility of large white boars [9], while no more genes overlapped with that identified in large white and landrace boars. Similarly, in the study conducted by Marques et al., [9], where wssGWAS was also used, no overlapping regions were identified between large white and landrace boars. However, gene network analysis revealed that genes identified in different lines were involved in the same biological processes [9]. Therefore, it is hard to draw the conclusion that the identified candidate genes in this study are specific to Duroc line. Nevertheness, functional validation experiments of homologous genes in human and model organisms (literature mentioned in the following paragraphs) in the literature provided evidence of role of the identified genes in semen traits. Considering genetic improvement of semen traits, the identified candidate genes may provide knowledge for developing genomic prediction models based on the trait-specific genomic best linear unbiased prediction (GBLUP) [15] by assigning higher genetic variances to genetic markers located in these regions. For number of sperm cells, 18 relevant QTL regions located on SSC2 (SSC: Sus scrofa chromosome), 3,5,9,10,12,14,15,16, and 17 were identified. Sixty-five genes were annotated in these genomic regions according to the National Center for Biotechnology Information (NCBI) (Additional file 1: Table S1). Among the identified genes, some has been previously reported to be associated with sperm biological function. The Tudor domain-containing protein 5 (TDRD5) and quiescin Q6/sulfhydryl oxidase 1 (QSOX1), located in the region of 121.15-121.95 Mb on SSC9, were considered as candidate gene for the number of sperm cells (Table 2). TDRDs are an evolutionarily conserved family of proteins involved in germ cell development, and TDRD5 is required for retrotransposon silencing, chromatoid body assembly, and spermiogenesis in mice [16]. The QSOX family was reported to have biological functions in male germ cell development [17]. BLK gene, located in the region of 14.49-15.29 Mb on SSC14, is an Src-related tyrosine kinase in the events leading to proper sperm formation. Previous study reported that BLK is involved in sperm function and integrity [18]. The members of tissue inhibitor of metalloproteinase (TIMP), among which TIMP3 was identified in the region of 11.71-12.51 Mb on SSC5, were reported to exist in human seminal plasma as complexes and may play roles in sperm function [19]. The thyroid hormone receptor alpha (THRA), identified in the region of 22.19-22.56 Mb on SSC12, is responsible for sertoli cell development, sperm production, and testis weight [20] by regulating the function of triiodothyronine (T3), which played a key role in Sertoli cell development [21]. Colony stimulating factor 3 (CSF3), another gene located in the region of 22.19-22.56 Mb on SSC12, is involved in reactive oxygen species (ROS) of sperm cells and associated with abnormalities in sperm histone transition [22]. A third gene located in the region of 22.19-22.56 Mb on SSC12, Zona pellucida binding protein 1 (ZPBP1), a spermatid and spermatozoon protein, is responsible for sperm morphology. Lin et al., (2007) reported that male mice lacking ZPBP1 were sterile, with abnormal round-headed sperm morphology and no forward sperm motility.
For sperm motility, 20 relevant QTL regions located on SSC1, 2, 3, 4, 7, 8,9,12,13,14,15,16, and 17 were identified. Fifty-nine genes were annotated in these genomic regions (Additional file 2: Table S2). Within the window explained the largest proportion of genetic variance of sperm motility (1.29-2.09 Mb on SSC13), six genes were targeted, while none of them have ever been previously reported to be associated with semen traits. The protein phosphatase 2 regulatory subunit B beta gene (PPP2R2B), located in the region of 147.71-148.51 Mb on SSC2, was indicated to be associated with biology of sperm in a Co-Expression Network Analysis conducted in Bos taurus and Sus scrofa [23]. For the NIMA related kinase 2 (NEK2) gene, located in the region of 131.55-132. 35 Mb on SSC9 and showed biased expression in reproductive tissues (testis of human and ovary of swine), several studies have demonstrated its role in the process of meiotic cell cycle and the first meiotic division in mouse spermatocytes [24][25][26][27]. Ndrp3, a member of the N-myc downstream regulated gene family (NDRG), is enhanced expressed specifically in germ cells and involved in regulation of the male meiosis in mouse [18]. The ADAM metallopeptidase domain 7 (ADAM7), which located in the region of 8.49-8.86 Mb on SSC14, is associated with epididymosomes, integrated into sperm plasma membrane, and related to fertility and sperm integrity [28,29]. S-phase kinase associated protein 2 (SKP2) was identified in the region of 21.48-21.86 Mb on SSC16. The Skp2 protein is the receptor subunit of an SCF-type ubiquitin ligase and is a major regulator of the progression of cells into S phase of the cell cycle, Skp2-deficient mice shown reduction in the number of mature gametes and thus decreased fertility [30]. Moreover, Skp2 was reported to be involved in the manipulation of boar sperm production [31]. The ribonuclease T2 (RNASET2) gene, located in the region of 1.77-2.16 Mb on SSC1, have been reported to be involved in sperm motility impairment [32] and interacted with AKAP4 in human sperm tail and subsequently reduced sperm motility by suppressing PKA/PI3K/calcium signaling pathways [33].
For progressive motility, 24 relevant QTL regions located on SSC1, 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16, and 17 were identified (Additional file 3: Table S3). Seventy-three genes were annotated in these regions. Within the first three genomic regions that explained the largest proportions of genetic variance, 11 genes were identified but none of them have been previously reported to be directly associated with semen traits in the literatures. However, in the region of 18.51-18.87 Mb on SSC3, SH2B adaptor protein 1 (SH2B1) was identified. Previous study in mice have shown that expression of Sh2b1 on the mRNA level in obesity mice was decreased, which resulted in significant reduction in male fertility and exacerbated reproductive toxicity and germ cell mutagenicity [34]. Src family tyrosine kinase (BLK), also identified for the number of sperm cells, is an Src-related tyrosine kinases in the events leading to proper sperm formation, was reported to potentially involved in sperm function and integrity [18]. Laminin subunit beta 1 (LAMB1) gene, which directs the expression of beta-galactosidase during development of the mouse testis and ovary [35], was identified in the region of 107.72-108.07 Mb on SSC9. Acuolar protein sorting 4 homolog A (VPS4A) gene, located in the region of 17.44-17.8 Mb on SSC6, was previously reported to be involved in human sperm acrosome reaction [36]. Sperm associated antigen 9 (SPAG9) gene, which codes an acrosomal molecule that plays role in acrosome reaction and sperm-egg binding [37], was identified in the region of 27.39-27.78 Mb on SSC12. Lipocalin 2 (LCN2) gene, located in the region of 268.37-268.73 Mb on SSC1, is involved in sperm maturation and fertilization [38]. Dynamin 1 (DNM1), another gene located in the same region on SSC1, acts as a key mediator of sperm acrosome formation and function, is essential for mammalian spermatogenesis [39].
For total morphological abnormalities, 16 relevant QTL regions located on SSC1, 3,4,8,9,10,11,12,13,15,16, and 17 were identified (Additional file 4: Table  S4). Sixty-seven genes were annotated in these regions according to NCBI. The growth hormone receptor (GHR) gene, which shows significant association with embryos survival rate and involved in fertilization rate of dairy cattle [40], was identified in the region of 26.76-27.56 Mb on SSC16 (Table 2). Selenoprotein P (SELENOP), another gene located in the same region, is abundantly expressed in the testis and affected the sperm quality of rats [41]. Solute carrier family 16 member 5 (SLC16A5) gene, for which a protein-coding variation is associated with cisplatin-induced ototoxic effects and involved in human germ cell testicular cancer [42], was identified in the region of 6.20-7.00 Mb on SSC12. SLC9A3 regulator 1 (SLC9A3R1), another gene located in the same region, is important for mouse sperm capacitation through functional interactions with SLC26A3, SLC26A6, and CFTR [43]. Dynein axonemal intermediate chain 2 (DNAI2) gene, of which mutations may resulted in defects of sperm flagella in human [9,44,45], was identified on SSC12.

Conclusions
In conclusion, we identified candidate genes associated with the number of sperm cells, sperm motility, sperm progressive motility, and total morphological abnormalities in a Duroc boar population using wssGWAS. Functional validation experiments of homologous genes in human and model organisms (literature mentioned in the Discussion section) in the literature provided evidence of role of the identified genes in semen traits. The identified candidate genes may provide knowledge for developing genomic evaluation models by assigning higher weights to genetic markers located in these regions.

Phenotypes, genotypes, and pedigree
The studied Duroc boar population was a mixed herd consisted of boars from pure Danish lines, pure American lines, and progenies from the crosses between Danish and American Duroc lines. For more detailed description of the conditions where the boars were raised, please refer to our recently published paper [46]. Overall, 5284 pigs (12 generations) were included in the pedigree, of which 2693 boars with semen trait records. Four traits, total sperm morphological abnormalities (ABN), number of sperm cells (Ncells), sperm motility (MOT, defined as proportion of moving sperms), and sperm progressive motility (PMOT, defined as proportion of sperms moving in strait line) were recorded in the years from 2015 to 2018. The fresh semen was evaluated immediately after ejaculate using the UltiMateTM CASA system (Hamilton Thorne Inc., Beverly, MA, USA). Number of sperm cells was calculated as the product of the semen volume (mL) and concentration (10 6 mL − 1 ). For the traits of total morphological abnormalities, number of sperm cells, and motility, 143,113 individual ejaculates (53 ejaculates per boar on average) of 2693 boars were used for the association study (Table 4). For the trait progressive motility, 29,526 observations of 1304 out of the 2693 boars were used for association study (Table 4).
Total DNA of 1733 pigs were extracted (semen of boars and ear tissue samples of sows) using the genome extraction kits produced by Wuhan NanoMagBio Technology Co., Ltd. Those pigs were genotyped using the GGP 50 k SNP array (GeneSeek, US), which contains 50, 703 SNPs. Overall, 1627 out of the genotyped pigs were related to boars with phenotypic records through pedigree and used for further analysis. SNPs, for which the Sscrofa10.2 physical positions were provided by the SNP array producer, were remapped to Sscrofa11.1 using the genome remapping procedure available in NCBI (https://www.ncbi.nlm.nih.gov/). SNPs with unknown positions (9405 SNPs) and that unable to map to Sscrofa11.1 (1842 SNPs) were removed from the dataset. Moreover, SNPs on sex chromosomes (2469 SNPs) were excluded to avoid sex bias in constructing genomic relationship matrix and calculating genetic breeding values (1231 boars and 392 sows were included in the genotype file and used for constructing the genomic relationship matrix). Autosome SNPs were filtered via plink [47] using the following criteria: individual call rate ≥90% (4 pigs removed); SNP call rate ≥90% (673 SNPs removed); minor allele frequency ≥0.01 (5345 SNPs removed); Hardy-Weinberg equilibrium p-value ≥10 −6 (2680 SNPs removed). After quality control, 1623 pigs, among which 1231 had semen evaluation data, and 28,289 SNPs remained for further analysis. Missing genotypes were finally imputed using Beagle software (version 4.1) [48,49].

Statistical model
The weighted single step GWAS procedure [6,50] was used for association study. Briefly, genomic estimated breeding values (GEBVs) were estimated via a weighted single step GBLUP and marker effects were calculated from the GEBVs. QTL regions were selected based on the proportion of genetic variance explained by certain chromosome segments.
The following repeatability model was used in this study.
where y denoted the response; X, Z, and W denoted the design matrices; b was fixed effects (overall mean and year-season of ejaculation); a$Nð0; Hσ 2 a Þ was a vector of genetic values; p$Nð0; Iσ 2 p Þ was a vector of random permanent boar effect; covariates Age and Intv denoted the age of the boars at months when ejaculating and collection interval (days), respectively. e$Nð0; Iσ 2 e Þ denoted the residuals. I denoted the identity matrix. The inverse of H matrix, a blend of pedigree and genetic marker derived matrices [7,8], was calculated as Variance components and further heritability of the studied traits were obtained via average information restricted maximum likelihood (AI-REML) [54] procedure using pedigree. Marker effects and further weights for constructing G were calculated in an iterative way used by Wang et al., [6,50]. An iteration procedure with the following steps was used for association study. Step3: Marker effects calculation, obtain marker effects viaĝ ðtÞ ¼ λD ðtÞ Z 0 G −1 ðtÞâ ; Step4: SNP weights calculation, get SNP weights for the next iteration via d iðtþ1Þ ¼ĝ 2 iðtÞ 2p i ð1−p i Þ; Step5: SNP weights normalization, rescale the weights to keep the total genetic variance constant via D ðtþ1Þ ¼ trðD ð1Þ Þ trðD ðtþ1Þ Þ D ðtþ1Þ ; Step 6: Weighted G construction, calculate G for the next iteration via G (t + 1) = λZD (t + 1) Z ′ ; Step7: Let t = t + 1 and loop to step 2.
The procedure was run for three iterations, as used in Wang et al., [50] and Marques et al., [9], and marker effects obtained from the third iteration were used for calculating proportions of genetic variances explained by The iteration steps mentioned above were run with BLUF90 software families [51] in a iterative way, AIR-EMLF90 for variance components estimation, BLUPF90 for GEBVs calculation, and postGSf90 [56] for marker effects converting, respectively.
Candidate genes detection and functional enrichment analysis QTL regions were selected according to genetic variance of chromosome windows. Windows explained higher than 1% genetic variances were selected as candidate QTL regions, within which candidate genes were searched. The threshold of 1% was set based on literature [9,[57][58][59] and the expected average genetic variance explained by single chromosome segment (0.027, 100% divided by 3707 windows). Consecutive windows with midpoints less than 0.4 Mb apart were treated as overlapped and merged. The first three windows that explained the largest amount of genetic variances were further extended to 0.4 Mb flanking regions of the midpoints on both upstream and downstream.