A genome-wide single nucleotide polymorphism and copy number variation analysis for number of piglets born alive

Background In this study we integrated the CNV (copy number variation) and WssGWAS (weighted single-step approach for genome-wide association) analyses to increase the knowledge about number of piglets born alive, an economically important reproductive trait with significant impact on production efficiency of pigs. Results A total of 3892 samples were genotyped with the Porcine SNP80 BeadChip. After quality control, a total of 57,962 high-quality SNPs from 3520 Duroc pigs were retained. The PennCNV algorithm identified 46,118 CNVs, which were aggregated by overlapping in 425 CNV regions (CNVRs) ranging from 2.5 Kb to 9718.4 Kb and covering 197 Mb (~ 7.01%) of the pig autosomal genome. The WssGWAS identified 16 genomic regions explaining more than 1% of the additive genetic variance for number of piglets born alive. The overlap between CNVR and WssGWAS analyses identified common regions on SSC2 (4.2–5.2 Mb), SSC3 (3.9–4.9 Mb), SSC12 (56.6–57.6 Mb), and SSC17 (17.3–18.3 Mb). Those regions are known for harboring important causative variants for pig reproductive traits based on their crucial functions in fertilization, development of gametes and embryos. Functional analysis by the Panther software identified 13 gene ontology biological processes significantly represented in this study such as reproduction, developmental process, cellular component organization or biogenesis, and immune system process, which plays relevant roles in swine reproductive traits. Conclusion Our research helps to improve the understanding of the genetic architecture of number of piglets born alive, given that the combination of GWAS and CNV analyses allows for a more efficient identification of the genomic regions and biological processes associated with this trait in Duroc pigs. Pig breeding programs could potentially benefit from a more accurate discovery of important genomic regions. Electronic supplementary material The online version of this article (10.1186/s12864-019-5687-0) contains supplementary material, which is available to authorized users.


Background
Reproductive performance in pig production systems is usually quantified by several economically important traits. The number of piglets born alive is an important trait in pig breeding programs due to its significant impact on the production efficiency; however, this is a difficult trait to improve because of low prediction accuracy and heritability [1][2][3]. Although the heritability of reproductive traits is moderate to low, genetic improvements can be obtained using genomic tools to explore the chromosomal regions and genes that explain the variation in reproductive traits. In addition to that, genomic acts as an extra source of information increasing prediction accuracy even for lowly heritable traits [4].
Until now, a total of 27,465 quantitative trait loci (QTL) have been mapped in the porcine genome for 663 different traits (Pig QTLdb Release 35). From about 2058 QTL described for reproduction traits, a total of 1004 QTL were described for litter traits, of which 163 QTL are related to number of piglets born alive [5].
Studies have reported the association of several genes and genomic regions with number of piglets born alive in several pig breeds [1][2][3][6][7][8][9][10]. Genome-wide association studies (GWAS) became very powerful tools to investigate genetic architecture of economically important traits in livestock, allowing detection of genomic regions and genes controlling polygenic traits related to reproductive performance in pigs [2,3,8].
The combination of GWAS with other genomic approaches can provide a better understanding about genes and pathways involved in complex traits [11]. In several livestock species, it has been demonstrated useful to integrate GWAS and copy number variation (CNV) analysis to advance the knowledge of economically important complex traits [12,13].
Structural variants such as CNV represent an important source of genomic variation in mammalian genomes. They can be defined as segments of DNA (conventionally > 1 Kb) that differ in number of copies compared to a reference genome [14]. Compared with SNPs, CNVs cover wider chromosomal regions and may potentially be responsible for changes in gene structure, modifications in gene regulation, changes in gene dosage, and exposing recessive alleles, resulting in large phenotypic effects. A well-characterized phenotypic variation affected by CNV in pig is the white coat phenotype generated by the duplication of the KIT gene [15].
In animal breeding and genetics, the commercial populations are usually large and comprise phenotypes, pedigree, and genotypes for a fraction of pedigreed animals. Single-step genomic best linear unbiased prediction (ssGBLUP) was developed to predict breeding values when this type of data is available [29]. This is the method of choice for such populations because traditional GWAS methods cannot be implemented directly due to the necessity of combining results with pedigree structure to create pseudo-observations [30,31]. When the structure of the genotyped dataset is complex, problems such as double counting of contributions from pedigree and phenotypes, and preselection bias [32] reduce accuracy. Lately, ssGBLUP was extended to GWAS [33,34]. The GWAS under the ssGBLUP framework is called ssGWAS and allows the combination of phenotypes, pedigree and genotypes in one single analysis with no need of calculating pseudo-observations [35,36]. In ssGBLUP, the main assumption is that all SNP explain the same proportion of additive genetic variance; however, this is not biologically true, especially when the traits are affected by large QTL. To account for the fact SNP may explain different proportion of variance, weighted ssGBLUP can be used. The WssGBLUP (weighted single-step approach for genome-wide association) method weighs SNP according to their effects in an iterative way [33]. The WssGWAS is fast, accurate and simple to implement for genome-wide association studies [33].
The aim of this study was to perform a WssGWAS to effectively identify genomic regions and biological processes related to number of piglets born alive in the Duroc breed and perform a CNV analyses to detect potential regions affecting the phenotype through changes in gene dosage. The elucidation of genes and molecular mechanisms controlling this trait should result in a better understanding of the genetic regulation of reproductive performance.

CNV detection
After applying stringent filtering criteria, 653 samples did not pass the filtering and were discarded. The 2867 remaining samples were explored to search for CNV.
A total of 46,118 CNVs (4892 gains and 41,226 losses) were detected by PennCNV, of which 8152 (645 gain and 7507 loss) were non-redundant CNVs among the total of 2865 samples. CNVs were not identified in two samples. All CNVs detected by PennCNV were used to infer CNVR by aggregating overlapping CNVs. Thus, a total of 425 CNVRs were obtained (Additional file 1).
The size of the CNVRs averaged 463,621 bp and ranged from 2552 bp to 9718,410 bp ( Fig. 1). Among these regions, 342 corresponded to copy losses, 19 to copy gains and 64 to both (the same fragment showed losses or gains in different animals). It corresponds to a loss:gain ratio of 4.89.
The CNVRs inferred in our study covered 197,038,894 bp (7.01%) of the autosomal genome sequence, and their frequencies ranged from 0.5 to 53.61% in this Duroc population (Table 1).
Although CNVRs were identified in all autosomes, the number and proportion of chromosomes covered by CNVRs varied considerably (Fig. 2, Table 1). Chromosome 1 showed the largest number of CNVRs (49 CNVRs), which covered only 3.94% of its sequence; the lowest coverage observed for a single chromosome. Although the SSC11 showed the highest coverage of a chromosome sequence (16.03%), this chromosome showed a small number of CNVR (22), but with bigger sizes. The SSC17 presented the smallest number of CNVR (10), covering only 5.05% of its sequence.

Weighted single-step genome-wide association study
The estimated heritability for number of piglets born alive was 0.11 ± 0.008. The additive genetic direct and permanent environmental variances were 0.78 and 0.57, respectively, whereas the residual variance was 5.68.
A total of 16 windows were detected each explaining more than 1% of the additive genetic variance for number of piglets born alive on chromosomes 2, 3, 4, 11, 12, 13, 14, 15, 16, and 17 ( Fig. 3, Table 2). These significant windows explained a total of 22.54% of additive genetic variance for number of piglets born alive in the Duroc breed.
We identified common regions between GWAS and CNVR analyses on SSC2 (4.2-5.2 Mb), SSC3 (3.9-4.9 Mb), SSC12 (56.6-57.6 Mb), and SSC17 (17.3-18.3 Mb), comprising a total of 56 protein-coding genes ( Table 2). These regions are very interesting because their gene content could affect the number of pigs born alive through changes in gene dosage. The windows on SSC2, SSC3 and SSC17 are in deletion CNVR, whereas the window on SSC12 is in deletion and duplication CNVR.

Discussion
The number of animals used in this study is greater than previous in studies on pigs [12,22,23,25,27,28,37,38], which could decrease the amount of false-positive CNVs.
Several studies also detected deletions more frequently than duplications in pigs [17,18,22,23,39] which could be partially explained by biological factors because non-allelic homologous recombination tends to create more deletions than duplications [40], and partially by technical bias  because deletions tend to be under stronger purifying selection, i.e., deletions are more deleterious than duplications [41].
Although several studies have reported CNVs in the porcine genome, CNVs have been described to have breed-specific characteristics [22,23,25,27,28,37,39,42]. By using the Illumina Porcine SNP60 BeadChip to detect CNVR, a total of 170 CNVRs (7 gains, 161 losses and 2 both) were detected in 293 Large White pigs [39], 249 CNVRs (70 gains, 43 losses and 136 both) in 585 Large White X Minzhu pigs [42], 65 CNVRs (21 gains, 32 losses and 12 both) in 223 Iberian pigs [37], and 348 CNVRs (88 gains, 243 losses and 17 both) in 302 animals from ten Chinese pig breeds [22]. By using arrayCGH technology, 37 CNVRs (18 gains and 19 losses) were identified in 12 Duroc boars [43]. Besides breed, it is important to highlight that the number of samples analyzed, as well as the differences in calling technology (NGS, SNP genotyping or arrayCGH), resolution, genome coverage and/or quality control to filter CNVs used in the aforementioned studies directly influence the results obtained, as described previously [44][45][46]. Some genes identified in common regions between GWAS and CNVR analyses were highlighted according to their function. The KDM2A, ACTN3, and RHOD genes are mapped on SSC2 (4.2-5.2 Mb) region. The KDM2A gene, also known as FBXL11 and JHDM1A, plays an important role in gene silencing, cell cycle, and cell growth through histone demethylation modification. This gene was identified as differentially expressed in porcine embryonic skeletal muscle, being therefore involved in skeletal muscle development and growth [47]. The ACTN3 gene also encodes a protein which exhibits an important function in muscle metabolism [48]. The RHOD gene is a regulator of reorganization of the actin cytoskeleton and consequently, regulates several cellular processes such as vesicle trafficking, chemotaxis, cell migration and proliferation [49].
The BMP2 gene is the only gene mapped on SSC17 window (17.3-18.3 Mb). BMP proteins exhibit wide spectrum of activities in several tissues (cartilage, bone, blood vessels, liver, lung, kidney, heart and neurons) and perform multiple roles in regulation of growth, differentiation, and apoptosis, playing important functions during embryonic development and tissue morphogenesis [59]. The BMP2 gene plays a critical role in mesenchymal cells influencing adipogenesis, myogenesis, chondrogenesis, and osteogenesis [60][61][62][63]. Homozygous knockout mice for BMP2 gene exhibits embryonic lethality with defects in extra-embryonic and embryonic tissues [64], whereas heterozygous knockout mice have defects in cartilage, bone and heart development [63]. As the BMP2 is mapped to a window that explains part of the additive genetic variance for number of piglets born alive, which is located on a deletion CNVR, the BMP2 becomes an important gene to influence the number of piglets born alive, especially because subnormal levels of BMP2 negatively impacted embryonic development [63] and in the adult, it is required for uterine decidual response during embryo implantation [65].
Additionally, we highlighted the SSC12 (54.0-55.0 Mb) and SSC14 (13.0-14.0 Mb) chromosome regions identified only by WssGWAS due its gene content. The ZMYND15, YBX2 and TMEM95 genes are mapped on SSC12 (54.0-55.0 Mb) and are related with reproductive traits. The ZMYND15 gene is primordial for sperm production and male fertility [66]. The YBX2 gene codifies a protein required for mammal development and fertility because knockout mice for this gene presented disruption of spermatogenesis and oogenesis [67][68][69]. The TMEM95 gene codifies a protein located at the surface of spermatozoa and deficiency of TMEM95 severely compromises male reproductive performance, resulting in subfertility in cattle [70]. The ELP3 gene, mapped on SSC14 window (13.0-14.0 Mb), plays important roles in embryonic stem cell maintenance and early development in mouse [71].
Deleted in Azoospermia (DAZ) gene family encodes proteins with essential roles in male and female gametogenesis. The protein encoded by DAZL gene, related with metabolic process (GO:0008152), was detected in fetal germ cells and developing oocytes [76]. Mutations in DAZL gene have been associated to infertility in man and women [77,78].
The FGF11, FGRF1 and WNT8A genes are related to developmental process (GO:0032502), response to stimulus (GO:0050896), and cellular process (GO:0009987). The FGF11 and FGRF1 gene encodes a member of the fibroblast growth factor (FGF) family and a member of the fibroblast growth factor receptor (FGFR) family, respectively. The FGF family members and their receptors influence mitogenesis and differentiation with essential roles in biological processes, including preimplantation of embryos, embryonic development, and organogenesis [79]. The WNT8A gene, expressed during early embryogenesis, belongs to WNT gene family that has been related to developmental processes, including differentiation, proliferation, apoptosis, regulation of cell fate and patterning during embryonic development [80].
The TNK1, PNOC and AKAP11 genes are related with biological regulation (GO:0065007) and cellular process (GO:0009987). The TNK1 gene encodes a tyrosine protein kinase family member highly expressed in fetal tissues signaling pathways widely utilized during fetal development [81]. Tyrosine protein kinases are crucial regulators of intracellular signal transduction pathways, cell growth, differentiation, survival, and migration [82]. The AKAP11 gene is highly expressed during spermatogenesis and in mature sperm with assumed role in spermatogenesis and sperm functions, which may play functions in cell cycle control of germ and somatic cells [83,84]. The PNOC gene encodes the precursor for biologically active peptides, such as nociceptin, which have been related to several physiological roles in the central nervous system. Additionally, a study suggested that nociceptin plays a key function in meiosis during spermatogenesis [85].

Conclusion
In this study we integrated WssGWAS and CNV analyses to improve the investigation of genetic factors determining number born alive in Duroc pigs. Our study was the first to provide a map of 425 CNV regions in the pig genome, which is a substantial source of information for further studies that aim to explore the association between reproductive traits and CNV regions. The overlapping regions between WssGWAS and CNVR analyses harbor important causative variants related to pig reproductive traits based on their critical roles in fertilization, development of gametes and embryos, which may be valuable for additional validation and consideration in future selection programs aiming to improve number of piglets born alive and other reproductive traits.

Phenotype and pedigree information
The phenotypic information was collected by Smithfield Premium Genetics from five farms of Duroc pigs. Number of piglets born alive was recorded from sows born between 2008 and 2017. A total of 39,427 records from 13,845 sows spanning 1 to 12 parities were used, and number of piglets born alive in each parity ranged from 1 and 19. Animals were grouped into nine contemporary groups which were formed by concatenating farm, month and year of farrowing. Pedigree information was available for 772,779 animals.

SNP genotyping and quality control
A total of 3892 DNA samples were genotyped using the GeneSeek® Genomic Profiler Porcine HD (https://support. illumina.com/downloads/geneseek-ggp-porcine-hd-productfiles.html) which contained 68,528 SNPs across 18 autosomes and two sex chromosomes. Aiming to eliminate poor-quality DNA samples and decrease false-positive CNVs, only the samples with a call rate greater than 98% and call frequency greater than 90% were retained. The SNPs mapped to the sex chromosomes and those not mapped to any of the chromosomes were discarded. A total of 3520 samples genotyped for 57,962 SNPs remained after quality control for CNV and WssGWAS analyses.

CNV detection and statistical analysis
Individual-based CNV detection was conducted using PennCNV software [86] based on a hidden Markov model, which is widely used for detecting CNV based on SNP array data due its relatively low false-positive rate [87].
Multiple sources of information were used simultaneously for obtaining accurate CNV detections such as distance between SNPs, log R ratio (LRR), population frequency of the B allele (PBF), and B allele frequency (BAF). The LRR and BAF measures were automatically computed by GenomeStudio software v2.0 (Illumina, Inc., USA) from the signal intensity files of the SNP data. The PBF file was calculated from the signal files using the compile_pbf.pl routine present in the PennCNV software [86]. In addition, we performed a wave adjustment procedure for genomic waves due to guanine-cytosine content effect applying the gcmodel option in the PennCNV software to eliminated false positive CNVs detected from the differentiating signal intensities generated by genomic waves [88,89]. The porcine gcmodel was generated by calculating the guanine-cytosine content in genomic regions surrounding each SNP (500 kb each side).
A quality control of signal intensity was performed to reduce false-positive CNVs originated from poor-quality DNA and to increase the confidence in CNV identification, which included a BAF drift < 0.01, standard deviation of LRR < 0.30 and/or GC wave factor < 0.05 (after genomic waves were corrected by guanine-cytosine content) to generate raw CNV calls. The CNVs identified in only one sample and with less than three consecutive SNPs were discarded.
In order to eliminate inconsistent calling of CNV boundaries, CNV regions (CNVR) were inferred through concatenation of filtered individual CNVs identified in more than one animal. Regions with less than 0.5% allele frequency and with very low density of overlapping (recurrence < 0.1) were discarded for a more precise definition of CNVR.
Weighted single-step genome-wide association study (WssGWAS) Variance components for number of piglets born alive were estimated by AIREMLF90 software [90], which uses the Average Information Restricted Maximum Likelihood method to estimates variance components as well as solutions for fixed and random effects. The single-trait model included contemporary group as fixed effect, random animal genetic effect (containing inbreeding), permanent environmental effect, and the residual effect. In matrix notation, the model is described as: y¼XbþWaþKpeþe where y is the vector of phenotypic records; b is the vector of fixed effect of contemporary groups; a is the vector of direct additive genetic effects, pe is the vector of permanent environmental effects; and X, W, and K are the incidence matrices for the effects contained in b, a, and pe, respectively. Narrow sense heritability was estimated as h 2 ¼ σ 2 a σ 2 a þσ 2 pe þσ 2 e , where σ 2 i is the variance of the i-th effect. The same animal model as described previously was used to estimate the genomic breeding values using the ssGBLUP (single-step genomic BLUP) approach [91], which combines genomic and pedigree relationships into a realized relationship matrix (H). Therefore, the difference between the regular BLUP and ssGBLUP is that the inverse of the pedigree relationship matrix (A − 1 ) is replaced by H − 1 , which is represented as follows: where G −1 is the inverse of the genomic relationship matrix and A 22 −1 is the inverse pedigree relationship matrix for genotyped animals. The G matrix was constructed as in [30]: where Z is a matrix of genotypes centered by twice the current allele frequencies of each SNP (p); i is the ith locus; D is a diagonal matrix of weights (variances) for SNP, which is an identity matrix for the regular ssGBLUP. To avoid singularity problems, G was blended with 5% of A 22 . After genomic breeding values (GEBV) were estimated by ssGBLUP, they were back solved to obtain SNP effects as described by [33]: whereâ g is GEBV for genotyped animals; λ is the ratio of SNP to additive genetic variances ( The weight for each SNP was calculated based on SNP effects as follows [33]: where d i is the weight for the i-th SNP. The WssGWAS is an iterative procedure that involves the following steps [33]: (i) set D = I; (ii) construct G matrix as described by [30]; (iii) estimate GEBVs for all animals using ssGBLUP; (iv) estimate the SNP effect; (v) estimate weight for each SNP individually; (vi) normalize D to maintain the additive genetic variance constant; (vii) iterate from step ii.
The analyses were performed using BLUPF90 software [92] and the results were obtained as the percentage of additive genetic variance explained by 1 Mb sliding SNP-windows. The percentage of the additive genetic variance explained by i th window was calculated as in [34]: where a i is genetic value of the i-th region consisting of 1 Mb window length physical size, σ 2 a is the total genetic variance, z j is vector of genotype of the j-th SNP for all animals,û j is SNP effect of the j-th SNP within the i-th region, n is the number of SNP in a window of 1 Mb.

Gene annotation and enrichment analysis
The genomic regions exhibiting more than 1% of the additive genetic variance were prospected for possible QTL related to number of piglets born alive. The gene content of the windows was identified using the Ensembl Biomart tool [93]. The search for biologically relevant functions was performed with Panther v.13.1 database [94] selecting a 500 Kb window around a significant region (upstream and downstream) and using Sscrofa10.2 assembly as genome reference. P-values generated by Panther were Bonferonnicorrected for the number of conducted comparisons.