General discussions about significant CNVRs associated with 4 main trait categories
For the reproduction phenotypes, we detected 35 CNVRs to be associated with eight reproduction traits including Dtr_Preg_Rate, Heifer_Conc_Rate, Cow_Conc_Rate, gestleng, Dtr_Calv_Ease, Sire_Calv_Ease, Dtr_Still_Birth, Prod_Life. Of note, 24 of CNVRs mentioned above were significantly associated with DPR. For example, CNVR423 (chr21: 20,323,800 - 20,338,200) and CNVR120 were associated with both DPR and somatic cell score (SCS), which was similar to the previous SNP-based GWAS study [19]. On the other hand, we found that DPR also shared many common CNVRs (CNVR659, 423, 661,120, 213, 941, 584, and 386) with heifer or cow conception rate. Productive life measures a cow’s longevity in the herd and is affected by production, health and reproduction traits. We identified four CNVRs (CNVR661, 408, 350, 714), which were associated with productive life. Three of them were shared with health traits including livability, displaced abomasum, and metritis, while two (CNVR661 and CNVR714) were shared with fertility traits (DPR and related conception rates) and one (CNVR661) shared with production trait (protein yield). Consistently, previous studies have shown that productive life is more related to health and fertility traits than to production and calving traits [19]. In addition, productive life also shared significant CNVRs with body conformation traits, such as Fore_udder_att, teat length and rump width. For significant CNVRs for production, health, and body conformation phenotypes, please see Additional file 2: Table S4’s notes.
Three significant CNVR examples
Among the 87 significant CNVRs, 39 loci were simultaneously related to at least 2 traits, suggesting their pleiotropic effects. Notably, 10 CNVRs were significantly associated with ≥5 traits. For the first example, CNVR661 (chr3: 99,808,706 - 99,846,316) was associated with seven different phenotypes related to production (Protein), female fertility (Dtr_Preg_Rate, Heifer_Conc_Rate), reproductive (Prod_Life, net merit, livability), body conformation (rump width) (Fig. 2A). Compared to a previous study [13], CNVR661 was also harbored by CNVR44 identified in Chinese bulls. In terms of explained proportions of phenotypic variance explained, CNVR661 contributed 5.30% to Dtr_Preg_Rate, suggesting CNVR661 could be an important genetic variance for cattle female fertility. CNVR661 partially overlapped with the coding regions of CYP4A11 (cytochrome P-450 4A11, chr3: 99,806,653 - 99,820,784), which is a major lauric acid (medium-chain fatty acid) omega hydroxylase in human liver. CYP4A11 is involved in fatty-acid metabolism, blood pressure regulation, kidney tubule absorption of ions; and can convert arachidonic acid to 20-hydroxyeicosatetraenoic acid (20-HETE). Of interest, CYP4A11 was specifically highly expressed in kidney and liver according to gene expression atlas (Fang et al., 2018, in preparation) (Fig. 2C). In Chinese cattle, Yang et., al [30] has demonstrated the positive effect of CYP4A11 copy number on body size traits, which may be due to the dosage effects of CYP4A11 copies on the gene expression level in liver, kidney, muscle and adipose. These evidences indicated that the strong associations between CNVR661 and multiple phenotypes could involve its effect on the CYP4A11 function. The second example is CNVR213 (chr15: 42,483,000 - 42,495,000), which may have impacts on reproduction (Dtr_Preg_Rate, Heifer_Conc_Rate, Cow_Conc_Rate), on body status (dairy form, udder depth), and on milk production (Fig. 2B), with effects ranging from 2.27 to 4.96%. Interestingly, CTR9 (CTR9 homolog, Paf1/RNA polymerase II complex component) was found to locate in CNVR213 and was highly expressed in tissues related to blood immune (e.g. thymus, white blood cells, CD4, CD8 cells), male reproduction (testes), and sperm (Fig. 2D). CTR9, a key component of the PAF1 complex, associates with RNA polymerase II and functions in transcriptional regulation and elongation [31]. PAF1 complex also plays a role in the modification of histones and has multiple functions during transcription by RNA polymerase II [32]. The CTR9 showed high expression in thymus and immune related cells. CTR9 has been demonstrated to involve in cord blood-associated megakaryopoiesis [33]. These evidences suggested that CNVR213 might have dosage effects on CTR9 gene expressions in the related tissues, therefore affect CTR9’s regulatory role in reproduction traits. The third example is CNVR758 (chr5: 91,755,000-91,765,800), which explained considerable proportions for Dtr_Preg_Rate (14.30%). We noticed that CNVR758 was located at the upstream of gene CAPZA3 (the capping actin protein of muscle Z-line alpha subunit 3), which encodes an actin capping protein and is one of the F-actin capping protein alpha subunit family. F-actin-capping proteins play a role in the morphogenesis of spermatid. A previous study has demonstrated that CAPZA3 protein may be important in determining sperm architecture and male fertility [34]. Hence, we speculate that CNVR758 near CAPZA3 may affect the CAPZA3 transcription and thus lead to the phenotype effects on bull fertility.
Previous cattle QTLs overlapped with CNVRs
We also identified CNVRs that spanned potential cattle QTLs and OMIA genes influencing disease susceptibility. Among all CNVRs, 91 CNVRs were overlapped with 553 cattle QTLs (Additional file 2: Table S1). By querying against OMIA, we found 9 CNV-overlapped genes, which were related to coat color, health and diseases in ruminants (such as cattle, yak, goat, sheep), pig, and dog (Additional file 2: Table S1). To further investigate the associations of discovered CNVs with phenotypes, we overlapped CNVRs with previously reported cattle QTLs and observed five CNVRs that were overlapped with QTLs (CNVR659, CNVR120, CNVR26, CNVR459, and CNVR953) (Additional file 2: Table S4). Interestingly, some traits in the QTL database were also present in our CNV-based association results. For example, CNVR659 (chr3: 91,875,073 – 91,890,228) was markedly associated with eight complex traits and overlapped with nine QTLs. Among these traits, Dtr_Preg_Rate, dairy form and udder cleft were observed in both results. Using the genotyped markers, significant QTL for milk production traits have been identified on BTA3 [35]. In Danish Red breed, the QTL for milk yield traits has been found on BTA9 based on SNP markers [36]. Our results showed that CNVR953 (chr9: 56,013,697 - 56,026,693) was associated with milk yield and it overlapped with the milk yield QTL. In Holstein cattle, a preliminary estimate for relationship between CNVs and associated SNPs has showed that approximately three-quarter of CNVs could be captured by LD with nearby SNPs [27]. The consistencies between our results and other SNP-based GWAS results might be in part explained by the linkage between CNVs and tag SNPs that were identified as functional QTL. Therefore, in summary, our study provided multiple possible hypotheses to test for the functional impact of CNV on cattle economically important traits. Many novel CNVs identified in this study could function as potential additional markers.
Comparison with published results
To date, multiple studies about Holsteins CNV discovery have been published [12, 14, 15, 37]. Nevertheless, CNV discovery studies often produced large calling datasets with certain false positives. We then examined the overlaps of our results with the Holsteins CNVs identified in several previous reports using Illumina BovineSNP50 array [27], BovineHD SNP array [14, 15, 28] and high throughput sequencing technology [26]. Here, only the high confidence CNVRs after filtering by frequency and CNV length in previous studies were used for comparison analysis. In total, 22.13% (216/976) CNVRs within autosomes in our study with a total length of 15,866,448 bp (44.99%) overlapped with the merged CNVRs of the previous five studies by at least 50% overlapping length (Additional file 2: Table S1). Separately, the overlapped CNVR lengths (CNVR count in our study vs. previous study) were 1,900,016 bp (20 vs. 13) compared to 39 CNVRs in [27], 3,739,012 bp (51 vs. 47) compared to 191 CNVRs in [26], 4,165,080 bp (33 vs. 20) compared to 90 CNVRs in [28], 11,419,896 bp (124 vs. 93) compared to 198 CNVRs in [15] and 14,223,966 bp (152 vs. 106) compared to 230 CNVRs with frequency > = 0.05 in [14], respectively (Additional file 2: Table S1). Despite the small sample size of this study, considerable CNVs are still supported by high confidence CNVs in previous Holsteins CNV discovery studies, especially for those with BovineHD SNP array data. Among the 216 common CNVRs, 135 CNVRs were applied for GWAS analysis and 20 of them were observed to associated with Holstein phenotypes (Additional file 2: Table S4). Additionally, to validate the CNV calling results, our previous study has observed a moderate correlation (r = 0.429) between whole genome aCGH probe values and digital aCGH values in six Holstein individuals of this study [11]. Therefore, this study provided further evidences on common CNVs and discovered certain new CNVs.
Previous CNV-based GWAS studies have provided some evidences for CNV impacting phenotypes in Holsteins. Using Illumina BovineSNP50 arrays data, we identified 34 significant CNVs associated with milk production traits with Golden Helix SNP & Variation Suite (SVS) [27]. Based on the BovineHD genotyping data, we found 57 CNVs associated with phenotypes including feed efficiency and feed intake-related traits [28]. Using CNVs identified from both sequencing and the BovineSNP50 array, 15 CNVRs were associated with 7 economically important traits [26]. Compared to them, this study found 87 significant associated CNVs for 28 complex traits. Of note, 20 CNVRs have been supported by common CNVs in previous five studies we investigated and 17 of them were associated with at least two phenotypes. However, only limited parts of our findings have been identified before. For instance, CNVR120, 423, 661 have been reported in previous studies but none reported for their effects on the analyzed traits related to production and body type [26,27,28]. Therefore, this study described, for the first time, that these CNVRs were associated with DPR and other traits related to production, reproduction, health, and body conformation.
Advantages and disadvantages of this study
We considered these in four parts. First, this study used a unique array CGH platform. Normally, it is not straightforward to compare CNV results across different platforms. SNP arrays output normalized total intensities (Log R ratio, LRR) and allelic intensity ratios (B allele frequency, BAF), whereas CGH array normally do not consider BAF information. While PennCNV used for SNP arrays can provide accurate calculation of copy numbers when less than 4 copies [26, 38, 39], SNP arrays generally do not have the same sensitivity or resolution of dedicated CGH arrays for high copy number CNV discovery [16, 40]. The SNP chip has the inherent bias coverage against areas of the genome known to frequently harbor CNVs [17, 41], while CGH arrays shown better sensitivity signal-to-noise ratios and specificity, probably as a consequence of longer probes on the array CGH platform. Additionally, many CNVs missed by SNP arrays but detected by CGH arrays are in segmental duplication (SD) regions, which could be due to a combination of differences in probe coverage and the type of reference samples used. Using the single reference, the CGH arrays have greater sensitivity to detect small differences in copy number (e.g., 4 vs. 5) [16]. However, different reference samples may pose a problem in the detection of CNVs and result in the different relative copy numbers among test individuals. Thus, it is important to understand these tradeoffs and use the same reference sample within one study [41].
Secondly, we used different CNV calling algorithms. Multivariate method of SVS was used for SNP arrays in [27, 28], while the segment-calling algorithm (SegMNT) was used for array CGH in this study. Although both use similar segmentation algorithms, the multivariate method is designed for detecting small common CNVs based on multi-sample. As described in the SVS manual, the multivariate method (the pooled marker-level testing across samples) carries out association testing first between the phenotypes and raw intensities at the level of the individual marker, and then aggregates neighboring test results to identify CNVs associated with the phenotype [41]. For CGH-segMNT analysis, the Roche NimbleGen segMNT algorithm was used to call CNV segments in each animal compared to the reference animal. The segMNT algorithm identified copy number changes using a dynamic programming process that minimizes the squared error relative to the segment means, which showed increased accuracy and performance [42]. As for CNVtools method, robust quantitative trait association tests of CNVs were performed based on LRR of probes within each CNV regions. CNVtools then combined the information across a small number of CNV probes to obtain a one-dimensional signal using principal component and Bayesian information criterion for each sample. A copy number genotype was assigned to each locus for each individual to test for genetic association with a quantitative trait based on a standard regression approach [43].
Thirdly, we examined a larger number of complex traits for association analysis. Previous studies only investigated 5 milk production traits [27], 10 production or reproduction traits [28], and 7 production, functional and type traits in [26]. With more phenotypes than others, our study performed the CNV-based GWAS study for many complex traits for the first time and provides some new potential markers and promising information for dairy cattle breeding.
Finally, although this study unravels some reasonable and intriguing results, we must acknowledge that given that the limited sample size (n = 39) due to the high cost for array CGH, some associations for certain traits in this study could be less reliable. Therefore, further validation by other methods like long reads sequencing technology and larger sample size is necessary in the future.