Fine mapping of copy number variations on two cattle genome assemblies using high density SNP array

Background Btau_4.0 and UMD3.1 are two distinct cattle reference genome assemblies. In our previous study using the low density BovineSNP50 array, we reported a copy number variation (CNV) analysis on Btau_4.0 with 521 animals of 21 cattle breeds, yielding 682 CNV regions with a total length of 139.8 megabases. Results In this study using the high density BovineHD SNP array, we performed high resolution CNV analyses on both Btau_4.0 and UMD3.1 with 674 animals of 27 cattle breeds. We first compared CNV results derived from these two different SNP array platforms on Btau_4.0. With two thirds of the animals shared between studies, on Btau_4.0 we identified 3,346 candidate CNV regions representing 142.7 megabases (~4.70%) of the genome. With a similar total length but 5 times more event counts, the average CNVR length of current Btau_4.0 dataset is significantly shorter than the previous one (42.7 kb vs. 205 kb). Although subsets of these two results overlapped, 64% (91.6 megabases) of current dataset was not present in the previous study. We also performed similar analyses on UMD3.1 using these BovineHD SNP array results. Approximately 50% more and 20% longer CNVs were called on UMD3.1 as compared to those on Btau_4.0. However, a comparable result of CNVRs (3,438 regions with a total length 146.9 megabases) was obtained. We suspect that these results are due to the UMD3.1 assembly's efforts of placing unplaced contigs and removing unmerged alleles. Selected CNVs were further experimentally validated, achieving a 73% PCR validation rate, which is considerably higher than the previous validation rate. About 20-45% of CNV regions overlapped with cattle RefSeq genes and Ensembl genes. Panther and IPA analyses indicated that these genes provide a wide spectrum of biological processes involving immune system, lipid metabolism, cell, organism and system development. Conclusion We present a comprehensive result of cattle CNVs at a higher resolution and sensitivity. We identified over 3,000 candidate CNV regions on both Btau_4.0 and UMD3.1, further compared current datasets with previous results, and examined the impacts of genome assemblies on CNV calling.

CNV can be identified using various approaches, including comparative genomic hybridization (CGH) array, SNP array, and next-generation sequencing. Compared to other approaches, the advantages of SNP array include its low cost, dense coverage, and high throughput. Substantial genotyping data have been produced from genome-wide association studies, which can be directly exploited for the CNV analysis. A wide range of algorithms of CNV discovery based on SNP array has been developed, including CNVPartition, QuantiSNP [30], PennCNV [31], Birdsuite [32], Cokgen [33], and others. Reviews of the strengths and weaknesses of these algorithms have been published [34,35]. As one of the leading methods, PennCNV incorporates multiple sources of information, including total signal intensity and allelic intensity ratio at each SNP marker, the distance between neighboring SNPs, and the allele frequency of SNPs. PennCNV also integrates a computational approach by fitting regression models with GC content to overcome "genomic waves" [36,37]. Furthermore, PennCNV is capable of considering pedigree information (parent-offspring trios) to improve call rates and accuracy of breakpoint prediction as well as to infer chromosome-specific SNP genotypes in CNVs.
The availability of two alternative cattle reference genomes (Btau_4.0 and UMD3, [38,39]) has opened new avenues of cattle genome research. With the advent of next-generation sequencing, more high density SNP arrays were made commercially available, including Illumina BovineHD BeadChip with more than 750,000 SNPs (Van Tassell et al., unpublished; [40]), which is 15-fold denser as compared to the previous BovineSNP50 array. Furthermore, then-published CNV results were incorporated during the BovineHD design phase to increase its coverage in variable genomic regions.
Based on this high density BovineHD SNP array, our goals in this study were to perform high resolution CNV analyses on both Btau_4.0 and UMD3.1, to compare them with the previous results, and to examine the impacts of genome assemblies on CNV calling.

Results and Discussion
Cattle CNV identification As described previously [27], we performed CNV calling on both Btau_4.0 and UMD3.1 assemblies. Due to mapping uncertainty, we excluded chrX and chrUn from our analysis.
On the Btau_4.0 assembly, 34,311 CNVs were detected with an average of 51 events for each animal (Table 1, Additional file 1: Table S2, Figure 1A and Additional file 2: Figure S1). The average CNV length was 39,953 bp. For subspecies/groups such as the Taurine, Indicine, Composite (Taurine × Indicine) and African breeds, the average CNV events per sample were 45, 65, 53 and 66 respectively (i.e. T:I:C:A = 45:65:53:66). Indicine and African breeds had the most CNVs identified. Within each subspecies/group, the numbers of unique CNVs ranged from 2.3 to 5.1 per sample, indicating that the majority of CNVs were shared at least by two individuals within the same subspecies/groups.
When we merged CNVs into nonredundent CNV regions (CNVRs), a total of 3,346 events were identified covering 142.7 Mb of polymorphic sequence, corresponding to 5.61% of the autosomal genome sequence (142.7/2,545.9 Mb) and 4.89% of the whole cattle genome (142.7/2,918.0 Mb, Table 1, Figure 1A and Additional file 3: Table S3). These CNVRs were comprised of 2,051 loss events, 986 gain and 309 both (loss and gain within the same region), ranging from 1,018 to 5,552,622 bp (Additional file 3: Table S3). Loss events are approximately 2.1-fold more common than gain events, but have smaller sizes than gain events on average (28.5 kb vs. 37.6 kb).
Furthermore, 1,316 CNVRs were found in only one sample (Unique), 2,030 CNVRs were shared at least by two animals (Multiple or more), and 179 events had a frequency >5% (Additional file 3: Table S3). These results suggest that segregating CNVs exist among these subspecies, breeds and groups, which is consistent with our earlier results [26,27].
Strikingly, the mean and median CNVR lengths were significantly shorter, 42,653 and 15,794 bp respectively, when compared to values derived from our previous low density SNP50 array study (mean: 204,965 and median: 131,179 bp). With a similar total length of~140 Mb, our new dataset contains almost five times the number of detected CNVs (3,346) than our previous study (682), suggesting that the BovineHD array provides higher resolution and sensitivity for CNV discovery. Also the current CNV results seem to be more uniformly distributed, with more events detected in the centromere and telomere regions compared to the previous results (For example, chr10 and chr20 in Figure 1A).
Compared to the Btau_4.0 results, 44.86% more CNVs (49,704) were detected within the placed autosomes and an average of 54.90% more events (79) were obtained for each sample on UMD3.1 (Additional file 1: Table S4). The average length of CNV was also 18.58% larger (47,378 bp) when using the UMD3.1 assembly coordinates; however, we identified a similar number of CNVRs (3,438 events) compared to Btau_4.0 CNVRs. The total lengths of CNVRs were similar with comparable statistics (146,905,950 bp; only 2.93% larger, Additional file 4: Table S5). When we assessed the difference in CNVR type calls (gain, loss and both), counts for each category varied less than 70 between the Btau_4.0 and UMD3.1 assemblies.
These results were not unexpected in light of the attributes of these two assemblies. As both assemblies are based on the same raw whole-genome shotgun reads, the most obvious difference between the two is that the Btau_4.0 unplaced contigs from chrUn are now placed on UMD3.1. Another difference between them is that local duplication artifacts, or duplicated regions that were artificially created on the Btau_4.0 assembly, were removed on UMD3.1. As our previous segmental duplication analyses detected, 267 Mb of duplicated sequence on Btau_4.0 were considered to be artifacts from a failure to merge high identity alleles on the reference assembly [41]. Our results were supported by independent FISH experiments [42] and similar observations reported by Zimin et al [39]. In summary, UMD3.1 seems to be more favorable than Btau_4.0 in terms of placing unplaced contigs and removing the unmerged alleles. As a result, most of CNVs calls derived from chrUn of Btau_4.0 could be recovered in UMD3.1 results. It is interesting to note even though UMD3.1's CNV calls are more abundant and larger, the merged CNVR results were almost equivalent in terms of count and length. However, the exact effects of local assembly differences (besides UMD3.1's placing unplaced contigs and removing the unmerged alleles) on CNV calling (count and length) warrant more detailed investigations in the future.

Quality assessment of selected CNV Events
Since most existing cattle CNV studies were based on Btau_4.0, we first compared the identified CNV regions  with previous published CNVs dataset [23][24][25][26] and the segment duplication (SD) dataset [41] based on Btau_4.0 ( Figure 2). Since calls on chrX were not always included in these studies, we only compared the CNVRs detected within the autosomes. We found that 789 of our 3346 CNVRs (65.1 Mb) overlapped with all combined nonredundant published data. Detailed information of each comparison was displayed in Figure 2. Approximately 14% of our new CNV calls (482 out of 3346 CNVRs) overlapped with 51% (346 out of total 682 CNVR) of the CNV regions identified in 521 animals of 21 cattle breeds using BovineSNP50 arrays [27]. When considering the CNVR lengths, 36% of variable sequence space identified in this study intersected with 37% of the previous study, representing an overlap of 51.1 Mb. The discrepancy between these two studies could be due to a difference in samples as 674 animals of 27 cattle breeds were used in this study out of which, approximately 442 animals of 19 breeds (~65.6%) were shared by the later study. Additionally, SNP arrays that probed 763,572 SNPs were used in this study, while arrays with only 56,947 SNPs were used in the previous study. Only 46,475 SNPs were shared by both platforms.
After comparison with other existing datasets, we found that 54.38% (77.6 Mb) of our CNVR calls were not reported in the literature. In order to confirm these (See figure on previous page.) Figure 1 Cattle copy number variations derived from SNP arrays on Btau_4.0 and UMD3.1. A. On Btau_4.0, CNV regions (682 events, 139.8 Mb) derived from BovineSNP50 assay are shown in the outer circle in green (gain), red (loss) and dark blue (both), while the inner circle shows the CNV regions (3,346 events, 142.7 Mb) derived from BovineHD assay. B. On UMD3.1, CNV regions (3,438 events, 146.9 Mb) derived from BovineHD assay are shown in the outer circle in green (gain), red (loss) and dark blue (both), while the inner circle shows their frequencies.
novel CNVRs, we performed 56 quantitative PCR (qPCR) assays for 18 CNV loci in 12 animals (Additional file 5: Table S6). All primer coordinates on both assemblies were given in Additional file 5: Table S6, with the exception that primers for CNVR No. 1856 could only be placed on the Btau_4.0 assembly. Most of the CNV regions had two target amplicons placed near the probes in the CNVR regions. Out of 28 CNV loci and animal combinations, 23 (82.14%) had positive qPCR confirmations in at least one amplicon and 18 (64.30%) had positive results at both amplicons. If each PCR assay was counted separately, 73.21% (41/56) were in an agreement with the CNV status estimated by PennCNV (Additional file 5: Table S6), which is considerably higher than the previous validation rates between 48 to 60% [27].

CNVs overlap with segmental duplications and other genomic features
We also compared the identified CNV regions with the 2,952 SD (excluding chrX and chrUn) identified by WGAC and WSSD [41]. Agreeing with previous predictions regarding cattle SDs, a local tandem distribution pattern is predominant in our cattle CNVR dataset ( Figure 1A). About 10.07% (337/3346) of CNV regions directly overlap with cattle SDs with an overlapping span of 18 Mb (12.61% of the total 142.7 Mb). Approximately 16.57% (489/2952) of the SDs (excluding chrX and chrUn) identified by WGAC and WSSD exhibit CNVs ( Figure 2C). In comparison, 25.66% of the CNVRs (175/ 682) detected by using BovineSNP50 array overlap with cattle SDs, corresponding to 16.28 Mb (11.65% of the total 139.8 Mb) [27]. The fractions of CNVR calls that overlap with SDs were similar (12.61% vs. 11.65%) in both SNP array studies and they were significantly lower than the fraction of SD-overlapping CNVRs (58.90%, 96/ 163 or 60.56%, 15.2/25.1 Mb) as detected by using array CGH [26]. This lower overlap fraction probably reflects the fact that both SNP arrays have poor representation within cattle SD regions. SNP density on the Bovi-neSNP50 array drops by one-third (from 21 probes/Mb in unique regions down to 14 probes/Mb) in SD regions while SNP density on the BovineHD array drops by over 61% (from 277 probes/Mb in unique regions down to 107 probes/Mb) in SD regions. We also overlapped our CNVRs with cattle Online Mendelian Inheritance in Animals (OMIA), Online Mendelian Inheritance in Man (OMIM) and cattle QTL datasets (Additional file 3: Table S3).

Gene content of CNV regions
Since UMD3.1 placed unplaced contigs and removed unmerged alleles, we focused on further characterization of the 3,438 high-confidence CNV regions from UMD3.1 autosomes. Additionally, we used both Btau_4.0 and UMD3.1 assemblies and obtained similar Panther and IPA results, as presented in Additional file 6: Table S7 and Additional file 7: Table S8.
We assigned PANTHER accessions to the overlapped peptides for CNVRs identified on both Btau_4.0 and UMD3.1 in this study. Similar statistically significant overrepresentations were observed for multiple categories (Additional file 6: Table S7). This set of copy number variable genes encompasses a wide spectrum of molecular functions, biological processes, pathways, cellular components and Panther protein classes. For example, immune system process, cellular defense response, response to stimulus, antigen processing and presentation, natural killer cell activation were among several enriched biological processes. G-protein coupled receptor activity was enriched while transcription factor and regulation activities were under represented in the Molecular Function terms. Our UMD3.1 CNVRs overlapped with 3,828 unique in silico mapped human RefSeq genes (Table 2 and Additional file 4: Table S5), of which 3,669 were mapped as candidate genes for IPA analysis. A total of eleven networks with an IPA score greater than 10 were identified (Table 3). A score of 10 indicates that there was less than a 10 -10 probability that the genes in the network were associated together by chance. The identified regulatory networks covered a broad range of physiological functions and processes, including inflammatory response, cell-to-cell signaling and interaction, cell development and cycle, lipid metabolism, normal development and function (embryo, organism, reproductive system, hematological system, skeletal and muscular system), and genetic disorder (Additional file 7: Table S8). Immunological and defense pathways were particularly enriched in IPA Pathway and Function analyses, further confirming the Panther results. The results of our functional analyses are in agreement with the findings of previous CNV studies [23][24][25][26][27]. Using the Btau_4.0 assembly, the IPA results are generally similar, with slight differences likely attributable to differences in assemblies and their annotations.

Conclusion
In addition to single nucleotide polymorphisms (SNPs), CNVs have been revealed to be a substantial source of genetic variation in cattle. In this study, we performed two comprehensive CNV analyses based on Btau_4.0 and UMD3.1. When we compared our current and previous results across different SNP platforms on Btau_4.0, we detected higher resolution, sensitivity and PCR validation rate in these current CNV datasets. Our findings suggest that the use of high-density oligonucleotide arrays may allow more precise boundary information to be extracted for CNV detection. Therefore, the use of high-density SNP arrays combined with improved CNV calling algorithms seems to significantly improve the accuracy of CNV calling. As a consequence, more CNVs were identified and the CNVs identified from those SNP arrays were more accurate and with better defined boundaries.
In this study, we provide additional support to the observation that the distribution of CNVs varies among the four subspecies groups. In case of the complete absence of a particular allele of a SNP in a certain breed, the copy number of that SNP in the breed would be biased towards loss or deletion. While some of these breed differences could be related to the fact that the SNP markers were designed on the reference genome sequence (which was derived from the sequence of a Hereford cow of European origin; L1 Dominette 01449), our observation of differential CNV counts in different breeds and subspecies was largely consistent with their histories and divergences.
We also systematically evaluated current results across two existing cattle reference assemblies. Although we obtained comparable CNVR results, we found approximately 50% more and 20% longer CNVs on UMD3.1 as opposed to those on Btau_4.0. It is worth to note significant differences exist between different assemblies. Therefore it is critical to select appropriate assemblies and further validate the predicted variants experimentally.

Selection of cattle breeds and animals
Cattle selected in this study were composed of 686 individuals from 27 breeds originally, out of which, 674 distinct high quality genotyping results with call rate larger  [26]. It is worth noting that a portion of the genes in the bovine genome has not been annotated or has been annotated with unknown function, which may influence the outcome of this analysis. Overlapping between CNVRs and additional genomic features such as cattle OMIA, OMIM and cattle QTL datasets were performed as described [26]. In silico mapped human RefSeq genes in CNVRs were analyzed using Ingenuity Pathways Analysis (IPA) v9.0 (Ingenuity Systems, Redwood City, CA) as previously described [27]. The accessions of unique genes were imported into the software and subsequently mapped to their corresponding annotations in the Ingenuity Pathways Knowledge Base. The ''Core Analysis" function included in IPA (http://www.ingenuity.com/) was used to analyze these genes in the context of networks, biological functions and Pathways. The networks accommodated these unique genes (also called focus molecules) were identified in comparison with the comprehensive global networks developed by IPA. The molecule network was illustrated with an assigned relevance score, the number of focus molecules, as well as the top function of the networks. In the process of analysis, each network was set to have a maximum of 35 molecules by default. We used only human genes and all confidence levels, including evidences of experimentally observed, predicted high or moderate confidence. The top significant biological functions and Pathway were listed.

Additional files
Additional file 1: Table S1. Numbers of subspecies, breeds, animals and trios used to call CNVs genotyped by BovineHD assay. Table S2. The summary of CNVs or CNVRs for each specie/breed based on Btau_4.0. Table S4. The summary of CNVs or CNVRs for each specie/breed based on UMD3.1.
Additional file 2: Figure S1. Comparison of cattle copy number variations derived from BovineHD and BovineSNP50 assays on Batu_4.0.
Additional file 5: Table S6. The summary of PCR results.
Additional file 6: Table S7. Over/Underrepresentation of PANTHER terms (molecular function, biological process, pathway, cellular component and PANTHER protein class) on Batu_4.0 and UMD3.1. Additional file 7: Table S8: Network, Biological function and Pathway analyses using IPA on Batu_4.0 and UMD3.1. Author contributions GEL and YH conceived and designed the experiments. DMB, YH, CL, and JS performed in silico prediction, qPCR, and computational analyses. MLH, DAB, SF, AE, SD, GRW, TSS and CPVT collected samples and generated SNP array genotyping data. YH, GEL, and DMB wrote the paper. All authors read and approved the final manuscript.