Functional and population genetic features of copy number variations in two dairy cattle populations

Lee, Young-Lim; Bosse, Mirte; Mullaart, Erik; Groenen, Martien A. M.; Veerkamp, Roel F.; Bouwman, Aniek C.

doi:10.1186/s12864-020-6496-1

Research article
Open access
Published: 28 January 2020

Young-Lim Lee ORCID: orcid.org/0000-0003-1182-0197¹,
Mirte Bosse¹,
Erik Mullaart²,
Martien A. M. Groenen¹,
Roel F. Veerkamp¹ &
Aniek C. Bouwman¹

BMC Genomics volume 21, Article number: 89 (2020)

Abstract

Background

Copy number variations (CNVs) are gains or losses of DNA segments that are known to play a role in shaping a wide range of phenotypes. In this study, we used two dairy cattle populations, Holstein Friesian and Jersey, to discover CNVs using the Illumina BovineHD Genotyping BeadChip aligned to the ARS-UCD1.2 assembly. The discovered CNVs were investigated for their functional impact and their population genetic features.

Results

We discovered 14,272 autosomal CNVs, which were aggregated into 1755 CNV regions (CNVRs) from 451 animals. These CNVRs together cover 2.8% of the bovine autosomes. The assessment of the functional impact of CNVRs showed that rare CNVRs (MAF < 0.01) are more likely to overlap with genes than common CNVRs (MAF ≥ 0.05). The population differentiation index (Fst) based on CNVRs revealed multiple highly diverged CNVRs between the two breeds. Some of these CNVRs overlapped with candidate genes such as MGAM and ADAMTS17, which are related to starch digestion and body size, respectively. Lastly, linkage disequilibrium (LD) between CNVRs and BovineHD BeadChip SNPs was generally low, close to 0, although common deletions (MAF ≥ 0.05) showed slightly higher LD (r² = ~0.1 at 10 kb distance) than the rest. Nevertheless, this LD is still lower than SNP-SNP LD (r² = ~0.5 at 10 kb distance).

Conclusions

Our analyses showed that CNVRs detected using BovineHD BeadChip arrays are likely to be functional. This finding indicates that CNVs can potentially disrupt the function of genes and thus might alter phenotypes. Also, the population differentiation index revealed two candidate genes, MGAM and ADAMTS17, which hint at adaptive evolution between the two populations. Lastly, low CNVR-SNP LD implies that genetic variation from CNVs might not be fully captured in routine animal genetic evaluation, which relies solely on SNP markers.

Background

Genetic variations exist in various forms in genomes. Although single nucleotide polymorphisms (SNPs) have been the choice of variants in numerous studies, there is a growing body of evidence that copy number variations (CNVs) can have functional impact. Copy number variations are DNA segments of 1 kb or larger, and are present in varying copy numbers, compared to a reference genome [1]. Since the initial discovery of large sub-microscopic CNVs (some hundred kb) [2, 3], rapid developments in detection platforms and algorithms have advanced knowledge about CNVs, mainly in humans [4, 5].

In the early phase of their discovery, CNVs were expected to resolve the missing heritability (the observation that significant SNPs identified from genome-wide association studies (GWAS) together account for only a small part of the heritability) [6, 7]. This was because, in terms of base pairs, they cover a larger proportion of the genome than SNPs. With the accumulation of data and analyses, the occurrence of CNVs in the genome was shown to be biased away from functional elements [5]. Nevertheless, numerous studies have shown that CNVs play a role in determining a wide range of human health conditions, from obesity to neurodevelopmental diseases [8,9,10,11]. For instance, high copy numbers of the CCL3L1 and CYP2D6 genes confer reduced susceptibility to infection with HIV and the development of AIDS [12]. Also, the role of CNVs in adaptive evolution is exemplified by the mean copy number of the AMY1 gene (which codes for amylase alpha 1, an essential enzyme for starch digestion). The mean copy number of the AMY1 gene was shown to differ between human populations depending on dietary starch composition [13]. These findings demonstrate that CNVs may contribute to adaptive potential, and thus contain information about population history.

Studies in livestock species also highlighted the role of CNVs in shaping various phenotypes. For example, several genes affected by CNVs determine coat colours of specific breeds. Duplications of the KIT gene in pigs are related to white coat, which is only shown in domestic pigs [14, 15]. In cattle, serial translocation of the KIT gene was related to a colour-sidedness phenotype [16]. Moreover, CNVs were shown to be associated with quantitative traits that are economically important in livestock breeding, in various cattle populations [17,18,19]. One study investigated whether trait associated CNVs are in linkage disequilibrium (LD) with, and thus are tagged by, SNP markers, and revealed that ~ 25% of CNVs were not in LD with SNP markers [17]. However, this study was based on Illumina BovineSNP50 array data, in which SNP density and CNV resolution were low.

Holstein Friesian (HOL) and Jersey (JER) are the two main commercial dairy cattle breeds that have been bred under different breeding schemes. Although there have been studies investigating the link between CNVs and individual production traits [17,18,19,20,21], in-depth assessment of functional impacts of CNVs in cattle genomes has been limited. Also, whether CNVs that have an impact on phenotypes are captured in genomic evaluation, in other words, whether CNVs are in sufficient LD with SNPs, is largely unexplored. Furthermore, CNVs have been shown to be useful in disentangling population history and provide valuable insights in understanding how populations have evolved over time [22,23,24,25]. However, population genetics analyses exploring CNVs, with their main focus on HOL and JER, have been sparse.

Here, we aimed at discovering CNVs in bovine genomes based on genome assembly ARS-UCD1.2 [26] using high density SNP array data, in two dairy cattle populations. Subsequently, we performed in-depth analyses on the functional impact of CNVs and further explored the population genetic features of CNVs by analysing population differentiation index (Fst) and LD.

Results

CNV discovery in the genome build ARS-UCD1.2

The data consisted of Illumina BovineHD BeadChip (Illumina, San Diego, CA, USA) genotypes from two distinct dairy breeds (Holstein Friesian – HOL (n = 331), Jersey – JER (n = 115)) and their crossbreds (n = 29). A previous study using PennCNV on BovineHD data, of which 47 HOL animals overlapped with our study, showed a high rate of CNV confirmation based on qPCR validation (91.7% for CNVs found in multiple animals, 40% for singleton CNVs) [24]. Therefore, we chose to perform CNV detection on bovine autosomes using the PennCNV software [27]. The BovineHD SNPs were aligned to genome assembly ARS-UCD1.2.

We discovered 14,272 CNV calls from 451 individuals that passed the quality control criteria (31.6 calls/individual). Deletion calls were 1.8 times more frequent but 40% shorter (n = 9171, mean length = 44.2 kb) than duplication calls (n = 5101, mean length = 74.6 kb; Additional file 2: Table S1 and Additional file 1: Figure S1). The mean probe density (number of supporting SNPs per Mb of CNV) was 403 SNPs/Mb. The 14,272 CNV calls were aggregated into 1755 CNV regions (CNVRs), based on at least 1 bp overlap, following Redon et al. [28]. These CNVRs cover 2.8% of the autosomal genome sequence (69.6/2489.4 Mb; Fig. 1; a full list of CNVRs is in Additional file 2: Table S2). These CNVRs consist of 1125 deletion CNVRs (mean length = 29.2 kb), 513 duplication CNVRs (mean length = 36.8 kb), and 117 complex CNVRs (mean length = 152.7 kb). The distribution of CNVR length is exponential: the majority of CNVRs are of short to medium length (< 100 kb, 93%), while only a few CNVRs are long (> 100 kb, 7%). The CNVRs are non-randomly distributed over the chromosomes: chromosome-wide CNVR coverage varies from 0.6% on BTA24 to 4.9% on BTA12 (Additional file 2: Table S3). BTA12 is most densely covered with CNVRs in terms of bp (4.2 Mb), and is especially enriched for complex CNVRs (2.2 Mb). The allele frequency of CNVRs ranges between 0.001 and 0.21.
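The aggregation of CNV calls into CNVRs by at least 1 bp of overlap can be sketched as a simple interval merge. This is a minimal illustration rather than the pipeline used in the study: the tuple layout, the function names, and the rule that a region holding both deletion and duplication calls is labelled "complex" are assumptions made for this example.

```python
from collections import defaultdict


def _close(chrom, region):
    """Turn an open region into a (chrom, start, end, label, n_calls) tuple."""
    start, end, types, n_calls = region
    # Assumption: a region with both call types is labelled "complex".
    label = "complex" if len(types) > 1 else next(iter(types))
    return (chrom, start, end, label, n_calls)


def merge_cnvs_into_cnvrs(calls):
    """Merge CNV calls into CNV regions (CNVRs).

    calls: iterable of (chrom, start, end, cnv_type) with inclusive
    coordinates and cnv_type in {"del", "dup"} (hypothetical layout).
    Calls on the same chromosome sharing at least 1 bp are merged.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, cnv_type in calls:
        by_chrom[chrom].append((start, end, cnv_type))

    cnvrs = []
    for chrom in sorted(by_chrom):
        current = None  # [start, end, set_of_types, n_calls]
        for start, end, cnv_type in sorted(by_chrom[chrom]):
            if current is not None and start <= current[1]:
                # At least 1 bp shared with the open region: extend it.
                current[1] = max(current[1], end)
                current[2].add(cnv_type)
                current[3] += 1
            else:
                if current is not None:
                    cnvrs.append(_close(chrom, current))
                current = [start, end, {cnv_type}, 1]
        if current is not None:
            cnvrs.append(_close(chrom, current))
    return cnvrs
```

Because calls are sorted by start position within each chromosome, a single pass is enough: a call either touches the currently open region or starts a new one.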

Since most cattle CNV studies used genome assembly UMD3.1, we also repeated the CNV detection procedures using UMD3.1. Subsequently, we used these calls to compare our CNV discovery results with those of other cattle CNV studies. From the 447 individuals that passed the QC criteria, 24,264 CNVs were called (54.3 calls/individual) and the mean probe density was 326 SNPs/Mb. These CNVs were aggregated into 1866 CNVRs (1130 deletions, 593 duplications, and 143 complex CNVRs). The mean length of deletion, duplication, and complex CNVRs is 29, 36, and 193 kb, respectively (Additional file 2: Table S1). These CNVRs together cover 82 Mb (3.3%) of bovine autosomes. The chromosome-wide coverage varies between 1% on BTA24 and 10% on BTA12 (Additional file 2: Table S4 and Additional file 1: Figure S2). Our CNV discovery results are in a similar range to those of other cattle CNV studies conducted using the same SNP array and the genome assembly UMD3.1 [22, 24, 29,30,31,32] (Additional file 2: Table S5).

When we compared the CNVs discovered based on UMD3.1 and ARS-UCD1.2, we observed several differences. Firstly, the number of CNVs called per individual based on ARS-UCD1.2 is 42% lower than that obtained using UMD3.1. Also, the mean probe density increased from 326 SNPs/Mb in UMD3.1 to 404 SNPs/Mb in ARS-UCD1.2, indicating that with ARS-UCD1.2, CNVs are supported by more SNPs. Lastly, the mean length of complex CNVRs decreased by 40 kb, from 193 kb in UMD3.1 to 152.7 kb in ARS-UCD1.2. We further inspected the BTA12:70–77 Mb region, where a large change between UMD3.1 and ARS-UCD1.2 was observed. This region was reported to have a large number of deletion and duplication calls by other cattle CNV studies based on UMD3.1, regardless of the studied breeds [24, 29,30,31,32,33]. In our CNV discovery, we identified 7 CNVRs (total length of ~ 6.2 Mb) in this region based on UMD3.1, whereas ARS-UCD1.2 based results revealed 9 CNVRs that covered ~ 1 Mb. We compared the positions of BovineHD SNPs in UMD3.1 and ARS-UCD1.2 to see whether the changes in genome assemblies caused this discrepancy. The results showed that 43% of the SNPs located in BTA12:70–77 Mb based on UMD3.1 were either moved to unmapped contigs or had undefined reference and alternative alleles. The genome-wide proportion of SNPs that were moved to different chromosomes or contigs was much lower (2.3%). This indicates that the two genome assemblies differ in this region, which led to different CNV discovery results.

Functional impact of CNVRs

The expression of genes can be altered by CNVs. Deletions and duplications of part of a gene and/or a complete gene can disrupt gene expression and can potentially lead to changes in various phenotypes [34]. Therefore, identification of CNVRs that coincide with genes can be a primary step to assess their functional impact. To achieve this, we further explored the CNVRs found based on ARS-UCD1.2. The overlap of CNVRs with Ensembl annotated genes was analysed: among the 1755 CNVRs, 912 (52%) are genic and 843 (48%) are intergenic. Genic CNVRs overlap with 1739 out of 27,570 Ensembl annotated genes (6.3%) and 2936 out of 43,949 gene transcripts (6.7%). Among the 1739 genes that overlap with CNVRs, 957 (55%) are completely within the CNVRs and the rest (45%) are partially affected (genic features were inside the CNVRs). The following functional impact categories were assigned to each CNVR depending on the type of overlap between CNVRs and genes (numbers in brackets indicate the number of CNVRs and genes, respectively, for each category; see materials and methods for a detailed explanation of the classification): 1) intergenic (843 CNVRs; 0 genes), 2) intronic (214 CNVRs; 234 genes), 3) whole gene (253 CNVRs; 957 genes), 4) stop codon (147 CNVRs; 203 genes), 5) promoter regions (124 CNVRs; 187 genes), and 6) exonic (174 CNVRs; 165 genes).

These functional categories were then intersected with other features of CNVRs such as type (deletion, duplication, complex), MAF (common, intermediate, and rare; see methods for detailed explanation), and population (HOL and JER; Fig. 2). The functional consequences of CNVRs differ depending on the type of CNVR: complex CNVRs were skewed towards genic regions (68% are genic), whereas deletion and duplication CNVRs were genic less often (51–52% are genic), and the difference is significant (chi-square test P < 10^−13). We also observed that MAF has an impact on the overlap between genes and CNVRs. Rare CNVRs tend to be genic more often (60%), whereas common CNVRs overlap with genes less often (48%; chi-square test P < 0.002). However, when deletion CNVRs and duplication CNVRs were considered separately, we saw a different pattern. Common deletion CNVRs are more often intergenic (61%), yet common duplication CNVRs are often genic (68%). When CNVRs of HOL and JER are compared, common JER CNVRs are more often genic (51%) than common HOL CNVRs (44%).

Subsequently, we performed permutation tests on overlaps between CNVRs and autosomal genes, to test whether the overlap is significantly higher than expected under a neutral scenario. The results show that CNVRs overlap with autosomal genes more often than expected from permutation tests with random genomic regions (P < 0.001). Next, gene ontology (GO) analyses were performed to understand the functions of the genes that overlap with CNVRs. Genes overlapping deletion, duplication, and complex CNVRs were tested for GO enrichment as separate classes (Table 1). Among the findings, genes overlapping with the complex CNVRs (n = 407) show a pronounced enrichment in response to stimulus (GO:0050896; FDR = 1.8 × 10^−6), immune response (GO:0006955; FDR = 1.9 × 10^−3), and detection of stimulus involved in sensory perception (GO:0050906; FDR = 1.1 × 10^−2). These findings are similar to those of earlier cattle CNV studies [30, 33].
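A permutation test of CNVR-gene overlap can be sketched as follows. The exact randomisation scheme used in the study is not described in this section, so this example makes its own assumptions: each CNVR is relocated to a uniformly random position on the same chromosome while keeping its length, and the empirical P value counts permutations with at least as many genic regions as observed. All function and parameter names are hypothetical.

```python
import random
from bisect import bisect_right


def build_index(genes):
    """Per chromosome: sorted gene starts plus a running maximum of gene ends."""
    raw = {}
    for chrom, start, end in genes:
        raw.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        running_max, best = [], 0
        for _, e in intervals:
            best = max(best, e)
            running_max.append(best)
        index[chrom] = (starts, running_max)
    return index


def overlaps_gene(index, chrom, start, end):
    """True if [start, end] shares at least 1 bp with any gene on chrom."""
    if chrom not in index:
        return False
    starts, running_max = index[chrom]
    k = bisect_right(starts, end)  # number of genes starting at or before `end`
    return k > 0 and running_max[k - 1] >= start


def count_genic(index, regions):
    return sum(overlaps_gene(index, c, s, e) for c, s, e in regions)


def permutation_test(cnvrs, genes, chrom_lengths, n_perm=1000, seed=1):
    """Return (observed genic count, one-sided empirical P value)."""
    rng = random.Random(seed)
    index = build_index(genes)
    observed = count_genic(index, cnvrs)
    as_extreme = 0
    for _ in range(n_perm):
        shuffled = []
        for chrom, start, end in cnvrs:
            length = end - start + 1
            # Assumption: uniform placement on the same chromosome.
            new_start = rng.randint(1, chrom_lengths[chrom] - length + 1)
            shuffled.append((chrom, new_start, new_start + length - 1))
        if count_genic(index, shuffled) >= observed:
            as_extreme += 1
    # Add-one correction so the empirical P value is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)
```

With 1000 permutations, the smallest P value this sketch can return is 1/1001, which corresponds to a bound such as P < 0.001.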

Table 1 GO enrichment results for different types of CNVR

Population genetics of CNVRs

Population genetics analyses provide a framework to understand genetic variation seen in specific (cattle) populations. Understanding general properties of genetic variants is important, but further characterization of specific variants of interest can provide insights into recent adaptation and genome biology [35]. Although SNPs have been extensively used in characterizing various cattle populations [36], we explored the population genetic properties of CNVRs.

We focused our analyses on HOL (n = 315) and JER (n = 107) animals, derived from distinct origins and with a different breed formation history [37]. First, we coded the genotypes of our bi-allelic CNVRs (n = 1154 for HOL; n = 700 for JER) as “+/+”, “+/−”, and “−/−”. The CNVR allele frequency was classified as rare (MAF < 0.01), intermediate (0.01 ≤ MAF < 0.05) and common (0.05 ≤ MAF). In HOL, the allele frequency ranged from 0.002 to 0.29, and 5, 13, and 82% of the 1154 CNVRs were categorized as common, intermediate, and rare CNVRs, respectively. For the JER population, allele frequency ranged from 0.005 to 0.37, and 11, 20, and 69% of the 700 CNVRs were categorized as common, intermediate, and rare CNVRs, respectively.
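As a small illustration of the genotype coding and frequency classes above, the sketch below computes a CNVR allele frequency from coded genotypes and assigns the MAF class. It assumes that "-" marks the CNV allele and "+" the other allele, and it uses an ASCII hyphen in place of the typographic minus sign; the function names are hypothetical.

```python
def cnvr_allele_frequency(genotypes):
    """Frequency of the "-" allele among diploid genotypes.

    genotypes: list of strings "+/+", "+/-" or "-/-" (ASCII hyphen).
    Assumption: "-" denotes the CNV allele.
    """
    if not genotypes:
        raise ValueError("no genotypes given")
    variant_alleles = sum(g.count("-") for g in genotypes)
    return variant_alleles / (2 * len(genotypes))


def minor_allele_frequency(freq):
    """Fold an allele frequency so that it is at most 0.5."""
    return min(freq, 1.0 - freq)


def maf_class(maf):
    """Classes as defined in the text: rare, intermediate, common."""
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "intermediate"
    return "common"
```

For example, 4 heterozygous carriers among 100 animals give an allele frequency of 4/200 = 0.02, which falls in the intermediate class.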

We constructed site frequency spectra of CNVRs for HOL and JER separately (Fig. 3). For both populations, we observed that deletions and duplications have slightly different spectra: deletions were more skewed towards rare CNVs, whereas duplications were observed relatively more frequently than deletions in each MAF class. We further explored the allele frequencies by applying Wright’s fixation index (Fst) [38] to characterize population structure [39] and detect loci that underwent selection [40], as done by Xue et al. [41]. Given that HOL and JER have distinctive origins and breed formation histories [37], we hypothesized that Fst on their CNVRs can reveal regions that underwent recent population differentiation. The Fst distribution followed an exponential decay pattern, as expected, underlining that the majority of CNVRs have values close to 0, whereas only a few outliers (~ 3%) that are potentially under positive selection reached high Fst values (Additional file 2: Figure S3). We identified 32 highly diverged CNVRs (Fst > mean + 3 S.D.), of which 15 are genic and 17 are intergenic (Fig. 4 and Additional file 2: Table S6). Among the 17 intergenic CNVRs with high population differentiation (Fst = 0.12–0.44), 7 CNVRs had regulatory elements such as lncRNA and snoRNA within ~ 300 kb from the CNVRs. Among the genic CNVRs, CNVR 380 (Fst = 0.21; duplication), which is more frequent in JER (MAF = 0.24) than in HOL (MAF = 0.04), contains three genes: CLEC5A [42], TAS2R38 [43], and MGAM. These genes have been linked to abnormal eating behaviour, bitter taste perception, and the synthesis of maltase glucoamylase, a starch digestive enzyme, respectively. Furthermore, CNVR 826, 1312, and 1458 overlap with genes that are known to regulate body size: LRRC49 [44], CA5A [45], and ADAMTS17 [46,47,48], respectively. Interestingly, these CNVRs are duplications and have a high allele frequency in JER (MAF = 0.08–0.37) and a low allele frequency in HOL (MAF = 0–0.06).
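The outlier criterion (Fst > mean + 3 S.D.) can be sketched as below. The Fst estimator shown is one common heterozygosity-based form, Fst = (H_T − H_S) / H_T, with sample-size weights; the exact estimator and S.D. convention used in the study are not given in this section, so this sketch is an assumption and will not necessarily reproduce the reported values.

```python
from statistics import mean, stdev


def fst_biallelic(p1, n1, p2, n2):
    """Fst = (H_T - H_S) / H_T for one bi-allelic locus in two populations.

    p1, p2: allele frequencies; n1, n2: sample sizes used as weights.
    Assumption: expected heterozygosity 2p(1-p), size-weighted averages.
    """
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # locus is monomorphic overall
    h_within = (n1 * 2 * p1 * (1 - p1) + n2 * 2 * p2 * (1 - p2)) / (n1 + n2)
    return (h_total - h_within) / h_total


def high_fst_outliers(fst_by_cnvr, n_sd=3.0):
    """Return CNVRs whose Fst exceeds mean + n_sd * S.D. (sample S.D. assumed)."""
    values = list(fst_by_cnvr.values())
    threshold = mean(values) + n_sd * stdev(values)
    return {cnvr: v for cnvr, v in fst_by_cnvr.items() if v > threshold}
```

The threshold is computed from the empirical Fst distribution itself, so the set of outliers depends on which CNVRs are included in the calculation.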

Subsequently, we calculated the Vst statistic, which is widely used in CNV studies [23, 49]. This statistic is analogous to Fst, but uses log R ratio (LRR) values instead of allele frequencies [28]. The Vst statistic ranges between 0 and 1, where values close to 1 indicate strong population differentiation. To strengthen our confidence in the high Fst outlier regions, we compared the Fst and Vst statistics. Firstly, we calculated Vst for 1464 CNVRs for which Fst values are available. The Pearson correlation coefficient between Fst and Vst was low (0.22), and many selection candidate CNVRs that were found only with Vst were either driven by rare CNVRs (fewer than 5 copies) or supported by a small number of SNPs (the average number of SNPs for the top 20 Vst CNVRs and Fst CNVRs was 3.7 and 20.7, respectively; Additional file 2: Figure S4 A-C). To correct for this, we removed CNVRs for which fewer than 5 CNVs were called from either the HOL or the JER population (n = 1154 CNVRs). We observed that this filtering removed outlier CNVRs that were private to Vst and consisted of a small number of SNPs. After this filter, the 32 high Fst CNVRs were kept and the correlation coefficient was 0.52 (n = 310 CNVRs; Additional file 2: Figure S4 D-F). Also, CNVR 1458, which overlaps with ADAMTS17, showed a high Vst of 0.17 (Vst mean = 0.03, Vst S.D. = 0.04). Furthermore, when the copy number filter was applied to both populations, so that both HOL and JER had more than five copies of CNVs at each CNVR (n = 44), the correlation coefficient increased to 0.81 (Additional file 2: Figure S5).
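A minimal sketch of Vst for a single CNVR follows, using the commonly cited form Vst = (V_T − V_S) / V_T, where V_T is the variance of LRR values over all animals and V_S is the size-weighted average of the within-population variances. How LRR values were summarised per animal and CNVR in the study is not stated here, so one value per animal is assumed.

```python
from statistics import pvariance


def vst(lrr_pop1, lrr_pop2):
    """Vst = (V_T - V_S) / V_T for one CNVR and two populations.

    lrr_pop1, lrr_pop2: per-animal LRR values (assumption: one summary
    value per animal for the CNVR). Population variances are used so
    that the result stays within [0, 1].
    """
    n1, n2 = len(lrr_pop1), len(lrr_pop2)
    v_total = pvariance(list(lrr_pop1) + list(lrr_pop2))
    if v_total == 0:
        return 0.0  # no variation at all, hence no differentiation
    v_within = (n1 * pvariance(lrr_pop1) + n2 * pvariance(lrr_pop2)) / (n1 + n2)
    return (v_total - v_within) / v_total
```

Because Vst is driven by signal intensities rather than called genotypes, a CNVR supported by only a few probes can yield a noisy value, which is in line with the filtering described above.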

Linkage disequilibrium of CNVRs

A large number of genome-wide association studies (GWAS) have been performed using SNPs in livestock species, aiming to unravel genomic regions related to phenotypes of interest [50]. This approach exploits a large number of tagging SNPs that are in sufficient LD with causal variants. Under this framework, genetic variation caused by the causal variants is captured by the tagging SNPs, without knowing the exact causal variants. Thus, the genome-wide level of LD between SNP markers and causal variants is an important foundation of GWAS [51]. We showed that CNVRs overlap with genes more often than would be expected by chance, and that CNVs are thus likely to have an influence on phenotypes. The important follow-up question is whether the variation from CNVs is already captured by SNPs typed on commercial arrays, which are commonly used in livestock breeding programmes. We therefore investigated pairwise LD between bi-allelic CNVRs and neighbouring SNPs on the BovineHD SNP chip. We observed generally low r², close to zero, regardless of the distance between CNVRs and SNPs (results not shown). Subsequently, we categorized CNVRs by their allele frequency and type to investigate whether these factors influence the degree of LD. Common CNVRs have markedly higher LD (r² = ~ 0.1 for deletion CNVRs at ~ 10 kb distance) compared to other CNVR categories (Additional file 2: Figure S6). As common CNVRs had higher LD than the rest, we compared the LD of common CNVRs with the LD of SNPs in the same MAF range (0.05 ≤ MAF < 0.29 for HOL and 0.05 ≤ MAF < 0.37 for JER). We observed a distinct difference in LD decay patterns between the CNVR-SNP pairs and SNP-SNP pairs (Fig. 5a and b). SNP-SNP LD follows a typical LD decay pattern, with strong LD between SNPs in close vicinity and a gradual decline as the distance increases, whereas CNVR-SNP LD does not follow this pattern. Also, compared to the CNVR-SNP LD (r² = ~ 0.1 at ~ 10 kb distance), the frequency-matched SNP-SNP LD was stronger (r² = ~ 0.5 at ~ 10 kb distance).

Afterwards, we used another metric, taggability, to assess LD. Taggability is the maximum r² among the r² values obtained between a variant of interest and its paired SNPs. We calculated taggability for SNP-SNP pairs and CNVR-SNP pairs. For the CNVR-SNP pairs, we considered common deletion CNVRs only, as they showed the highest LD in the previous analyses. Then, mean taggability for each MAF class (bin size = 0.05) was plotted (Fig. 5c and d). The mean taggability of common deletion CNVRs is low (< 0.1) when MAF is below 0.05, and it increases as MAF increases. The SNP mean taggability follows the same pattern as that of common deletion CNVRs. However, in spite of the similar pattern, the taggability of common deletion CNVRs is below that of SNPs. This shows that there is a gap between SNP taggability and CNVR taggability.
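The two LD measures discussed above can be sketched from genotype dosages. This example assumes that r² is computed as the squared Pearson correlation between dosage vectors (0, 1 or 2 copies of an allele per animal) and that taggability is the maximum r² over the SNPs paired with a variant; the software and exact r² definition used in the study are not stated in this section.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("dosage vectors must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: correlation undefined, treated as 0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def taggability(target, neighbouring_snps):
    """Maximum r2 between a target variant and any of its neighbouring SNPs."""
    return max((r_squared(target, snp) for snp in neighbouring_snps), default=0.0)
```

Here `target` would hold the dosages of a bi-allelic CNVR (or of a SNP, for the SNP-SNP comparison) and `neighbouring_snps` the dosage vectors of the SNPs within the chosen distance window.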

Interesting CNVR

A large number of QTLs has been identified from various GWAS on a wide range of traits. As most GWAS have been done using SNP markers, chances are that genetic variation caused by CNVs could have been captured by QTLs that are in a high-to-perfect LD (r² = ~ 1) with the CNVs. Hence, inspecting CNVRs that are in high LD with QTLs is a preliminary step to identify potentially causal CNVs. To identify candidate causal CNVs, we subset the CNVR-QTL pairs, from the total CNVR-SNP pairs, based on the QTL information from the animal QTLdb [52]. We then subset the CNVR-QTL pairs further based on r², and kept high LD CNVR-QTL pairs only.
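The two subsetting steps can be sketched as a simple filter. The record layout, the function name, and the `min_r2` parameter are hypothetical, and no particular r² cut-off is implied; the 100 kb default mirrors the pairing distance used for the HOL population in this study.

```python
def high_ld_cnvr_qtl_pairs(pairs, qtl_snps, min_r2, max_distance=100_000):
    """Keep CNVR-SNP pairs whose SNP is a reported QTL and whose r2 is high.

    pairs: iterable of dicts with keys "cnvr", "snp", "distance" (bp), "r2"
    (hypothetical layout). qtl_snps: set of SNP identifiers listed as QTLs.
    """
    kept = [
        p for p in pairs
        if p["snp"] in qtl_snps
        and p["distance"] <= max_distance
        and p["r2"] >= min_r2
    ]
    # Strongest LD first, so candidate CNVRs appear at the top.
    return sorted(kept, key=lambda p: p["r2"], reverse=True)
```

Sorting by r² makes it straightforward to inspect the CNVRs in strongest LD with reported QTLs first.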

In total, ~ 100,000 bovine QTLs for various traits have been reported in the animal QTL database, and we identified 2519 QTLs that paired with 679 CNVRs within a distance of 100 kb in the HOL population. Among these, CNVR 547 (BTA6:84,395,081-84,428,819, deletion, MAF = 0.24) had the highest LD with 13 QTLs (average r² = 0.59; max r² = 0.74). The 13 QTLs were associated with casein proteins, which constitute four out of six bovine milk proteins. The four genes coding for the casein proteins are located in the so-called casein cluster (BTA6:85.4–85.6 Mb), which is ~ 1 Mb away from CNVR 547. Given that the LD between CNVR 547 and the QTLs is lower than perfect linkage, it is unlikely that CNVR 547 is the causal variant for the casein protein traits. Nevertheless, CNVR 547 was an interesting variant, as it was private to the HOL population with a high MAF (0.24), and was close to the casein cluster, which is highly relevant for dairy production.

Assuming that CNVR 547 is not the causal variant for the casein traits, a possible explanation for the high MAF is a selective sweep. Selective sweeps increase allele frequencies of neutral variants that are in LD with the variant targeted by selection, which in this case is probably located in the casein cluster. Two studies of Holstein populations support this hypothesis. Firstly, a selective sweep study in a German Holstein population revealed an extended range of LD in haplotypes that contain the casein cluster [53]. Secondly, a GWAS on casein traits in a Danish Holstein population identified a broad GWAS peak (BTA6:60–100 Mb) that contains the casein cluster [54]. The broad GWAS peak also indicates high LD in this region, which matches the findings of Qanbari et al. [53].

Another explanation for the high MAF of CNVR 547 might be direct selection on the variant itself. For instance, CNVR 547 overlaps with the UGT2B4 gene, which is involved in the detoxification pathway of exogenous compounds [55]. To see whether CNVR 547 overlaps with regulatory elements, besides overlapping directly with the upstream part of the UGT2B4 gene, we called promoters and enhancers from ChIP-seq data from Villar et al. [56]. CNVR 547 overlaps not only with the upstream part of the gene (the start codon and the first two exons), but also with the enhancer of UGT2B4 (BTA6: 84,413,246-84,413,740), and is thus likely to disrupt the function of the UGT2B4 gene. To summarize, our analyses imply that the high MAF of CNVR 547 might be due to a selective sweep in the casein cluster or the consequence of direct selection on CNVR 547 itself, owing to the functional impact of its overlap with UGT2B4 and its enhancer. Nonetheless, we cannot exclude drift as a possible driver of the high allele frequency of CNVR 547.

Discussion

In this study, we discovered CNVs using bovine high density SNP array data. Using CNVRs that are constructed using the CNVs, we reported the functional impact and population genetic features of the CNVRs. They are further discussed below.

CNV discovery in the genome build ARS-UCD1.2

We observed different CNV discovery results between UMD3.1 and ARS-UCD1.2. The different results were to be expected, given the different sequencing platforms used for the assemblies. Long-read sequencing platforms have been shown to perform better in resolving repeat regions, which is considered challenging with short-read sequencing [57]. The most intriguing difference was observed for the BTA12:70–77 Mb region. Based on the changes in BovineHD SNP positions between UMD3.1 and ARS-UCD1.2, we postulated that the two genome assemblies differ largely in this region and that these differences led to different CNV discovery results. We further postulated that this region (BTA12:70–77 Mb in UMD3.1) might contain repeated sequences, rather than the reported CNVs, for two reasons. Firstly, the SNP density in this region is a quarter of the genome-wide average SNP density in UMD3.1 (71 SNPs/Mb and 292 SNPs/Mb, respectively; Additional file 2: Figure S2). SNP probes in repeat regions can reduce the specificity of hybridization, and hence are often filtered out during SNP probe selection [58,59,60], which can explain why some regions show a sharp decrease in SNP density. Secondly, SNP probes in segmental duplications (sequence identity > 90%) can induce confounded deletion calls due to cross-hybridization of paralogous sequences [61]. Our data set based on UMD3.1 was indeed enriched for deletion calls in this region. We regard this large difference as evidence underlining the importance of the quality of the reference genome and its impact on CNV calling results.