Analysis of copy loss and gain variations in Holstein cattle autosomes using BeadChip SNPs
© Seroussi et al; licensee BioMed Central Ltd. 2010
Received: 17 May 2010
Accepted: 29 November 2010
Published: 29 November 2010
Skip to main content
© Seroussi et al; licensee BioMed Central Ltd. 2010
Received: 17 May 2010
Accepted: 29 November 2010
Published: 29 November 2010
Copy number variation (CNV) has been recently identified in human and other mammalian genomes, and there is a growing awareness of CNV's potential as a major source for heritable variation in complex traits. Genomic selection is a newly developed tool based on the estimation of breeding values for quantitative traits through the use of genome-wide genotyping of SNPs. Over 30,000 Holstein bulls have been genotyped with the Illumina BovineSNP50 BeadChip, which includes 54,001 SNPs (~SNP/50,000 bp), some of which fall within CNV regions.
We used the BeadChip data obtained for 912 Israeli bulls to investigate the effects of CNV on SNP calls. For each of the SNPs, we estimated the frequencies of occurrence of loss of heterozygosity (LOH) and of gain, based either on deviation from the expected Hardy-Weinberg equilibrium (HWE) or on signal intensity (SI) using the PennCNV "detect" option. Correlations between LOH/CNV frequencies predicted by the two methods were low (up to r = 0.08). Nevertheless, 418 locations displayed significantly high frequencies by both methods. Efficiency of designating large genomic clusters of olfactory receptors as CNVs was 29%. Frequency values for copy loss were distinguishable in non-autosomal regions, indicating misplacement of a region in the current BTA7 map. Analysis of BTA18 placed major quantitative trait loci affecting net merit in the US Holstein population in regions rich in segmental duplications and CNVs. Enrichment of transporters in CNV loci suggested their potential effect on milk-production traits.
Expansion of HWE and PennCNV analyses allowed estimating LOH/CNV frequencies, and combining the two methods yielded more sensitive detection of inherited CNVs and better estimation of their possible effects on cattle genetics. Although this approach was more effective than methodologies previously applied in cattle, it has severe limitations. Thus the number of CNVs reported here for the Holstein breed may represent as little as one-tenth of inherited common structural variation.
The Holstein-Friesian breed is the world's highest-producing dairy cattle; much of its outstanding milk production was gained by selection of elite artificial insemination (AI) bulls based on breeding values that were estimated by progeny testing. Genomic selection is a newly developed tool for the estimation of breeding values through the use of genome-wide genotyping of single nucleotide polymorphisms (SNPs). Over 30,000 Holstein bulls have been genotyped with the Illumina BovineSNP50 BeadChip , which includes 54,001 SNPs (~SNP/50,000 bp). This chip may capture any genetic variance that is genetically linked to these markers, as well as copy number variations (CNVs) [2, 3]. A CNV is a structural variation, including deletion, duplication, translocation or inversion. CNV has been recently identified in human and other mammalian genomes, and it is now recognized that CNV might be a major source of heritable variation in complex traits . In humans, over 14,478 CNV loci have been recorded based on 89,427 different entries that cover about one-third of the genome. Of these entries, 65% include CNVs that range mostly between 1 and 10 kb and 34% are indels in the range of 100 bp to 1 kb http://projects.tcag.ca/variation/. CNV regions (CNVRs) encompassing adjacent or overlapping losses or gains cover 12% of the human genome. Hence, this source of variation has more nucleotide content per genome than SNPs . However, assuming an average spontaneous CNV mutation rate of 1/10,000 per locus , it is expected that a considerable portion of the reported entries arise from de novo CNVs of a sporadic nature.
Several algorithms for CNV identification from SNP arrays are available . Following reports that PennCNV was the most reliable algorithm in the detection of CNVs from Illumina BeadChip data [7, 8], we chose this software to analyze signal intensity (SI) data. PennCNV is a CNV detection tool that incorporates multiple sources of information, including the ratio of total SI to allelic intensity at each SNP marker. This software was originally developed for Illumina whole-genome BeadChip arrays .
The introduction of AI to modern dairy herd management has resulted in a loss of genetic diversity in Holsteins and the effective size of the Holstein population is low (e.g. 39 in the USA ). Since the Israeli Holstein population has been under intensive selection for 50 years, its genetic pool is expected to have similar characteristics. Although it is now accepted that genomes vary more at the structural level than at the nucleotide-sequence level, little has been published on CNVs in Holsteins.
In a study that validated the quality of BovineSNP50 BeadChip performance , population-wide genotyping of Israeli Holstein bulls was initiated in order to introduce genomic selection into the Israeli breeding program. Our study makes use of these data to describe the frequent CNV in Holsteins and investigate its effect on BeadChip calls. We propose to combine Hardy-Weinberg equilibrium (HWE)-based and SI-based methods to reliably detect CNVs of the deletion and duplication types that are not de novo or sporadic CNVs, which are less likely to be of any economic value.
We used the data obtained for 912 Holstein bulls to investigate the effects of CNVs on BovineSNP50 BeadChip calls. For each of the SNPs, we estimated the frequency of occurrence of deletions and insertions using a generalization of the Hardy-Weinberg principle for more than two allele frequencies (p, q) by assuming presence of a third allele (r). Under the assumption of three-alleles, expected HWE frequencies are obtained by the trinomial expansion of (p + q+ r)2 = 1. Defining 'n' as the number of individuals sampled, 'pqo' as the detected number of individuals with phenotype similar to the pq heterozygote phenotype divided by n, 'po' as the number of allele p-like homozygotes divided by n, and 'qo' as the number of allele q-like homozygotes divided by n, the solution for this expansion in the case of a null allele r is: rl = [0.25 - 0.25pqo + poqo/pqo]0.5 - 0.5; in the case of gain of an allele which consists of both types, r is: rg = [po + qo + pqo]0.5 - po 0.5 - qo 0.5 (see Additional file 1 for a detailed mathematical solution and Additional file 2 for allele distribution, χ2 test, and rl and rg values for all 54,001 SNPs of the Illumina BovineSNP50 BeadChip). Average values of rl and rg for autosomal markers were -0.3% ± 2.7 and 0.5% ± 2.1, respectively. For non-pseudoautosomal markers on the X chromosome (positions 28,044-86,115,497), where the model of inheritance does not fit the model under which the formulas were developed, average values of rl and rg were 361% ± 258 and -25% ± 16, respectively. Thus, encountering extreme values (above 100%) for rl, when analyzing autosomal markers, may indicate an error in the mapping of markers that are actually located on sex chromosomes. Although the HWE deviation is an important factor in CNV occurrence, other reasons than erroneous positioning of markers of sex chromosomes may also exist; for example, systematic problems in distinguishing the alleles, due to technical failures. However these are unlikely, because of the high quality of the Beadchip technology .
Using the PennCNV detection module, we analyzed the autosomes of each of the 912 bulls for CNVs. From the output of this analysis, which contained the chromosomal positions and copy numbers of the detected CNVs, the frequency of loss or gain for each SNP marker was calculated (Additional file 2). Average loss and gain values for autosomal markers were lsi = 0.02% ± 0.2 and gsi = 0.12% ± 0.15, respectively.
As the Holstein population has a very low effective population size, it was expected that some of the CNV alleles within our sample would be common. However, all of the CNVs detected by PennCNV were relatively rare (Figure 1A),) and we suspected that this may arise from setting up input parameters that are too stringent. Therefore, we examined several setups for running this program using different HMM input files and different length restrictions on the CNV chromosomal size (data not shown). We then calculated the correlations for each setup, with no limit on CNV size. We adopted the setup that gave the highest correlations between the SI-based and HWE-based methods with the expected direction signs (Figure 1B). Since "negative loss" is equal to gain, it was generally expected that negative correlations would be displayed when comparing the loss and gain methods. SI gain and loss involves a simple count of events, and therefore could only yield positive correlations when CNV exhibited both LOH and duplication alleles. A negative correlation (r = -0.87) was indeed observed between the HWE-based gain and loss methods, indicating that the equations presented for calculation of loss and gain frequencies also function well for loci that do not fit the model for which they were developed. For example, in a locus where LOH was frequent, the absolute value of rg would be similar to rl but with a negative sign. Correlations between the LOH/CNV frequencies predicted by the HWE-based and SI-based methods was low (up to r = 0.08). Nevertheless, a number of loci displayed above-average CNV frequencies by both HWE-based and SI-based methods simultaneously. The discrepancy between the methods may be explained by random deviations from HWE (minor deviation from HWE could have also arisen of selection); by de novo CNVs that do not affect HWE; and by the conservative thresholds on the detection of CNVs by PennCNV, which presents moderate power with a low false-positive rate .
A total of 221 markers displaying frequencies that were more than one standard deviation (SD) above average for LOH (2.33% and 0.22% for rl and lsi, respectively) were included in the data set for regions of LOH variation. A total of 515 markers displaying frequencies that were more than one SD above average for duplication (2.67% and 0.27% for rg and gsi, respectively) were included in the data set for copy gains. Since these markers tended to cluster together, and CNVs may affect expression of genes that are up to 0.5 kb away , such adjacent markers were assigned to the same CNVR (Additional file 3). These two data sets were combined and compared to the available CNV annotations in cattle (Additional file 3, LOH and CNV sets are labelled in red and blue, respectively; yellow and white labels indicate previously published CNVs that were confirmed to be within 0.5 kb, or not detected in this study, respectively). The actual length of the predicted CNVs cannot be accurately assigned using the BeadChip data, and CNV of a region that is evident from a single marker may belong to a region shorter than 1000 bp, which is usually referred to as an indel. In total, we detected 169 indels/CNVs/CNVRs of copy losses (LOH) and 246 of copy gains. These were compared to 86 documented cattle CNVs with recorded frequency above 2.5% [2, 3], and with 141 frequent CNVRs . The latter study analyzed only 20 individuals, with an average frequency of detection of 3 ± 2: we assumed that CNVRs detected in three or more individuals are likely to be frequent. Defining confirmation as co-occurrence of a documented CNV within 0.5 kb of a CNV detected here, 32 (37%) of the CNVs reported in  and , and 28 (20%) of those reported in  were confirmed by the present study. Another line of evidence supporting our list of LOH variations was that most (68%) of these markers had significantly high rates of missing calls. The average for "no calls" was 15 ± 39 out of 912 bulls genotyped for the autosomal markers, while for the 221 selected LOH markers, the average was 201. An increase in no-call rate is expected with an increase in the frequency of null alleles, as individuals that are homozygous for the null allele should fall within the no-call category and the expected number of individuals with no call should be ≥nrl 2 (see Additional file 1). The higher than expected frequency of SNPs with deviation from the expected HWE frequencies calculated using the χ2 test was also an indication of CNVs. There were 47,154 autosomal polymorphic SNPs, of which 4,486 (9.5%) had probabilities <0.05 for HWE. Despite selection against non-HWE SNPs during the BeadChip preparation , their fraction is nearly double that expected by chance. While overall, 9.5% of the autosomal polymorphic BeadChip markers had probabilities <0.05, frequencies for markers meeting this criterion in the lists of LOH (221) and CNV gain (515) were 83% and 51%, respectively.
In our study, some of the detected CNVs may have arisen from spontaneous CNV mutations. High rate of de novo mutations for several human diseases caused by CNV has been observed . De novo mutations may also explain the low rate of verification (<10%) of candidate CNVs detected by SNP arrays reported in human studies . To reliably detect CNVs that are expected to have a functional impact and are not de novo or sporadic, we targeted those that are frequently detected by both HWE-based and SI-based methods. Nevertheless, the sporadic occurrence of CNVs in the genomes of ancestral key bulls in regions that are neutral for selection are expected to result in some frequent CNVs that have no function. Indeed, we could not associate any functional gene with 61 (15%) of the common CNVs that we reported. These are indicated in the corresponding gene column as 'none' or pseudogene (Additional file 3).
Gene ontology (GO) categories significantly overrepresented in Holstein CNV.
Human ref. gene #
CNV gene #
Wnt signalling pathway1
Alzheimer disease-presenilin pathway1
Cadherin signalling pathway1
Cell surface receptor-mediated signal transduction2
The CNRQs of two sires (3981, CNRQ = 0.28 ± 0.06; 7133, CNRQ = 0.2 ± 0.08) suggested that they were homozygous for the deletion. In the sample of 132 sires an estimate of frequency of the deletion allele can be derived as 12%, based on the occurrence of two homozygotes and assuming a Hardy-Weinberg distribution of genotypes. Since the number of homozygotes will have a Poisson distribution, the 95% confidence interval will not be symmetric, and extends from 0.6 to 7.2 homozygotes, which is equivalent to a confidence interval of 6.7% to 23.4% for the deletion allele. Thus, for this CNVR, the average LOH frequencies of the HWE-based and SI-based methods (Figure 3), was within the confidence interval.
Significant positive correlation (r = 0.5) was observed between CNRQs of the two amplicons in 100 sires that passed quality thresholds for both amplicons. This indicates that a portion of the sires analysed displayed CNV that spanned both amplicons, yet some of them may had other CNV alleles with borders that excluded one of the amplicons. Hence, the qPCR analysis of both amplicon supported the prediction of CNVR #456 in this work.
Expansion of HWE and PennCNV analyses enabled an estimation of LOH/CNV frequencies, and combining these methods yielded better detection of inherited CNVs. Correlation between LOH/CNV frequencies predicted by the HWE-based and SI-based methods was low (up to r = 0.08). The highest correlation was observed for the minimal CNV length of 1 SNP for the PennCNV analysis. Under these conditions, 418 locations displayed significantly high frequency by both methods. Efficiency of designating large genomic clusters of ORs as CNVs was 29%. Frequency values for copy loss were distinguishable in non-autosomal regions and for the values obtained for BTA7 positions 76,944,037-77,340,598, suggesting misplacement of the X chromosomal region containing CNV of the melanoma antigen gene onto BTA7 in the current bovine genome assemblies. Analysis of BTA18 placed important net merit QTLs in regions rich in segmental duplications and CNVs. Enrichment of transporters in CNV loci suggested their potential effect on milk-production traits. Although our approach for identifying common CNVs was more effective than previous methodologies applied in cattle, it has severe limitations. Thus the number of CNVs reported here for the Holstein breed may be part of a 10-fold larger repertoire of inherited structural variation that has yet to be described.
DNA was extracted from the semen of 912 Holstein bulls used for AI in Israel http://www.icba-israel.com/cgi-bin/bulls/en/bl_main.htm. These included sires born in Israel, as well as international sires originating from France (4), Germany (2), the Netherlands (26) and the USA (27). The sires' DNA was genotyped using BovineSNP50 BeadChip (Illumina, Inc.), which included 54,001 SNPs as described previously .
Frequencies for copy loss (rl) and gain (rg) were calculated using the formulas rl = [0.25-0.25pqo+poqo/pqo]0.5 - 0.5 and rg = [po+qo+pqo]0.5 - po 0.5 - qo 0.5 (see Additional file 1 for a detailed explanation). The rl value could also be calculated from the number of missing calls in cases in which no calls were observed as a result of a homozygous null allele, and not as a result of technical problems. However, the low correlation (r < 0.2) between rl values calculated by these two methods indicated that technical problems did in fact play a role in most of the observed missing calls. This prompted us to routinely use a sample size (n) computed as the number of sires that were successfully called according to the default settings of GenomeStudio. Using this n value in cases of frequent copy loss led to a slight increase in rl over the value obtained when using the total sample size (n = 912).
PennCNV input SI files for each bull were prepared from an Illumina report containing the SNP name, sample ID, B-allele frequency and log R ratio, using the split option of the kcolumn.pl program in the PennCNV package. For a list of the names of these 912 intensity files see list.txt; CNVs were detected using the B-allele frequency file (BovineSNP50K.pfb), the HMM parameter file distributed with the PennCNVhttp://www.openbioinformatics.org/penncnv/penncnv_download.html, and the following command line: perl detect_cnv.pl -test -hmm example.hmm -pfb BovineSNP50K.pfb -conf -log 1.log -out 1.rawcnv -minsnp 1 -lastchr 29 -listfile list.txt. Frequencies for loss and gain of each genetic marker were calculated using a Perl script that counted the total number (Nt) of copies detected for each marker. Two copies were assumed for each marker that was not included in the PennCNV report (1.rawcnv), which contained copy numbers (1, 3, 4, 5) for the CNVs of each bull. Percent frequencies (100|1-Nt/1824|) are reported based on dividing this count by the number of expected chromosomes (1824).
Determination of the relative copy number of two amplicons within CNVR #456 (additional file 3) on BTA18 was conducted using a qPCR analysis. The genomic sequence near the extreme SNP positions of this CNVR was analysed for repetitive elements and presence of SNPs http://www.ensembl.org. The primer pairs were designed in repeat and SNP free sequences and were as follows: amplicon I (100 bp) near SNP position 41,760,794 (5'-CTGTTCCTCCAGCATTTCGT-3'; 5'-TTCCTTTTCCCCAGGACTTT-3') and amplicon II (151 bp) near SNP position 41,907,693 (5'-CCATCAGGTTTAAGGGACACA-3'; 5'-CCCCGAAGGTAGAAGTGACA-3'). Gene copy number was normalized to an amplicon of 96 bp of an autosomal reference gene bovine RPP30 (GeneID:615098, BTA26, positions 12,893,277-12,893,372) using PCR primers (5'- TGCTTCCATTGTTTCCTGATGA-3'; 5'-TGGGACCAGGTTCCATGATC-3'). RPP30 is used as a reference gene in human CNV studies . No CNV was reported for this gene region in previous studies of CNV in cattle [2, 3, 13] including this study. Copy number was determined as previously described . Briefly, 5-point standard curve (0.1-62.5 ng of DNA) was generated in duplicate for a mixture of ten reference individuals. Test individuals were assayed in duplicates using 30 ng of DNA per reaction. Absolute Blue SYBER Green ROX Mix (Thermo Fisher scientific, UK) Kit was used for nucleic acid detection. Reactions were performed at 95°C for 15 min followed by 40 cycles of 95°C for 15 s and 60°C for 1 min using an ABI Prism® 7000 sequence detection system. Amplification was followed by a dissociation curve analysis to confirm the presence of a single product and the absence of primer dimers. The qbasePLUS software (Biogazelle, Ghent, Belgium) was used for calculation and quality control of relative quantities using RPP30 for normalization. Samples that did not pass quality control because of excessive variance between replicates (>0.5 standard error in number of copies per haploid genome) were excluded from further analysis.
Gene content of CNVRs was determined using Ensembl http://www.ensembl.org/Bos_taurus/Location/ and Gene Entrez http://www.ncbi.nlm.nih.gov/gene. Genes located up to 250 kb from the CNV borders were regarded as part of the CNVR in cases for which no genes were identified within that region. Since the bovine genome is not well annotated compared to the human genome, we used the human orthologs for gene ontology analyses. The corresponding human GeneIDs were identified using NCBI HomoloGene. When no orthologs were identified using HomoloGene, selection of alternate orthologs was based on BLAST similarity. The PANTHER classification system http://panther6.ai.sri.com/tools/compareToRefListForm.jsp was used to assess the probability of overrepresentation within the list of human orthologs of certain pathways, biological processes and molecular functions using the default Bonferroni correction for multiple testing.
Contribution from the Agricultural Research Organization, Institute of Animal Science, Bet Dagan, Israel, No. 565/10. This research was supported by grants from the Israel Milk Marketing Board; the European Sixth Research and Technological Development Framework Programme, Proposal No. 016250-2 SABRE, and Research Grant Award No. IS-4201-09 from BARD, The United States - Israel Binational Agricultural Research and Development Fund. We acknowledge the contribution of M. Ron in providing the BeadChip experiment data. Genotyping was performed by A. Schein and N. Avidan, Pharmacogenetics and Translation Medicine Center, The Rappaport Institute for Research in the Medical Sciences, Technion, Haifa, Israel.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.