Estimation of linkage disequilibrium in four US pig breeds
© Badke et al; licensee BioMed Central Ltd. 2012
Received: 8 September 2011
Accepted: 17 January 2012
Published: 17 January 2012
Skip to main content
© Badke et al; licensee BioMed Central Ltd. 2012
Received: 8 September 2011
Accepted: 17 January 2012
Published: 17 January 2012
The success of marker assisted selection depends on the amount of linkage disequilibrium (LD) across the genome. To implement marker assisted selection in the swine breeding industry, information about extent and degree of LD is essential. The objective of this study is to estimate LD in four US breeds of pigs (Duroc, Hampshire, Landrace, and Yorkshire) and subsequently calculate persistence of phase among them using a 60 k SNP panel. In addition, we report LD when using only a fraction of the available markers, to estimate persistence of LD over distance.
Average r 2 between adjacent SNP across all chromosomes was 0.36 for Landrace, 0.39 for Yorkshire, 0.44 for Hampshire and 0.46 for Duroc. For markers 1 Mb apart, r 2 ranged from 0.15 for Landrace to 0.20 for Hampshire. Reducing the marker panel to 10% of its original density, average r 2 ranged between 0.20 for Landrace to 0.25 for Duroc. We also estimated persistence of phase as a measure of prediction reliability of markers in one breed by those in another and found that markers less than 10 kb apart could be predicted with a maximal accuracy of 0.92 for Landrace with Yorkshire.
Our estimates of LD, although in good agreement with previous reports, are more comprehensive and based on a larger panel of markers. Our estimates also confirmed earlier findings reporting higher LD in pigs than in American Holstein cattle, especially at increasing marker distances (> 1 Mb). High average LD (r 2 > 0.4) between adjacent SNP found in this study is an important precursor for the implementation of marker assisted selection within a livestock species.
Results of this study are relevant to the US purebred pig industry and critical for the design of programs of whole genome marker assisted evaluation and selection. In addition, results indicate that a more cost efficient implementation of marker assisted selection using low density panels with genotype imputation, would be feasible for these breeds.
The extent of non-random association of gametes at different loci, or linkage disequilibrium (LD), has become the focus of many recent studies in both humans and animals [1–4]. Gaining knowledge of the distribution of LD in livestock populations is important for genetic mapping of economically important traits such as disease resistance , and it can reveal population history and breed development [6, 7]. Moreover, genome wide association (GWAs) studies as well as genomic selection in livestock rely on the existence of LD between causative variants and genetic markers [8, 9]. Recent advances in genotyping technology allow high density genotyping of single nucleotide polymorphisms (SNP) for several livestock species such as cattle , chickens  and pigs . Obtaining high density genotypes from a sample of individuals allows for the estimation of genome-wide LD and persistence of phase among breeds .
Previous studies have shown that the extent and persistence of LD in livestock [14–16] is much larger than that found in human populations , due to selection and smaller effective population size in livestock species [1, 17]. Using dense markers to cover the genome increases the likelihood of SNP markers to be in high LD with causative genes altering important production phenotypes . Meuwissen et al.  proposed that the merit of these markers in livestock would be in the parallel use of all markers to derive genomic breeding values (GEBV) as a composite score of all individual SNP effects rather than improving mapping of quantitative trait loci (QTL).
The implementation of genomic selection using GEBV has been successful in dairy cattle [8, 20, 21], and is currently being tested in laying chickens , and pigs . The reliability of GEBV prediction relies on the level of LD between markers and QTL, the origin of such LD (either within family or population-wise), the number of animals used in the training population as well as heritability of the trait . In this study it is our objective to estimate and describe genome wide levels of LD in four pig breeds using high density genotypes. We also estimate population-wise LD for a variety of panels with lower marker density in order to estimate the number of markers needed to reach a given level of LD. We estimate persistence of phase between the four breeds in this study as a measure of relationship between these populations.
Average r 2 at various distances in four breeds
Average r 2 between adjacent SNP for sparse marker panels
% of SNP kept1
average r 2 2
average distance in kb 2
average r 2 2
average distance in kb 3
average r 2 2
average distance in kb 3
average r 2 2
average distance in kb 3
Pairwise breed comparison of correlation of phase and proportion of phase agreement at various distances
Proportion of r with opposite sign2
Correlation of r ij(k) and r ij(k') 3
Duroc - Hampshire
Duroc - Landrace
Duroc - Yorkshire
Hampshire - Landrace
Hampshire - Yorkshire
Landrace - Yorkshire
Current effective population size of the breeds used in this study was previously estimated, using pedigree information, to be between 74 (Landrace) and 113 (Duroc, Yorkshire) . Consistent with having the largest current effective population size, we find that long range r 2 (10 Mb, Figure 1) estimated from our data was smallest for Duroc and Yorkshire (0.035, 0.03). In Hampshire, a smaller effective population of 109 corresponded to higher r 2 at 10 Mb (0.046). Due to the similar long range r 2 (0.035) at 10 Mb we would have expected the Landrace population to have an effective population size comparable to that of Duroc and Yorkshire. However, using pedigree data Welsh et al.  estimated the current effective population size of Landrace to be 74. The reason for this discrepancy remains unknown. Several previous studies investigated LD in pigs using reduced numbers of microsatellite markers and fewer animals from commercial populations [17, 27]. Nsengimana et al.  found relatively large estimates of LD (D') from 0.29 to 0.41 using 15 microsatellite markers. In contrast, using r 2 instead of D' and thereby correcting for minor allele frequency, Harmegnies et al.  found r 2 ranging from 0.15 to 0.19 for marker distance < 1 cM and 0.10-0.12 for markers spaced between 1 cM and 5 cM, using 29 microsatellite markers on SSC15, comparable to our results of r 2 between 0.16-0.22 for markers spaced between 1 and 5 Mb. Du et al.  estimated r 2 from 4,500 SNP markers in six commercial lines of pigs and found estimates of average r 2 = 0.51 for markers less than 0.1 cM apart, and estimates of 0.21 and 0.07 at marker distances of 1 cM and 5 cM respectively. Similarly, our populations had average r 2 of 0.15 to 0.20 and 0.05 to 0.08 at marker distances of 1 Mb and 5 Mb, respectively. A recent study conducted by Uimari & Tapio  used the same genotyping platform as our study to estimate r 2 and effective population size in Finnish Landrace and Yorkshire populations. Uimari & Tapio  found average r 2 of 0.43 and 0.46 for adjacent markers in the Finnish Landrace and Yorkshire populations, respectively, which was higher than our results of 0.36 for Landrace and 0.39 for Yorkshire. In addition, Uimari & Tapio reported that the r 2 for markers spaced at 5 Mb decreased to 0.09 and 0.12 in the Finnish Landrace and Yorkshire breeds, respectively. In the present study, r 2 declined further to 0.05-0.06 at 5 Mb marker spacing for Landrace and Yorkshire (Table 1). The higher average r 2 for distant (r 2 > 0.2 for 1 Mb) markers in the Finnish populations could be explained by smaller effective population size of the Finnish populations, causing higher r 2 on average. This is partially confirmed by comparing the estimated effective size of the Finnish populations (N e = 91, 61 for Landrace and Yorkshire, respectively) , to estimated effective population sizes of the populations used in the current study reported by Welsh et al.  (N e = 74, 113 for Landrace and Yorkshire, respectively), where the current effective population size for Finnish Yorkshire is approximately half that of our Yorkshire population. Compared to recent estimates from Canadian populations we found estimates of average r 2 for markers with pairwise distance below 100 kb to be consistent in Landrace (US: 0.34, Canadian: 0.31) and Yorkshire (US: 0.37, Canadian: 0.32) . However, in Duroc estimates of average r 2 for markers with pairwise distance below 100 kb were considerably higher in the US population (0.42) compared to the Canadian population (0.31) .
Persistence of phase can be used to infer upon the history of a species and relatedness of breeds within that species as well as on reliability of across population GWA and GEVB prediction . Persistence of phase was previously reported for three Canadian swine breeds (Duroc, Landrace, Yorkshire) . For SNP with pairwise distance below 50 kb we estimated persistence of phase to be 0.88 between Landrace and Yorkshire and 0.82 for both Landrace and Yorkshire with Duroc. In the Canadian breeds persistence of phase also indicates a closer relationship between Landrace and Yorkshire (0.82) and a more distant relationship between Landrace/Yorkshire and Duroc . We found correlation of phase of 0.82 for Landrace/Yorkshire with Duroc, while the Canadian breeds had 0.66/0.65, indicating less agreement of phase even at short pairwise distance . Our results showed that correlation of phase for the pig breeds in this study ranged between 0.87 for Duroc-Yorkshire and 0.92 for Landrace-Yorkshire for markers with pairwise distance < 10 kb. Previous research in Australian cattle breeds  showed correlation of phase between 0.68 for Australian Angus-New Zealand Jersey to 0.97 for Dutch Holstein-Black and White. At increasing marker distance, correlation of phase for the pig breeds in this study decreased (range in r: 0.41 to 0.57 at an average pairwise marker distance of 1 Mb). This decrease however was less than the decrease de Roos et al.  observed in all but two of the cattle breeds they considered (< 0.4 for markers spaced 1 Mb). While correlation of phase was similar between these pig breeds and dairy cattle at short range (< 10 kb), the pig breeds showed generally larger correlation of phase than the dairy cattle  at increasing marker distances.
If two populations diverged from a common ancestral population, their correlation of phase can be expressed as r 0 2 (1-c) 2T , where r 0 2 is a measure of LD in the common ancestral population, c is the recombination distance between markers and T is time since breed divergence in generations . For markers as close as 10 kb the recombination distance c will be almost 0, so that correlation of phase at those short distances can serve as an estimation of r 0 2 in the common ancestral population. Since correlation of phase was comparable in the pig populations (0.87-0.92) for markers with pairwise distance below 10 kb to that reported in Australian cattle (0.80-0.97) , LD in the common ancestral pig population is likely to be similar to that in the common ancestral population of Australian cattle breeds. Larger correlation of phase at increasing marker distance (1 Mb) in the pig populations used in this study (0.41-0.57) compared to Australian cattle breeds (< 0.40) suggests that T is smaller in our pig breeds than it is in the cattle breeds. The expected correlation of r between two breeds can be expressed as e -2cT . To estimate the time since breed divergence for the pig breeds in this study we used SNP with pairwise distance between 10 kb and 300 kb, and estimated correlation of phase for each 2.5 kb interval. We calculated the linear regression of the natural logarithm of the estimated correlation of phase onto the average pairwise distance c. The slope of this regression is an estimate of -2T. Consequently, the slope divided by -2 is the number of generations (T) since these two breeds have diverged . Results suggest that the pig breeds in this study diverged approximately 40-66 generations ago. The expected correlation of phase would decrease to 0.41 and 0.02 at 1 cM and 5 cM distance respectively in the Yorkshire-Landrace comparison, assuming T of 40 and r 0 2 of 0.92. We observed a correlation of phase of 0.57 and 0.30 at 1 Mb and 5 Mb, respectively, between these two breeds, indicating that a T of 40 may overestimate the actual time since breed divergence. One possible cause of this observation is admixture between these two breeds, causing more common LD between them than what would be expected from fully diverged breeds . We obtained the date of herd book closure for each of the breeds in this study, and assuming a generation interval of approximately 2 years , Duroc, Hampshire, Landrace, and Yorkshire have existed as distinct breeds for at least 38.5, 44.5, 31.5, and 30.5 generations, respectively. The time of herd book closure does not directly indicate the time since breed divergence, since distinguishable breeds must have existed before herd book closure. Nevertheless, the time of herd book closure further supports our observation that Landrace and Yorkshire have developed as separate breeds later than Duroc and Hampshire.
Our results have several important implications for future implementation of genomic selection in swine. Accuracy of prediction of genome wide marker assisted selection can be directly affected by the chosen marker density (resulting in average r 2 between markers and QTL), and the size of the training population .
The currently used marker panel, containing approximately 40,000 usable markers, had average r 2 of approximately 0.4 between adjacent markers for all four breeds. That exceeds the level of r 2 = 0.2 simulated by Meuwissen et al.  to reach prediction GEVB accuracy around 0.85. Furthermore, our results indicated that reducing the original marker panel to 10% of the markers (3,000-4,000 SNP) still resulted in average r 2 for adjacent markers exceeding 0.2 in all four breeds. On the other hand, recent research in Australian Holstein Friesian cattle has shown  that using subsets of 3000-5000 SNP to estimate direct genomic breeding values (DGV) could only reach 80% of the prediction accuracy previously estimated using approximately 42,000 SNP. Such a reduction in prediction accuracy will be unacceptable for most practical implementations. However, the accuracy of GEBV predicted by low density panels can be increased through the use of genotype imputation , where high density genotypes are imputed using low density SNP genotypes and a high density reference panel of haplotypes . Weigel et al.  used approximately 10% of 2,693 SNP from Bos Taurus chromosome 1 to impute the full SNP set in a Jersey population. They found that using a high density reference genotype panel (n = 2,542 animals), the imputation accuracy of the non-typed markers was between 0.86 and 0.94. Average r 2 in our populations ranged from 0.36 to 0.48 for markers less than 100 kb apart, comparable to average r 2 = 0.38 for markers spaced at < 100 kb in the Jersey population . Assuming a comparable decline of LD for increasing marker distance between the Jersey population and our pig populations, we would expect to accurately impute approximately 90% of the high density genotypes, using a low density panel containing 10% of the markers. More recent results reported even higher average accuracy of imputation (approximately 95%) when imputing 42,000 SNP in the Bovine 50 K using the 3 K subset in Holstein cattle . To assess the accuracy of GEBV estimated from imputed genotypes Weigel et al. , used the same Jersey population from their previous study , and they found that the accuracy of GEBV based on imputed markers was 95% of the accuracy of the GEBV estimated using the observed genotypes . As noted above average r 2 is similar between the American Jersey population and our pig populations, suggesting that future research in genomic selection in swine should explore the use of imputed low density genotypes to increase cost efficiency. Previous research in humans , and European Holstein cattle  indicated that combining haplotypes from closely related populations can increase the accuracy of genotype imputation, while research in sheep suggests that breed specific reference haplotypes would yield better accuracy . The success of combined haplotypes for genotype imputation depends on the relatedness between the populations. Further research is necessary to determine if persistence of phase is large enough in our pig populations to increase imputation accuracy when combining reference haplotypes across breeds. As noted by Goddard , the accuracy of GEBV prediction can be expressed as a function of the LD between marker and QTL and the accuracy of estimated SNP effects. The loss in accuracy of GEVB prediction caused by imputing instead of observing genotypes could be compensated by increasing the number of animals used to estimate SNP effects. If not enough animals are available for the estimation of SNP effects, animals from different, but closely related, populations could be combined to estimate SNP effects for GEBV prediction in both populations [13, 39]. The squared short-range (< 10 kb) correlation of phase can also serve as the accuracy with which we can predict a marker-QTL association in one population using known marker-QTL associations from another population. For the pig breeds reported in this study the squared correlation of phase for close markers (0-100 kb) ranged from 0.53 to 0.67. To evaluate whether these accuracies would warrant the use of a combined training population to estimate SNP effects accurately for both populations we refer to a simulation study conducted by de Roos et al.  estimating the accuracy of GEBV prediction for combined training populations of highly, moderately and lowly related populations. Correlation of phase for populations diverged approximately T=30 generations ago was reported to be below 0.80 for markers with pairwise distance below 0.055 cM . We found correlation of phase between Landrace-Yorkshire of around 0.80 at a corresponding marker distance. De Roos et al.  concluded that reliability of GEBV prediction could be increased between 0.05-0.10 points in two populations, when approximately 40,000 marker genotypes are available, heritability is h 2 =0.3 or higher, 1000 animals from each population were used to estimate SNP effects, and under the assumption that QTL effects are the same for both populations . In addition, they found that for genetically distant populations, at least 1,000 animals with genotypes and phenotypes available in each population were needed to avoid a decrease in the reliability of prediction . When SNP effects estimated in one population are used to calculate GEBV for another population which diverged approximately T=30 generations ago, the reliability of the predicted GEBV was 0.65 assuming both high marker density (M = 40,000) and heritability h 2 = 1 . Consequently, combining animals into a multi-breed panel to estimate SNP effects is likely to be only marginally beneficial for the pig breeds in this study, given the estimated correlation of phase and the large number of animals and markers required .
We used the PorcineSNP60 chip  to obtain high density genotypes (34,000-40,000 SNP) from pig trios in four breeds. From this data we estimated r 2 as a measure of LD across the genome as well as correlation of r, which measures phase agreement between breeds. We found r 2 of approximately 0.4 for markers less than 100 kb apart, which is higher than comparable estimates reported for North American Holstein cattle  as well as various Australian cattle breeds . The same was true for average r 2 between markers with pairwise distance larger than 1 Mb, indicating a smaller past effective population size of these pig breeds. We also report a relatively slow rate of decay of LD over distance, observing r 2 around 0.2 at 1 Mb. The comparably high long range LD is an indicator that good accuracy can be expected for future implementations of GEBV in pigs using 10% (3,000-4,000) of the SNP used in the current assay or less, along with genotype imputation. We would encourage future research in genomic selection in swine to especially focus on the possible benefits of the combined use of reduced marker panels and genotype imputation. To successfully promote the use of genomic selection in swine it will be necessary to increase cost efficiency while maintaining high accuracy of prediction. Currently no low density panels for SNP genotyping are publicly available for swine, but the presented results will be available to aid in the development of efficient SNP platforms. Relatively low persistence of phase reported here implies that the use of multi-breed panels estimating SNP effects for genomic selection will likely be limited, especially when using low density genotypes, but the merit of combining reference haplotypes for genotype imputation should be further investigated.
For this study sire/dam/offspring trios of the Duroc, Hampshire, Landrace and Yorkshire breeds were selected from the National Swine Registry (NSR) pedigree. Selected parents were unrelated for at least two generations. All animals were genotyped using the Illumina PorcineSNP60 (Number of markers M = 62,163) Genotyping BeadChip (Illumina Inc.)  at a commercial laboratory (GeneSeek, a Neogen Company, Lincoln, NE). All SNP showing Mendelian inconsistencies for a trio were set missing in that particular trio. If one or more animals within a trio had missing genotypes in more than 10% of the SNP that trio was eliminated from further analysis. Similarly, SNP were removed if they did not have genotypes available for at least 90% of the samples across all breeds (M CallRate < 0.9 = 5080). Only autosomal SNP were considered in this study, leading to the exclusion of all SNP with an uncertain map position on build 10 of the pig genome sequence, as well as SNP on the sex chromosomes (M non-autosomal = 9308). To exclude non-segregating SNP from the analysis we removed markers with minor allele frequency (MAF) below 5% within each breed separately. The number of fixed SNP varied substantially between breeds: we excluded M MAF < 5% = 13,646 SNP in Duroc, M MAF < 5% = 15,405 SNP in Hampshire, M MAF < 5% = 7,631 SNP in Landrace and, M MAF < 5% = 8,665 SNP in Yorkshire. Additionally, SNP were excluded for failure to meet Hardy Weinberg Equilibrium (p < 0.001) within breeds causing M HWE < 0.001 = 117, 85, 146, and 176 SNP to be discarded in Duroc, Hampshire, Landrace, and Yorkshire respectively. After applying the described filtering criteria, a total of 30, 26, 29, and 32 trios were included for the Duroc, Hampshire, Landrace and Yorkshire breeds, respectively. And a total of 34,129, 32,370, 40,144 and 39,110 SNP spaced at an average distance of 70, 74, 60 and 61 kb satisfied the SNP selection criteria for Duroc, Hampshire, Landrace and Yorkshire, respectively.
Haplotypes were obtained for the founder animals using the trio option of BEAGLE , phasing the genotypes by chromosome. Sampling animals in trios was shown to yield improved accuracy of estimated haplotypes . To further increase haplotype accuracy, BEAGLE was set to run 100 iterations of the phasing algorithm and sample 100 haplotype pairs for each individual per iteration. Additionally, a short simulation experiment was conducted showing that for MAF above 5% average r 2 can be reliably estimated from the current sample size (results not shown). Alleles for each SNP were re-coded as 0/1, keeping the reference allele constant across all four populations, allowing for later determination of phase agreement. Haplotypes and code needed to reproduce these results are publicly available at https://www.msu.edu/~steibelj/JP_files/LD_estimate.html.
where is the average of all adjacent SNP across 18 autosomes (l), with M l SNP per chromosome. To estimate average r 2 between adjacent markers for different marker densities a certain percentage of markers (50%, 75%, 90%, 95%, 99%, and 99.5%) were removed before average r 2 was estimated using equation (2). To select markers, an increasing proportion of SNP were sequentially removed solely considering their map position, so that for instance: to reduce a panel to 50%, every second marker was kept for analysis, for 25% every fourth was kept and so on.
where R k, k' is the correlation of phase between r ij(k) in population k and r ij(k') in population k', s (k) and s (k') are the standard deviation of r ij(k) and r ij(k') respectively, and are the average r ij across all SNP i and j within interval p for population k and k' respectively.
This project was supported by Agriculture and Food Research Initiative Competitive Grant no. 2010-65205-20342 from the USDA National Institute of Food and Agriculture, and by funding from the National Swine Registry. Technical assistance of Whitney Hosier and Nancy Raney is appreciated. Computer resources and programming advice were provided by the Michigan State University High Performance Computing Center (HPCC).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.