Genome position specific priors for genomic prediction
© Brøndum et al.; licensee BioMed Central Ltd. 2012
Received: 27 April 2012
Accepted: 5 October 2012
Published: 10 October 2012
The accuracy of genomic prediction is highly dependent on the size of the reference population. For small populations, including information from other populations could improve this accuracy. The usual strategy is to pool data from different populations; however, this has not proven as successful as hoped for with distantly related breeds. BayesRS is a novel approach to share information across populations for genomic predictions. The approach allows information to be captured even where the phase between SNP alleles and causative mutation alleles is reversed across populations, or where the actual causative mutation differs between the populations but affects the same gene. Proportions of a four-distribution mixture for SNP effects in segments of fixed size along the genome are derived from one population and set as location specific prior proportions of distributions of SNP effects for the target population. The model was tested using dairy cattle populations of different breeds: 540 Australian Jersey bulls, 2297 Australian Holstein bulls and 5214 Nordic Holstein bulls. The traits studied were protein, fat and milk yield. Genotypic data were Illumina 777K SNPs, real or imputed.
Results showed an increase in accuracy of up to 3.5% for the Jersey population when using BayesRS with a prior derived from Australian Holstein compared to a model without location specific priors. The increase in accuracy was, however, lower than was achieved when reference populations were combined to estimate SNP effects, except in the case of fat yield. The small size of the Jersey validation set meant that these improvements in accuracy were not significant using a Hotelling-Williams t-test at the 5% level. An increase in accuracy of 1-2% for all traits was observed in the Australian Holstein population when using a prior derived from the Nordic Holstein population compared to using no prior information. These improvements were significant (P<0.05) using the Hotelling-Williams t-test for protein and fat yield.
For some traits the method might be advantageous compared to pooling of reference data for distantly related populations, but further investigation is needed to confirm the results. For closely related populations the method does not perform better than pooling reference data. However, it does give an increased accuracy compared to analysis based on only one reference population, without an increased computational burden. The approach described here provides a general setup for inclusion of location specific priors: the approach could be used to include biological information in genomic predictions.
Keywords: Genomic prediction, Combined populations, Genomic location, Bayesian prediction
Genomic predictions are now widely used in dairy cattle breeding, and have been proposed for breeding of crops and prediction of disease risk in humans [1, 2]. The accuracy of genomic estimated breeding values depends on a number of factors, of which the size of the reference population used to estimate the marker effects is critical. In dairy cattle, the reference population usually consists of progeny-tested bulls. In small populations, such as Australian Jersey, the number of progeny-tested bulls available for the reference is limited. For genetically related populations, such as the European Holstein populations and Nordic Red Cattle populations, previous studies show large benefits from pooling reference populations [4, 5], but for more distantly related populations (e.g., Holstein and Jersey) this approach does not increase the accuracy to the same extent [6, 7]. Previous studies based on Single Nucleotide Polymorphism (SNP) markers from the Illumina 50K SNP chip have reported that distances between markers would be too large for high persistence of linkage disequilibrium (LD) phase across breeds, and accuracies of across breed prediction were zero [9, 10]. With the new Illumina 777K chip, it is expected that distances between markers are small enough for successful genomic prediction using combined reference data from different dairy cattle populations, as the Quantitative Trait Loci (QTL)-SNP phase in such high density markers would be well preserved across breeds. However, a recent study demonstrated only limited support for this hypothesis, with relatively small gains resulting from pooling Australian Holstein and Jersey data to improve the accuracy of Jersey genomic predictions. This suggests that there are still differences in the patterns of LD between single markers and actual QTL across breeds, and thus pooling the data might dilute associations of markers with phenotypic traits.
In this study, we explore an alternative approach to pooling data across breeds. Previous studies have shown that some parts of the genome explain more variation than others. Assuming that the same causative mutations, or even the same gene regions but different causative mutations, act on traits of interest in different populations, it is expected that effects of chromosome regions on a trait could be consistent among populations, though the LD patterns between individual SNPs and QTLs could differ from one population to the other. At the extreme it was demonstrated that there was considerable overlap in gene regions affecting stature in humans and cattle . The aim of this study was to first map the variation explained by small segments of the bovine genome for production traits in three dairy cattle populations, and compare this variation across the populations. In particular, we explored the effect of segment size on the correlation of the variances across populations. Subsequently this information was used as genomic location specific priors in a new method for predicting genomic estimated breeding values. Developing a model with location specific prior information will also allow for differentiation between e.g. coding and non-coding regions of the genome, or other kinds of biological information.
Genotypic data were a mixture of true and imputed SNP markers from the Illumina 777K SNP chip. For HOL-AUS there were 843 Holstein heifers genotyped on the 777K SNP chip as well as 93 key ancestor bulls. For JER-AUS 93 key ancestor bulls were genotyped on the 777K SNP chip. Quality control steps included removing SNPs with very low minor allele frequencies, ambiguous or undefined map positions, and no heterozygote genotypes. For full details see . The animals genotyped at high density were used as reference to impute the high density genotypes for the remaining 2204 Holstein and 447 Jersey bulls, which were genotyped with the 50K chip.
For HOL-NOR 557 bulls from the EuroGenomics project  were genotyped using the 777K chip and these bulls were used as reference to impute the 777K markers for the bulls genotyped with the Illumina 50K SNP chip. After imputation, the LD of each marker with the previous one in the assembly was inspected. If two adjacent markers were in complete LD, one of the markers was deleted, so that r^2 of any pair of adjacent markers was less than one. The marker data were further edited by deleting markers with a minor allele frequency less than 0.01.
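The two marker editing steps described for HOL-NOR can be sketched as follows. This is only an illustration under several assumptions that the text does not state: genotypes are coded 0/1/2 and held in map order, r^2 is computed as the squared correlation of genotype dosages, each marker is compared with the previous retained marker, and the later marker of a pair in complete LD is the one dropped. The function names are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors (None if either is constant)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # monomorphic marker: LD is undefined
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from 0/1/2 allele counts."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def edit_markers(markers, maf_threshold=0.01, tol=1e-12):
    """Return indices of markers kept after LD pruning and MAF filtering.

    markers: list of per-marker genotype lists (0/1/2), in map order.
    """
    kept = []
    for i, genotypes in enumerate(markers):
        if kept:
            r2 = r_squared(markers[kept[-1]], genotypes)
            if r2 is not None and r2 >= 1 - tol:
                continue  # complete LD with the previous retained marker
        kept.append(i)
    # Second step: drop markers with a minor allele frequency below the threshold.
    return [i for i in kept if minor_allele_frequency(markers[i]) >= maf_threshold]
```

With five toy markers where the second duplicates the first, the third is its allele-swapped copy and the fifth is monomorphic, only the first and fourth markers would be retained.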
Imputation was done using Beagle  in all three populations. Since the purpose was to compare segments across populations and use this information for genomic prediction, the SNP datasets were further edited to only keep the markers that were in common across the populations. After data editing 465,542 markers remained for analysis.
Each of the datasets was split into a reference and validation set (Table 1) to allow for cross validation of the accuracy of DGV. In HOL-NOR the bulls were separated by birth date before or after 2001-10-01, and in JER-AUS and HOL-AUS the bulls were split by onset of progeny test before or after 2007. In both cases the younger animals were assigned to the validation set. This cross validation strategy was chosen as the resulting accuracy is the most meaningful in the context in which the genomic predictions will be used: in the dairy industry. Here reference sets of older bulls are used to predict the DGV for young bulls, which are then selected for use based on these DGV. In the Australian data, all of the available animals were used to estimate segment variances in order to maximize the amount of information. In the Nordic dataset only the reference set was used to estimate segment variances.
All genotypic and phenotypic data was obtained from pre-existing routine genetic evaluation data for the dairy cattle populations and required no ethical approval.
Estimation of genetic variances explained by different segments
Where A is the additive relationship matrix, σ_a^2 is the variance of residual polygenic effects, and r_y^2 is the reliability of DRP/DTD. The four-distribution mixture chosen for the SNP effects does not reflect any biological hypothesis, but was chosen to allow for easier mixing between SNPs with no effect and SNPs with effects of different sizes. The Dirichlet prior on the proportions of different SNP variances, with all parameters set to one, is actually a uniform prior, but specifying it in this manner reflects the fact that the posterior distribution on the proportions follows a Dirichlet distribution with a pseudo count of 1 from each of the four distributions. The prior is not uninformative in any statistical sense since it states that all distributions have equal probabilities, but it adds very little information compared to the posterior, as the data gives information on almost half a million counts, and the prior only adds 4, see  for more detail.
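The role of the pseudo counts can be illustrated with a minimal sketch of the conjugate update: given the number of SNPs currently assigned to each of the four distributions, the proportions are drawn from a Dirichlet distribution whose parameters are those counts plus the prior parameters. The function names and the class counts below are hypothetical, and the Dirichlet draw is built from gamma variates using only the Python standard library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalised gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mixture_proportions(class_counts, prior=(1.0, 1.0, 1.0, 1.0), rng=None):
    """Sample mixture proportions from the Dirichlet posterior.

    class_counts: number of SNPs currently assigned to each distribution.
    prior: Dirichlet parameters; all ones gives the uniform prior.
    """
    rng = rng or random.Random()
    posterior_alpha = [c + a for c, a in zip(class_counts, prior)]
    return sample_dirichlet(posterior_alpha, rng)


# Hypothetical assignment of 465,542 SNPs to the four distributions.
counts = [440000, 20000, 5000, 542]
proportions = sample_mixture_proportions(counts, rng=random.Random(1))
```

With counts of this size the four pseudo counts change the posterior parameters by well under one part in a hundred for every class, which is the sense in which the prior adds very little information.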
Where W_s is the sub-matrix of W corresponding to the SNPs in segment s, and g_s is the vector of estimated SNP marker effects for the same segment, i.e. the segment variance is the variance across individuals of the partial direct genetic values (DGVs, marker only estimated breeding values) belonging to segment s. Variances of the partial DGVs for all segments were calculated at each iteration in the Gibbs sampler, and the estimated segment variances were obtained as the posterior means. Segment variances were estimated for segment sizes of 10, 25, 50, 100, 250, 500, 1000, 2000 or 3000 SNPs and for entire chromosomes. The approach is similar to  where sliding windows of five consecutive SNPs are used to estimate the genetic variance of chromosomal regions. In our approach, however, the windows do not overlap.
Posterior means of the parameters were obtained using a Gibbs sampler run for 20,000 iterations with a burn-in of 10,000 in the Holstein populations. For the Jersey population results were not consistent with only 20,000 iterations, so a chain length of 100,000 with a burn-in of 50,000 was used instead. The relatively poor mixing properties of the Gibbs sampler for the Jersey data could be due to the small size of the reference population. Lengths of the chains were based on preliminary runs and comparisons of the obtained segment variances. With 20,000 iterations the Holsteins showed a mean pairwise correlation between segment variances from 10 consecutive runs of 0.95, whereas the Jerseys showed a mean correlation between segment variances from 10 consecutive runs of 0.80. Increasing the number of iterations for the Jerseys to 100,000 increased the mean correlation of segment variances between consecutive runs to 0.96.
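As a rough sketch of the calculation for a single Gibbs iteration, the partial DGV of each animal can be formed for every non-overlapping window of SNPs and its variance taken across animals. The function name is hypothetical, and two details are assumptions rather than statements from the text: the population (divide by n) variance is used, and a final shorter window is kept when the number of SNPs is not a multiple of the segment size.

```python
from statistics import pvariance


def segment_variances(W, g, segment_size):
    """Variance across animals of the partial DGV for each non-overlapping segment.

    W: genotype matrix as a list of rows (one row per animal).
    g: list of SNP effects sampled in the current iteration.
    """
    n_snps = len(g)
    variances = []
    for start in range(0, n_snps, segment_size):
        stop = min(start + segment_size, n_snps)
        # Partial DGV: contribution of the SNPs in this segment only.
        partial_dgv = [sum(row[j] * g[j] for j in range(start, stop)) for row in W]
        variances.append(pvariance(partial_dgv))
    return variances
```

Averaging the values returned at each post burn-in iteration would then give the posterior mean segment variances.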
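The consistency check between repeated runs can be sketched as below. The text does not say whether every pair of the 10 runs was compared or only neighbouring runs; this sketch assumes all pairs, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(runs):
    """Mean correlation over all pairs of runs.

    runs: list of runs, each a list of segment variances in the same segment order.
    """
    return mean(pearson(a, b) for a, b in combinations(runs, 2))
```

A value close to one indicates that independent chains recover the same pattern of segment variances, which is the criterion used above to choose the chain length.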
Prediction using location specific prior information
Here π_s is the probability vector for the four SNP effect distributions in segment s, and α_s is the vector of prior parameters for the Dirichlet distribution in segment s. The model is similar to the original BayesR model, with the modification that the probability to sample SNPs from the four different distributions now depends on the segment. By setting the location specific information via the Dirichlet prior, instead of using constant proportions, the model estimates the proportions using both the data and the prior information. As this is a BayesR by segment approach, the model will be referred to as BayesRS.
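How a segment specific prior combines with the data can be sketched as follows. The exact construction of α_s is not recoverable from the text here, so this sketch assumes that α_s is a set of pseudo counts equal to a scaling weight times the proportions estimated in the other population times the number of SNPs in the segment, with a small floor so that every parameter stays positive. All names are hypothetical.

```python
import random


def segment_prior(donor_proportions, n_snps, weight=1.0, floor=1e-3):
    """Hypothetical alpha_s: donor proportions turned into weighted pseudo counts."""
    return [max(weight * p * n_snps, floor) for p in donor_proportions]


def posterior_mean_proportions(target_counts, alpha_s):
    """Posterior mean of pi_s given class counts in the target population."""
    posterior = [c + a for c, a in zip(target_counts, alpha_s)]
    total = sum(posterior)
    return [a / total for a in posterior]


def sample_segment_proportions(target_counts, alpha_s, rng):
    """One Gibbs draw of pi_s from Dirichlet(alpha_s + counts)."""
    posterior = [c + a for c, a in zip(target_counts, alpha_s)]
    draws = [rng.gammavariate(a, 1.0) for a in posterior]
    total = sum(draws)
    return [d / total for d in draws]


# Toy segment of 100 SNPs; donor proportions and target counts are made up.
alpha_s = segment_prior([0.7, 0.2, 0.05, 0.05], n_snps=100)
target_counts = [95, 5, 0, 0]
informed = posterior_mean_proportions(target_counts, alpha_s)
flat = posterior_mean_proportions(target_counts, [1.0, 1.0, 1.0, 1.0])
draw = sample_segment_proportions(target_counts, alpha_s, random.Random(7))
```

In this toy segment the informed prior raises the posterior probability of the largest effect class relative to the flat prior, even though no SNP in the target data is currently assigned to that class. The weight controls how strongly the other population influences the result.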
JER-AUS with prior information from HOL-AUS.
HOL-AUS with prior information from HOL-NOR.
HOL-AUS (random) with prior information from HOL-NOR.
HOL-AUS (random) is a random subset of 500 animals from the HOL-AUS reference population, which was generated to test the hypothesis that the advantage of the BayesRS model would be greater in smaller populations. The second and third setups were tested using the same validation animals.
Validation of DGV accuracy
Where w'_k is the row of W belonging to animal k. Accuracies of the DGV were calculated as r(DGV, DTD) and validated in HOL-AUS and JER-AUS. Differences in accuracy between BayesR and BayesRS were tested for significance using a Hotelling-Williams t-test, which takes account of the number of individuals in the validation set.
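For readers unfamiliar with the test, a minimal sketch is given below. The formula is not stated in the text; the sketch assumes the commonly used form of Williams' statistic for comparing two dependent correlations that share one variable, here the correlations of two sets of DGV with the same DTD. The function and argument names are hypothetical.

```python
from math import sqrt


def hotelling_williams_t(r_a, r_b, r_ab, n):
    """Williams' t statistic for two dependent correlations sharing one variable.

    r_a:  correlation between DGV from method A and DTD.
    r_b:  correlation between DGV from method B and DTD.
    r_ab: correlation between the DGV of the two methods.
    n:    number of validation animals.
    Returns (t, degrees_of_freedom); t is referred to a t distribution with n - 3 df.
    """
    # Determinant of the 3 x 3 correlation matrix of the three variables.
    det = 1 - r_a ** 2 - r_b ** 2 - r_ab ** 2 + 2 * r_a * r_b * r_ab
    r_bar = (r_a + r_b) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar ** 2 * (1 - r_ab) ** 3
    t = (r_a - r_b) * sqrt((n - 1) * (1 + r_ab) / denom)
    return t, n - 3
```

For a fixed difference between the two accuracies the statistic grows as the correlation between the two sets of DGV increases, so closely agreeing methods can differ significantly even when the difference in accuracy is small.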
Results and discussion
Distribution of SNP effects and proportion of expected total marker variance for each class
Top 10 segments ranked by proportion of variance
For JER-AUS no gain in accuracy was observed for milk yield when using prior information from HOL-AUS, for protein yield a small gain of around 1% was seen for the smallest segment size, and for fat yield gains in accuracy of up to 3.5% were seen when using the genomic location specific prior information compared to using BayesR. Compared with accuracies obtained with a simple pooling of reference data, the BayesRS approach led to an extra gain of up to 1.5% for fat yield, but not for the other two traits. Although differences in accuracy were seen, none of the differences were significant at the 5% level, reflecting the small size of the validation population.
In all three scenarios the highest gains in accuracy are found for a segment size of 100 markers, implying that using smaller segments gives a stronger advantage from the location specific priors. Furthermore, significant results are only found in two cases: the largest and smallest segments. For the largest segment size of 3000 markers, it is surprising that the increase in accuracy is significant although larger gains in accuracy are seen for smaller segments. However, this could be an artifact of the significance test chosen. With a large segment size the added information becomes very unspecific, which could lead to results more similar to those obtained from the regular BayesR method. With a high correlation between DGVs from the two methods, the Hotelling-Williams t-test would cause even small differences in accuracy to be significant.
The different scaling factors (weights) applied to the parameters in the Dirichlet priors seem to make little or no difference to the accuracy of the BayesRS model, which suggests that the differences in accuracy obtained with BayesRS could be random fluctuations. This is in many, but not all, cases supported by the lack of significance of the results.
To summarize, BayesRS gave accuracies comparable to, but not always higher than or significantly different from, a simple pooling of the data. For closely related populations pooling is expected to be superior. So a simple pooled multi-breed or multi-population reference could be a better approach in some cases, but not necessarily for all traits. For example, the BayesRS approach gave higher accuracies than a pooled reference for fat yield in JER-AUS. Further studies are needed to confirm the validity of the results in a larger validation population.
One advantage of the method presented here is a large reduction in computational demand compared with pooling the reference data. Since the BayesRS model only uses very condensed information from the other population, the increase in memory demand is negligible, and the extra complexity of the model only slightly increases the CPU run time. For JER-AUS, running the BayesR model for 100,000 iterations required 33 hours, whereas the BayesRS model could be run for the same number of iterations in 39 hours. When using BayesR with the combined JER-AUS and HOL-AUS reference data, 100,000 iterations take about 150 hours, and the memory requirements more than quadruple.
Although the accuracies obtained using BayesRS in most cases cannot compete with pooling of the data, the results seem consistently better than when using only data from the target population and a non-informative prior, for example only the JER-AUS data. In cases where the extra data itself is not available, the BayesRS model or a similar approach could improve the accuracy of genomic predictions using only summary statistics. This might apply when intellectual property issues prevent sharing of the raw data, but allow use of summary statistics as in this study. The approach could also be useful for meta-analysis of many data sets from different sources.
The model presented here would also allow the use of other prior information such as known QTL or expression pathways, by assigning a higher prior probability to sample large effects in the involved genomic regions. In this study segments were chosen arbitrarily with a fixed length, but another approach could be to define coding and non-coding regions of the genome as different segments and set different Dirichlet priors. A challenge here would, however, be how to choose the counts in the Dirichlet prior without sampling them from a different population. Previous results show that SNPs near genes found in both human and bovine genomes are significantly associated with stature. By considering evolutionarily conserved regions as segments, the method using external information sources presented in this study could be used for genomic prediction across species for traits of common interest, such as growth in meat-production animals or production traits in dairy species.
Our results suggest that genomic location specific priors in BayesRS improve the accuracy of genomic prediction, when the priors are derived from another population. However, the higher accuracies were only found to be significantly better than a competing alternative without location specific priors in a few cases. This might be a result of the limited number of animals used in the validation sets, so further investigation is needed to confirm the validity of the method.
Results also show that some highly variable segments coincide with known genes and QTLs, suggesting that using actual biological information could be beneficial for the accuracy of genomic predictions. Finally the BayesRS setup might offer a possibility for higher accuracies of genomic predictions in cases with limited computer resources or issues with data sharing.
We thank the Danish Cattle Federation (Aarhus, Denmark), Faba Co-op (Helsinki, Finland), Swedish Dairy Association (Stockholm, Sweden), and Nordic Cattle Genetic Evaluation (Aarhus, Denmark) for providing data. This work was performed in the project “Genomic Selection—From function to efficient utilization in cattle breeding (grant no. 3405-10-0137)”, funded under Green Development and Demonstration Programme by the Danish Directorate for Food, Fisheries and Agri Business (Copenhagen, Denmark), the Milk Levy Fund (Aarhus, Denmark), VikingGenetics (Randers, Denmark), Nordic Cattle Genetic Evaluation (Aarhus, Denmark), and Aarhus University (Aarhus, Denmark).
- Wray NR, Goddard ME, Visscher PM: Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007, 17: 1520-1528. 10.1101/gr.6665407.
- Riedelsheimer C, Czedik-Eysenberg A, Grieder C, Lisec J, Technow F, Sulpice R, Altmann T, Stitt M, Willmitzer L, Melchinger AE: Genomic and metabolic prediction of complex heterotic traits in hybrid maize. Nat Genet. 2012, 44: 217-220. 10.1038/ng.1033.
- Goddard ME: Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2008, 136: 245-257.
- Lund MS, Roos APWD, Vries AGD, Druet T, Ducrocq V, Fritz S, Guillaume F, Guldbrandtsen B, Liu Z, Reents R, Schrooten C, Seefried F, Su G: A common reference population from four European Holstein populations increases reliability of genomic predictions. GSE. 2011, 43: 43. 10.1186/1297-9686-43-43.
- Brøndum RF, Rius-Vilarrasa E, Stranden I, Su G, Guldbrandtsen B, Fikse WF, Lund MS: Reliabilities of genomic prediction using combined reference data of the Nordic Red dairy cattle populations. J Dairy Sci. 2011, 94: 4700-4707. 10.3168/jds.2010-3765.
- Hayes BJ, Bowman PJ, Chamberlain AJ, Verbyla K, Goddard ME: Accuracy of genomic breeding values in a multi-breed dairy cattle population. Genet Sel Evol. 2009, 41: 51. 10.1186/1297-9686-41-51.
- Pryce JE, Gredler B, Bolormaa S, Bowman PJ, Egger-Danner C, Fuerst C, Emmerling R, Solkner J, Goddard ME, Hayes BJ: Short communication: Genomic selection using a multi-breed, across-country reference population. J Dairy Sci. 2011, 94: 2625-2630. 10.3168/jds.2010-3719.
- Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O’Connell J, Moore SS, Smith TPL, Sonstegard TS, Van Tassell CP: Development and characterization of a high density SNP genotyping assay for cattle. PloS one. 2009, 4: 5350. 10.1371/journal.pone.0005350.
- de Roos APW, Hayes BJ, Spelman RJ, Goddard ME: Linkage Disequilibrium and Persistence of Phase in Holstein-Friesian, Jersey and Angus Cattle. Genetics. 2010, 179: 1503-1512.
- de Roos APW, Hayes BJ, Goddard ME: Reliability of genomic predictions across multiple populations. Genetics. 2009, 183: 1545-1553. 10.1534/genetics.109.104935.
- Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME: Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012, 95: 4114-4129. 10.3168/jds.2011-5019.
- Pryce JE, Hayes B, Bolormaa S, Goddard ME: Polymorphic regions affecting human height also control stature in cattle. Genetics. 2011, 187: 981-984. 10.1534/genetics.110.123943.
- Browning BL, Browning SR: A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009, 84: 210-223. 10.1016/j.ajhg.2009.01.005.
- Kizilkaya K, Tait RG, Garrick DJ, Fernando R, Reecy JM: Whole genome analysis of infectious bovine keratoconjunctivitis in Angus cattle using Bayesian threshold models. BMC Proc. 2011, 5 (Suppl 4): S22. 10.1186/1753-6561-5-S4-S22. Paris, France.
- Dunn OJ, Clark V: Comparison of tests of the equality of dependent correlation coefficients. J Am Stat Assoc. 1971, 66: 904-908. 10.1080/01621459.1971.10482369.
- Grisart B, Coppieters W, Farnir F, Karim L, Ford C, Berzi P, Cambisano N, Mni M, Reid S, Simon P, et al: Positional candidate cloning of a QTL in dairy cattle: Identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition. Genome Res. 2002, 12: 222-231. 10.1101/gr.224202.
- Pryce JE, Bolormaa S, Chamberlain AJ, Bowman PJ, Savin K, Goddard ME, Hayes BJ: A validated genome-wide association study in 2 dairy cattle breeds for milk production and fertility traits using variable length haplotypes. J Dairy Sci. 2010, 93: 3331-3345. 10.3168/jds.2009-2893.
- Blott S, Kim JJ, Moisio S, Schmidt-Kuntzel A, Cornet A, Berzi P, Cambisano N, Ford C, Grisart B, Johnson D, et al: Molecular dissection of a quantitative trait locus: A phenylalanine-to-tyrosine substitution in the transmembrane domain of the bovine growth hormone receptor is associated with a major effect on milk yield and composition. Genetics. 2003, 163: 253-266.
- Chamberlain AJ, Hayes BJ, Savin K, Bolormaa S, McPartlan HC, Bowman PJ, Van der Jagt C, MacEachern S, Goddard ME: Validation of single nucleotide polymorphisms associated with milk production traits in dairy cattle. J Dairy Sci. 2012, 95: 864-875. 10.3168/jds.2010-3786.
- Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, de Andrade M, Feenstra B, Feingold E, Hayes MG, et al: Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet. 2011, 43: 519-U44. 10.1038/ng.823.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.