Data
The datasets used in the present study included 540 Australian Jersey bulls (JER-AUS), 2297 Australian Holstein bulls (HOL-AUS) and5214 Nordic (Danish, Swedish or Finnish) Holstein bulls (HOL-NOR) (Table 1). Phenotypic data was daughter trait deviations (DTD) for the Australian bulls and deregressed proofs (DRP) for the Nordic bulls. DTDs were given in kilograms, whereas the DRPs are standardized indices. The traits selected for the study were: protein yield, fat yield and milk yield, as these traits have the most data in the populations.
Genotypic data was a mixture of true and imputed SNP markers from the Illumina 777K SNP chip. For HOL-AUS there were 843 Holstein heifers genotyped on the 777K SNP chip as well as 93 key ancestor bulls. For JER-AUS 93 key ancestor bulls were genotyped for the 777K SNP chip. Quality control steps included removing SNPs with very low minor allele frequencies, ambiguous or undefined map positions, and no heterozygote genotypes. For full details see [11]. These animals were used as reference to impute the high density genotypes for the remaining 2204 Holstein and 447 Jersey bulls which were genotyped with the 50k chip.
For HOL-NOR 557 bulls from the EuroGenomics project [4] were genotyped using the 777K chip and these bulls were used as reference to impute the 777K markers for the bulls genotyped with the Illumina 50K SNP chip. After imputation LD of each marker with the previous one in the assembly was inspected. If two adjacent markers were in complete LD one of the markers was deleted, so that r2 of any pair of adjacent markers was less than one. The marker data was further edited by deleting markers with a minor allele frequency less than 0.01.
Imputation was done using Beagle [13] in all three populations. Since the purpose was to compare segments across populations and use this information for genomic prediction, the SNP datasets were further edited to only keep the markers that were in common across the populations. After data editing 465,542 markers remained for analysis.
Each of the datasets was split into a reference and validation set (Table 1) to allow for cross validation of the accuracy of DGV. In HOL-NOR the bulls were separated by birth date before or after 2001-10-01, and in JER-AUS and HOL-AUS the bulls were split by onset of progeny test before or after 2007. In both cases the younger animals were assigned to the validation set. This cross validation strategy was chosen as the resulting accuracy is the most meaningful in the context in which the genomic predictions will be used: in the dairy industry. Here reference sets of older bulls are used to predict the DGV for young bulls which are then selected for use based on these DGV. In the Australian data all of the available data was used to estimate segment variances to maximize the data. In the Nordic dataset only the reference set was used to estimate segment variances.
All genotypic and phenotypic data was obtained from pre-existing routine genetic evaluation data for the dairy cattle populations and required no ethical approval.
Estimation of genetic variances explained by different segments
Genetic variance attributed to each segment was estimated from effects of the SNPs in the segment. Prediction of single SNP marker effects was carried out using BayesR [11]. The statistical model was:
(1)
Where y is the vector of DTD or DRP, μ is the mean, a is the vector of residual polygenic effects, e is the vector of the residual errors, W is the incidence matrix of SNP genotype coefficients, and g is the vector of SNP allele substitution effects. Let X be a genotype matrix coded as 0/1/2, for respectively homozygote first allele, heterozygote and homozygote second allele, and let pj be the frequency of the second allele at locus j. Assuming Hardy-Weinberg equilibrium, entries of W are then obtained by centering and standardizing entries of X to:
(2)
Prior distributions for the parameters are given as:
(3)
(4)
Where A is the additive relationship matrix, σa2 is the variance of residual polygenic effects, and r
y
2 is the reliability of DRP/DTD. The four-distribution mixture chosen for the SNP effects, does not reflect any biological hypothesis, but was chosen to allow for easier mixing between SNPs with no effect and SNPs with effects of different sizes. The Dirichlet prior on the proportions of different SNP variances with all parameters set to one, is actually a uniform prior, but specifying it in this manner reflects the fact that the posterior distribution on the proportions follows a Dirichlet distribution with a pseudo count of 1 from each of the four distributions. The prior is not uninformative in any statistical sense since it states that all distributions have equal probabilities, but it adds very little information compared to the posterior, as the data gives information on almost half a million counts, and the prior only adds 4, see [11] for more detail.
To estimate the variance explained by each small chromosomal region, the entire set of SNPs was divided into segments of a fixed length (e.g. 100 markers each).The variance explained by segment s was calculated as
(5)
Where W
s
is the sub-matrix of W corresponding to the SNPs in segment s, and g
s
is the vector of estimated SNP marker effects for the same segment, i.e. the segment variance is the variance across individuals of the partial direct genetic values (DGVs, marker only estimated breeding values) belonging to segment s. Variances of the partial DGVs for all segments were calculated at each iteration in the Gibbs sampler, and the estimated segment variances were obtained as the posterior means. Segment variances were estimated for segment sizes of 10, 25, 50, 100, 250, 500, 1000, 2000 or 3000 SNPs and for entire chromosomes. The approach is similar to [14] where sliding windows of five consecutive SNPs are used to estimate the genetic variance of chromosomal regions. In our approach the windows are however not overlapping.
Posterior means of the parameters were obtained using a Gibbs sampler run for 20,000 iterations with a burn-in of 10,000 in the Holstein populations. For the Jersey population results were not consistent with only 20,000 iterations, so a chain length of 100,000 with a burn-in of 50,000 was used instead. The relatively poor mixing properties of the Gibbs sampler for the Jersey data could be due to the small size of the reference population. Lengths of the chains were based on preliminary runs and comparisons of the obtained segment variances. With 20.000 iterations the Holsteins showed a mean pairwise correlation between segment variances from 10 consecutive runs of 0.95, whereas the Jerseys showed a mean correlation between segment variances from 10 consecutive runs of 0.80. Increasing the number of iterations for the Jerseys to 100.0000 increased the mean correlation of segment variances between consecutive runs to .96.
Prediction using location specific prior information
The purpose here is to build a Bayesian prediction model that allows for a larger proportion of variation to be explained by certain segments, based on knowledge from previous experiments. One way to do this is to allow different segments to have different prior probabilities assigned to the four SNP effect distributions. Letting S denote the set of segments, the model used here is:
(6)
(7)
(8)
Here π
s
is the probability vector for the four SNP effect distributions in segment s, and αs is the vector of prior parameters for the Dirichlet distribution in segment s. The model is similar to the original BayesR model, with the modification that the probability to sample SNPs from the four different distributions now depends on the segment. By setting the location specific information via the Dirichlet prior, instead of using constant proportions, the model estimates the proportions using both the data and the prior information. As this is a BayesR by segment approach, the model will be referred to as BayesRS.
To test the BayesRS model, posterior means of the number of times in which the indicator variable fell in component i of the mixture were estimated in each segment in one population using BayesR, and subsequently used as the αs parameters in the target population in BayesRS. This was done for segment sizes of 100, 250, 500, 1000, 2000 or 3000 SNPs. Since the sum of counts in the alpha parameters in this setup is equal to the number of markers, this means that the prior on the proportions in the mixture, unlike in the regular BayesR, now has as much weight as the data (much higher weight than in BayesR). To test the impact of the weight of the prior, different scaling factors were tried, i.e. the αs parameters were multiplied by 0.2, 0.4, .0.6, 0.8, 1.0, 1.25 or 1.5. The model was tested in three different scenarios:
-
1.
JER-AUS with prior information from HOL-AUS.
-
2.
HOL-AUS with prior information from HOL-NOR.
-
3.
HOL-AUS (random) with prior information from HOL-NOR.
HOL-AUS (random) is a random subset of 500 animals from the HOL-AUS reference population, which was generated to test the hypothesis that the advantage of the BayesRS model would be greater in smaller populations. The second and third setups were tested using the same validation animals.
Validation of DGV accuracy
DGVs for the validation populations were predicted as
(9)
Where w '
k
is the row of W belonging to animal k. Accuracies of the DGV were calculated as r(DGV, DTD) and validated in HOL-AUS and JER-AUS. Differences in accuracy between BayesR and BayesRS were tested for significance using a Hotelling-Williams t-test, which takes account of the number of individuals in the validation set [15].