A hybrid expectation maximisation and MCMC sampling algorithm to implement Bayesian mixture model based genomic prediction and QTL mapping
 Tingting Wang^{1, 2, 3}Email author,
 YiPing Phoebe Chen^{1},
 Phil J. Bowman^{2, 3, 4},
 Michael E. Goddard^{2, 3, 5} and
 Ben J. Hayes^{2, 3, 6}
Received: 23 March 2016
Accepted: 10 September 2016
Published: 21 September 2016
Abstract
Background
Bayesian mixture models in which the effects of SNP are assumed to come from normal distributions with different variances are attractive for simultaneous genomic prediction and QTL mapping. These models are usually implemented with Monte Carlo Markov Chain (MCMC) sampling, which requires long compute times with large genomic data sets. Here, we present an efficient approach (termed HyB_BR), which is a hybrid of an ExpectationMaximisation algorithm, followed by a limited number of MCMC without the requirement for burnin.
Results
To test prediction accuracy from HyB_BR, dairy cattle and human disease trait data were used. In the dairy cattle data, there were four quantitative traits (milk volume, protein kg, fat% in milk and fertility) measured in 16,214 cattle from two breeds genotyped for 632,002 SNPs. Validation of genomic predictions was in a subset of cattle either from the reference set or in animals from a third breeds that were not in the reference set. In all cases, HyB_BR gave almost identical accuracies to Bayesian mixture models implemented with full MCMC, however computational time was reduced by up to 1/17 of that required by full MCMC. The SNPs with high posterior probability of a nonzero effect were also very similar between full MCMC and HyB_BR, with several known genes affecting milk production in this category, as well as some novel genes. HyB_BR was also applied to seven human diseases with 4890 individuals genotyped for around 300 K SNPs in a case/control design, from the Welcome Trust Case Control Consortium (WTCCC). In this data set, the results demonstrated again that HyB_BR performed as well as Bayesian mixture models with full MCMC for genomic predictions and genetic architecture inference while reducing the computational time from 45 h with full MCMC to 3 h with HyB_BR.
Conclusions
The results for quantitative traits in cattle and disease in humans demonstrate that HyB_BR can perform equally well as Bayesian mixture models implemented with full MCMC in terms of prediction accuracy, but with up to 17 times faster than the full MCMC implementations. The HyB_BR algorithm makes simultaneous genomic prediction, QTL mapping and inference of genetic architecture feasible in large genomic data sets.
Background
Genomic prediction of genetic merit, using SNP markers, is now routinely used in animal and plant breeding to identify superior breeding individuals and so accelerate genetic gain [1–3]. Genomic prediction methodology is also increasingly used in human disease studies for the inference of genetic architecture, the identification of causal mutations (QTL mapping), and prediction of disease risk [4–8].
Genomic predictions are often implemented using linear prediction models, especially best linear unbiased prediction (BLUP) or Genomic BLUP (GBLUP), which assume that all the SNPs contribute small effects to the trait and these effects are derived from a normal distribution [1, 4, 9]. While BLUP based genomic predictions have certainly been successful in increasing genetic gain in livestock and crops [10, 11], this approach does have some limitations. One limitation is that the prediction accuracy does not persist well across multiple generations, because the severe shrinkage in these models results in the effect of causative mutation being “smeared” across many markers encompassing long chromosome segments – in other words a linear combination of effects of a large number of markers is used to capture the effect of a QTL. After several generations, the association between markers and QTL might be broken down by recombination, thereby reducing prediction accuracy. The smearing of effect of causative mutations across many SNP also results in imprecise QTL mapping with BLUP methods.
To address these problems, Bayesian mixture models (nonlinear prediction e.g. Bayes A, B, C, and R) [1, 6, 12–15] assume nonnormal prior distributions of SNP effects. One example of a flexible approach, BayesR [14] defines a mixture model with SNP effects following a mixture of four normal distributions with zero, very small, small and moderate variances. In practice, the prediction accuracy of Bayesian mixture models (including BayesR) has been shown to be equal or superior to that of GBLUP for both human diseases and dairy cattle milk production traits [15–25].
In addition to the prediction of breeding values and future phenotypes using genotype data, Bayesian models (such as BayesR) can be used to map the causal polymorphisms (quantitative trait loci or QTL). For this purpose they have some advantages over traditional single SNP regression, which is commonly used to analyse genome wide association studies (GWAS) [16–24]. Single SNP regression fits one SNP at a time as a fixed effect and tests the significance of the association between the SNP and the trait, while ignoring all other SNPs. To protect against performing multiple tests, stringent Pvalues (P < 5*10^{−8}) are used. This method of analysis has three limitations:1) The effect of those SNPs declared significant is usually overestimated; 2) multiple SNPs in linkage disequilibrium with the same QTL may show an association with the trait leading to imprecision in mapping the QTL; 3) many QTL are not detected at all because no SNP reaches the stringent Pvalue for association with the trait. By comparison, Bayesian mixture models fit all SNPs simultaneously by treating the SNP effects as random effects drawn from a prior distribution. For example, the BayesR model has been implemented for QTL detection in the dairy cattle genome and for human disease traits [15]. The results show that BayesR can increase the power of identifying the known genes in contrast with GBLUP and GWAS.
Even though nonlinear models are attractive, one limitation is that nonlinear models typically require long computational times. Due to the hierarchical estimation of posterior distributions of SNP effects and their variances, nonlinear models have usually been implemented with Markov Chain Monte Carlo (MCMC). This requires a large number of iterations with time per iteration scaled linearly with the number of markers (m) and the number of individuals (n). Genomic data sets are now often very large and are rapidly becoming larger. For human, 300,000 to 9 million SNPs arrays genotyped on up to 253 K individuals [26, 27] are available for association studies and disease/fitness prediction. In dairy cattle, whole genome sequence data including 39 million variants has been published by the 1000 bull genomes project [28]. When confronted with such huge genomic data, Bayesian methods can be so computationally expensive that it is not possible to use them.
Two approaches have been used to develop more computationally efficient algorithms for implementing Bayesian mixture models. One is to modify MCMC with speedup schemes. For example, Moser et al. [8] introduced a “500SNPs” scheme to pick 500 SNPs with nonzero effects to be updated instead of all the SNPs. Such modification schemes could reduce the running time by 3 ~ 6 fold. Calus et al. [29] proposed a righthandside updating algorithm to cluster multiple SNPs (similar to a haplotype) to be updated as one during MCMC iterations. The results on 50 K SNP panels demonstrated up to 90 % reduction in computational time without reducing prediction accuracy. The other approach is to introduce heuristic methods (e.g. iterated conditional expectation, ICE; expectation maximisation, EM) as an alternative to MCMC. There are a wide range of fast versions of Bayesian approaches to genomic prediction using these methods (including fastBayesB, emBayesB, emBayesR) [30–35], which are several orders faster than MCMC implementations. However, none of these algorithms gives consistently as high prediction accuracy as their MCMC counterparts. The EM method of Wang et al. [30], emBayesR, gave higher accuracy than ICE based methods but still had a reduction in accuracy of 5 % ~ 7 % for traits with mutations of moderate to large effect. In other words, the heuristic approximations works best when there are no mutations of moderate to large effect, otherwise accuracy can be compromised. This is undesirable, especially when the largest advantage of the nonlinear Bayesian methods over BLUP is observed when there are mutations of moderate to large effect (where moderate effect can mean a QTL explaining 1 % of the variance if the data set is large)!
Motivated by the deficiency of both MCMC (long computing terms) and fast versions of nonlinear models (lower prediction accuracy with some genetic architectures), we hypothesise that a hybrid scheme, beginning with EM iterations and finishing with MCMC sampling iterations, would give similar prediction accuracy to a full MCMC implementation, while having a significant speed advantage. Here we propose a hybrid algorithm (termed HyB_BR) of ExpectationMaximisation (EM) (emBayesR) and MCMC under the BayesR model. The algorithm also incorporates a speedup scheme where only a proportion of SNP continue to be sampled in MCMC iterations. In comparison with emBayesR [30], the main improvement is that HyB_BR introduces a limited number of MCMC iterations after EM to improve the solutions from emBayesR.
To evaluate the predictive ability and computational efficiency of HyB_BR, prediction accuracy was compared with BayesR and GBLUP in two data sets. The first data set was 800 K SNP genotypes in 16,214 Holstein and Jersey bulls and cows. The prediction accuracy within these breeds and in a third breed (Aussie Red) was evaluated. The results showed that HyB_BR achieved very similar prediction accuracy to BayesR, while reducing the running time by up to 17 fold, and overcoming the limitations of slightly reduced accuracy of emBayesR. As a result of running the algorithm, the posterior probability of each SNP being in the model was derived, and this was used for QTL mapping. The resulting QTL regions were compared between the approaches and with previous literature reports. The results demonstrated that HyB_BR has enough power to detect the major known genes affecting milk production traits in dairy cattle as well as some novel regions. HyB_BR was also evaluated in a second data set  the Welcome Trust Case Control Consortium (WTCCC) human disease data set [27]. The results demonstrated that HyB_BR is a promising method for risk prediction and genetic architecture inference for human disease traits as well.
Methods
The mixture data model
β = vector of p fixed effects, following uninformative priors.
g = vector of m SNP effects. For each SNP, \( {g}_i\sim b\left(i,1\right)\times N\left(0,0*{\sigma}_g^2\right)+b\left(i,2\right)\times N\left(0,0.0001*{\sigma}_g^2\right)+b\left(i,3\right)\times N\left(0,0.001*{\sigma}_g^2\right)+b\left(i,4\right)\times N\left(0,0.01*{\sigma}_g^2\right), \)in which σ _{ g } ^{2} is the genetic variance of the trait and b(i, k) is a scalar with two possible values {0, 1}, determining whether or not the effect of the i ^{ th } SNP is derived from the k ^{ th } distribution.
Pr = vector of probabilities where Pr _{ k } = scalar with the value range between 0 and 1. The parameter Pr defines the proportion of all the SNPs following each of four normal distributions k. Pr _{ k } is assumed to follow a Dirichlet distribution with the parameter α = (1, 1, 1, 1)^{T}.
v = vector of q polygenic effects (breeding values, for the proportion of variance not explained by the SNP), with v ~ N(0, A σ _{ a } ^{2} ), A is the q × q pedigreebased relationship matrix, σ _{ a } ^{2} is the polygenic variance, q is the number of individuals in the pedigree.
e = vector of n residual errors. For cattle data, e ~ N(0, Eσ _{ e } ^{2} ), where E is a n × n diagonal matrix so that the error variance associated with different records can vary. For example, for bulls, the phenotype would be daughter yield deviations, which would have a lower error variance than the trait deviations (TD) of cows [36]. For human data where all phenotypes have the same magnitude of error, E matrix can be replaced by the identity matrix I.
X = n × p design matrix, allocating phenotypes y to fixed effects β.
Z = a n × m genotype matrix with elements \( {\mathbf{z}}_{\mathrm{ij}}=\left({\mathbf{s}}_{\mathrm{ij}}2{p}_i\right)/\sqrt[2]{2{p}_i\left(1{p}_i\right)} \), in which s _{ij} is the genotypes of the j ^{ th } individual for the i ^{ th } SNP (0, 1 or 2 copies of the second allele), and p _{ i } is the allele frequency of each SNP i.
W = n × q design matrix, aims at allocating the q × 1 vector of polygenic effects to y.
Note that model (1) extends the model used by Wang et al. [30] to include fixed effects, polygenic effects and weights.
Where, δ(g_{i}) denotes the Dirac delta function with all probability mass at g_{i} = 0.
The implementation of HyB_BR with the mixture model defined above consists of two components: 1) The ExpectationMaximization module. HyB_BR first implements the EM iterations under the mixture Gaussian model (Eq. 2) to give approximations for the parameter set including SNP effects g, proportion of SNP in each distribution Pr, error variance σ _{ e } ^{2} , and polygenic variance σ _{ a } ^{2} . For the estimation of each SNP effect, the PEV (predicted error variance) correction is introduced to account for the errors which are generated from the estimations of all other SNP effects (detailed in Additional file 1). 2) MCMC module. Once the EM steps are converged, the output values of the parameters are used in the modified MCMC iterations as the start values. For the final step, a MCMC scheme is implemented with a limited number of iterations.
EM module
 i)
Define the log likelihood f(yθ) of the data under the data model (1).
 ii)
Derive the log posterior function of the parameters using Bayes’ theorem. Following Bayes’ theorem, the log posterior distribution of the parameter sets θ is based on the rule logp(θy) ∝ logf(yθ) + logp(θ), with p(θ) the prior for the parameter.
 iii)
Take the expectation on the posterior function over the missing data.
 iv)
Differentiate the expected posterior function and solve for this equal to zero to obtain MAP (Maximum A Posterior) of the parameter set θ.
 1)
Estimation of SNP effects g
We estimate of ĝ _{ i } by the value of g _{ i } that maximises the posterior probability \( \mathrm{P}\left({g}_i\Big\mathbf{y},\widehat{\mathbf{P}}\mathbf{r},\widehat{\upsigma_e^2}\right) \) where \( \widehat{\mathbf{P}}\mathbf{r} \) and \( \widehat{\upsigma_e^2} \) are the MAP estimates of Pr and σ _{ e } ^{2} conditional on y.
To perform this, we first introduce “missing data” (b(i, k), β, v, u _{ i }), and then “integrate them out” via the ExpectationMaximisation steps. In detail, the marginal posterior distribution of each SNP effect g _{ i } can be written as:
\( p\left({g}_i,b\left(i,k\right)\Big\mathbf{y},\boldsymbol{\upbeta}, \mathbf{v},{\mathbf{u}}_{\boldsymbol{i}},\widehat{\upsigma_{\mathrm{e}}^2},\ {\widehat{Pr}}_k\right)\propto p\left(\mathbf{y}\Big{g}_i,b\left(i,k\right),\boldsymbol{\upbeta}, \mathbf{v},{\mathbf{u}}_{\boldsymbol{i}},\widehat{\upsigma_{\mathrm{e}}^2},\ {\widehat{Pr}}_k\right)p\left({g}_i,b\left(i,k\right)\Big{\widehat{Pr}}_k\right). \)
Under the model (3), the first term \( p\left(\mathbf{y}\Big{g}_i,b\left(i,k\right),\boldsymbol{\upbeta}, \mathbf{v},{\mathbf{u}}_{\boldsymbol{i}},\widehat{\upsigma_{\mathrm{e}}^2},\ {\widehat{Pr}}_k\right) \) is obtained according to the following normal distribution:
y − Xβ − Z _{ i } g _{ i } − Wv − u _{ i } ~ N(0, Eσ _{e} ^{2} ),
Where, e* = y − Xβ − Wv − u _{ i }.
Therefore, the term \( p\left(\mathbf{y}\Big{g}_i,b\left(i,k\right),\boldsymbol{\upbeta}, \mathbf{v},{\mathbf{u}}_{\boldsymbol{i}},\widehat{\upsigma_{\mathrm{e}}^2},\ {\widehat{Pr}}_k\right) \) can be written as: \( p\left(\mathbf{y}\Big{g}_i,{\mathbf{u}}_{\boldsymbol{i}},b\left(i,k\right),\boldsymbol{\upbeta}, \mathbf{v},\widehat{\upsigma_{\mathrm{e}}^2},\ {\widehat{Pr}}_k\right)=\frac{1}{{\left(2\pi \widehat{\sigma_e^2}\right)}^{\frac{n}{2}}}\frac{1}{\left\mathbf{E}\right} exp\left[\frac{1}{2\widehat{\sigma_e^2}}\left({\mathbf{e}}^{*}{\mathbf{Z}}_{\mathrm{i}}{g}_i\right)\mathit{\hbox{'}}\;{\mathbf{E}}^{1}\left({\mathbf{e}}^{*}{\mathbf{Z}}_{\mathrm{i}}{g}_i\right)\right] \)
Where, \( P\left(i,k\right)=E\left(b\left(i,k\right)\Big\mathbf{y},\ {\widehat{Pr}}_k\right) \). The term P(i, k) can be derived as in the Additional file 2.
 2)
Estimation of parameter Pr
Where,
 3)
Estimation of fixed effects (β) and the error variance (σ _{ e } ^{2} )
In theory, PEV(e) ≠ PEV(e*) due to e = e* + Z _{i} g _{ i }. However, since each SNP effect is shrunk severely towards zero by GBLUP [4], we approximate PEV(e) ≅ PEV(e*). The calculation of the term PEV(e*) is detailed in the Additional file 1.
 4)
Estimation of polygenic effects (v)
Where, for simplicity, the variance σ _{a} ^{2} . will be fixed as the specified value from GBLUP estimation.
The list of all the estimated parameters including the possibility for each SNP (P(i, k)), the proportion parameter (Pr), each SNP effect (g _{ i }), error variance (σ _{ e } ^{2} ), fixed effect (β), and polygenic effects v and the according equation derived from EM steps
Parameters  The data model  According equations derived from EM 

\( {E}_{e^{*}} logP\left(i,k\right) \)  The expected likelihood parameters for P(i, k)  Equation (S3) 
P(i, k)  SNP effects related parameters under the extended model (3)  Equation (S4) 
Pr  Equation (10)  
g _{ i }  Equation (8)  
σ _{e} ^{2}  The overall model (1)  Equation (12) 
β  Equation (13)  
v  Equation (19) 
Steps for EM module

Step EM_①: Initialise the parameters g, Pr, σ _{ i } ^{2} and Construct X, A, G, E, W matrices. Similar to emBayesR [30], the starting values of g and Pr were set as g = 0.01 and Pr = {0.5, 0.487, 0.01, 0.003}, while σ _{ i } ^{2} = {0, 0.0001 * σ _{g} ^{2} , 0.001 * σ _{g} ^{2} , 0.01 * σ _{g} ^{2} }. The genetic variance σ _{g} ^{2} and polygenic variance σ _{a} ^{2} are obtained from GBLUP. Both variances won’t be updated during EM iterations.
The n × 3 matrix X is a design matrix, allocating the phenotypes to fixed effects. In our case, the matrix X is set up with the first column being the mean, the second and third columns defining the breeds (Holstein or Jersey) and sex (bulls or cows) of the cattle. For example, if the i ^{ th } animal is a Holstein bull, then x _{ i,2} = 1 with x _{ i,3} = 0.
The Pedigree relationship matrix A is built using Henderson’s rules [37]; while the genomic relationship matrix G is constructed using the equation G = ZZ ' /n. Diagonal error matrix E is calculated following Garrick et al. [36], and the index matrix W maps individuals in the pedigree to the phenotypes if they have ones.

Step EM_②: Calculate the PEV matrix under model 1 (Additional file 1). Then using PEV matrix, calculate \( tr\left({\mathbf{E}}^{1}{\mathbf{Z}}_{\mathbf{i}}{\mathbf{Z}}_{\mathbf{i}}^{\mathit{\hbox{'}}}{\mathbf{E}}^{1}\mathbf{P}\mathbf{E}{\mathbf{V}}_{{\mathbf{u}}_1}\left(\mathbf{e}\right)\right) \) which is used in the equation for \( {E}_{e^{*}} logP\left(i,k\right) \) (Additional file 2). In theory, the calculation of this term should be updated each EM iteration, which is time consuming. To avoid huge computational burden, the PEV matrix is treated as constant value for the term \( tr\left({\mathbf{E}}^{1}{\mathbf{Z}}_{\mathbf{i}}{\mathbf{Z}}_{\mathbf{i}}^{\mathit{\hbox{'}}}{\mathbf{E}}^{1}\mathbf{P}\mathbf{E}{\mathbf{V}}_{{\mathbf{u}}_1}\left(\mathbf{e}\right)\right) \) in front of EM loop.
Then for each SNP i (i in 1 to m)

Step EM_③: Correct y for the effects of all other SNPs except current SNP i with equation \( {\mathbf{y}}^{\dagger }=\mathbf{y}{\displaystyle {\sum}_{\mathrm{j}\ne \mathrm{i}}\kern0.5em {\mathbf{Z}}_{\mathbf{j}}}\kern0.5em {\widehat{\mathrm{g}}}_{\mathrm{j}}\mathbf{X}\widehat{\mathbf{b}}\mathbf{W}\widehat{\mathbf{v}}. \)

Step EM_④: Estimate the probability that the effect of SNP i is from one of four normal distributions \( {E}_{e^{*}} logP\left(i,k\right) \) with the equation (S3). After this, P(i, k) is calculated with the equation \( exp\left({E}_{e^{*}} log{P}_{ik}/{\displaystyle {\sum}_{k=1}^4 exp\kern0.5em }\left({E}_{e^{*}} log{P}_{ik}\right)\right) \) (S4).

Step EM_⑤: the SNP effect ĝ_{ i } is updated via Eq. (8).
After effects have been estimated for all SNP,

Step EM_⑥: Estimate σ _{ e } ^{2} with Eq. (12), fixed effects β with Eq. (13), update Pr _{ k } with Eq. (10), and update polygenic effects v with the Eq. (19).

Step EM_⑦: Assess convergence criterion (ĝ ^{ l } − ĝ ^{ l − 1}) ' (ĝ ^{ l } − ĝ ^{ q − 1})/((ĝ ^{ l'} ĝ ^{ l }) ≤ 10^{− 10} with l being the EM iterations number. If not converged, then return to Step ③ for the next EM iteration; otherwise, exit the EM iterations and return the estimates of parameters from the final iterations.
Modified MCMC module with speedup scheme
The outputs of the EM including SNP solutions, polygenic effects, error variance and genetic variance are used as starting values of parameters for the MCMC module, which allows MCMC to begin with no burn in.
The MCMC module of HyB_BR implements the same Gibbs sampling processes as BayesR [15] but modified with one speedup scheme as follows. Over the first 500 iterations, the average probability that the SNP effect is zero (p(i, 1)) is calculated. If p(i, 1) ≥ a, then the SNP effect is set to zero and is not updated in future iterations.

Step MCMC_①: sampling the error variance \( {\widehat{\upsigma}}_{\mathrm{e}}^2 \) according to the distribution \( {\widehat{\upsigma}}_{\mathrm{e}}^2\sim Inv{\upchi}^2\left(n2,\frac{{{\mathbf{y}}^{*}}^{\hbox{'}}{\mathbf{E}}^{1}{\mathbf{y}}^{*}}{n2}\right) \), with \( {\mathbf{y}}^{*}=\left(\mathbf{y}\mathbf{Zg}\mathbf{X}\widehat{\boldsymbol{\upbeta}}\mathbf{W}\widehat{\mathbf{v}}\right). \)

Step MCMC_②: sampling the fixed effects β from the distribution \( N\Big(\ {\boldsymbol{\upbeta}}_{\mu },{\left(\mathbf{X}\mathit{\hbox{'}}\;{\mathbf{E}}^{1}\mathbf{X}\right)}^{1}{\widehat{\upsigma}}_{\mathrm{e}}^2 \)), with \( {\boldsymbol{\upbeta}}_{\mu }={\left(\mathbf{X}\mathit{\hbox{'}}\;{\mathbf{E}}^{1}\mathbf{X}\right)}^{1}\mathbf{X}\mathit{\hbox{'}}\;{\mathbf{E}}^{1}\left(\mathbf{y}\mathbf{Z}\widehat{\mathbf{g}}\mathbf{W}\widehat{\mathbf{v}}\right) \).

Step MCMC_③: Polygenic variance is sampled \( {\widehat{\upsigma}}_{\mathrm{a}}^2\sim Inv{\upchi}^2\left(n2,\frac{\widehat{\mathbf{v}}\mathit{\hbox{'}}\;{\mathbf{A}}^{1}\widehat{\mathbf{v}}}{n2}\right) \).

Step MCMC_④: The polygenic effects are sampled from normal distribution N(μ, s), with the mean \( \mu =\widehat{\mathbf{v}} \) from Eq. (19) and the variance\( s={\left(\mathbf{W}\mathit{\hbox{'}}\;{\mathbf{E}}^{1}\mathbf{W}+{\mathbf{A}}^{1}{\upsigma}_{\mathrm{e}}^2/{\upsigma}_{\mathrm{a}}^2\right)}^{1}. \)
Then for each SNP i (i in 1 to m),

Step MCMC_⑤: Implement the speedup scheme : if (iterations > 500) and (P(i, 1) > 0.9), then stop updating this SNP i.
Else,

Step MCMC_⑥: Estimate the log likelihood that the effect of SNP i is from one of four normal distributions L(g _{ i }σ _{ i } ^{2} [k]). Following the derivation steps of Kemper et al. [15], the optimised equation of the log likelihood function is
\( \begin{array}{c}L\left({g}_i\Big{\upsigma}_i^2\left[k\right]\right)=\frac{1}{2}\left\{ \log \left({\upsigma}_i^2\left[k\right]{\mathbf{Z}}_{\mathbf{i}}^{\boldsymbol{\hbox{'}}}{\mathbf{Z}}_{\mathbf{i}}+{\upsigma}_{\mathrm{e}}^2\right)+{\left(\left({e}^{*}\right)\mathit{\hbox{'}}\;{\mathbf{E}}^{1}{\mathbf{Z}}_{\mathbf{i}}\right)}^2{\upsigma}_i^2\left[k\right]{\upsigma}_{\mathrm{e}}^{2}/\left({\upsigma}_i^2\left[k\right]{\mathbf{Z}}_{\mathbf{i}}^{\boldsymbol{\hbox{'}}}{\mathbf{E}}^{1}{\mathbf{Z}}_{\mathbf{i}}+{\upsigma}_{\mathrm{e}}^2\right)\right\}\\ {}+ log{ \Pr}_{\mathrm{k}},\end{array} \)
with e* = y − Xβ − u − Wv.
After this, P(i, k) is calculated with the equation:\( exp\Big(L\left({g}_i\Big{\upsigma}_i^2\left[k\right]\right)/{\displaystyle {\sum}_{k=1}^4 exp\kern0.5em }\left(L\left({g}_i\Big{\upsigma}_i^2\left[k\right]\right)\right) \)

Step MCMC_⑦: Sample ĝ_{ i } with N(μ, s), \( \mu ={\left[{\mathbf{Z}}_{\mathrm{i}}^{\mathit{\hbox{'}}}{\mathbf{E}}^{1}{\mathbf{Z}}_{\mathrm{i}}+\frac{\widehat{\upsigma_{\mathrm{e}}^2}}{\upsigma_i^2\left[k\right]}\right]}^{1}\left[{\mathbf{Z}}^{\mathit{\hbox{'}}}{\mathbf{E}}^{1}\ {e}^{*}\right] \), and \( s={\left[{\mathbf{Z}}_{\mathrm{i}}^{\mathit{\hbox{'}}}{\mathbf{E}}^{1}{\mathbf{Z}}_{\mathrm{i}}+\frac{\widehat{\upsigma_{\mathrm{e}}^2}}{\upsigma_i^2\left[k\right]}\right]}^{1} \).

Step MCMC_⑧: Update Pr ~ Dirichlet(β_{1} + 1, β_{2} + 1, β_{3} + 1, β_{4} + 1),where β_{1}, β_{2}, …, β_{4} are the number of SNPs in one of four normal distributions.
Return to MCMC step 1.
HyB_BR was written in the C++ programming language.
Data
Cattle
One thousand seven hundred fortyfive Holstein and Jersey cattle and 114 Australian Red bulls were genotyped with the 777 K Illumina HD bovine SNP chip. 15,049 Holstein and Jersey bulls and cows, 249 Australian red bulls and cows were genotyped with the 54 K Illumina Bovine SNP array. After stringent quality control and SNP filtering described in [14], there were 632,003 SNPs remaining for animals genotyped with the 777 K SNP panel, and 43,425 SNPs remaining for animals genotyped with the 54 K SNP array. Animals genotyped with the 43,425 SNPs, were imputed to 632,003 SNP genotypes using Beagle 3.0 [38]. Therefore, the total data set was 17,157 cattle of three breeds with real or imputed genotypes for 632,003 SNP.
The number of individuals in the reference sets and validations sets related to three traits including Milk yield (MilkY), Protein yield (ProtY), Fat Percent (Fat%) and Fertility
Traits  Reference Sets  Validation Sets  

Holstein  Jersey  Holstein Bulls  Jersey Bulls  Australian Red Bulls  
Bulls  Cows  Bulls  Cows  
MilkY/ProtY/Fat%  3049  8478  770  3917  262  105  114 
Fertility  2806  7838  716  3830  396  81  114 
To compare the computational time required by the different genomic prediction methods, we also used three reference sets with increasing different numbers of animals; Ref 1_ CATTLE had 3049 Holstein bulls; Ref 2_CATTLE had 11,527 Holstein bulls and cows, while Ref 3_CATTLE was the complete reference data set with 16,214 animals.
Three input variance parameters related to the reference data sets
Reference Set  Traits  σ _{e} ^{2}  σ _{g} ^{2}  σ _{v} ^{2} 

Holstein and Jersey bulls & cows  Milk yield  133284.0  108619.0  34925.6 
Protein yield  132.579  68.6635  29.1662  
Fat%  0.0180012  0.0575729  0.0127094  
Fertility  3283.80  31.6187  0.000332297 
The correlation between GEBV and DTD in the validation sets was used as a proxy for prediction accuracy. The regression of DTD on GEBV in the validation sets was used to investigate if any of the methods resulted in biased predictions.
Case/Control human disease trait data
The size and genetic architecture of seven combined control/case data sets
Diseases  Number of records  Number of markers  σ _{e} ^{2}  σ _{g} ^{2}  h ^{2} 

BD  4722  292,496  0.070509  0.17156  0.71 
CAD  4864  296,610  0.149782  0.09189  0.38 
CD  4577  301,579  0.073900  0.16056  0.69 
HT  4890  294,404  0.113621  0.12816  0.53 
RA  4704  295,890  0.070900  0.07120  0.50 
T1D  4824  296,228  0.064739  0.12567  0.66 
T2D  4722  294,641  0.099866  0.14497  0.59 
Results
Compute time comparison of HyB_BR and BayesR
The prediction accuracy was evaluated for milk yield with a reference set containing the Holstein and Jersey bulls&cows data.
The accuracy and bias of withbreed prediction of GBLUP, BayesR(BR), emBayesR (EM), and HyB_BR (HB)
Milk yield  Protein yield  Fat%  Fertility  

Acc.  Bias  Acc.  Bias  Acc.  Bias  Acc.  Bias  
Holstein reference to predict Holstein validation  
GBLUP  +Poly^{a}  0.57  0.96  0.63  0.98  0.73  0.96  0.43  1.26 
Poly^{b}  0.56  0.86  0.59  0.87  0.71  1.15  0.42  1.27  
BR  +Poly^{a}  0.63  0.91  0.64  1.01  0.79  1.06  0.43  1.19 
Poly^{b}  0.61  1.00  0.63  1.06  0.77  1.13  0.41  1.19  
EM  +Poly^{a}  0.62  0.79  0.63  0.85  0.77  0.98  0.42  1.15 
Poly^{b}  0.62  0.92  0.62  0.94  0.74  1.06  0.41  1.15  
HB  +Poly^{a}  0.63  0.93  0.63  0.97  0.79  1.09  0.43  1.19 
Poly^{b}  0.63  1.03  0.62  1.06  0.76  1.17  0.42  1.19  
Jersey reference to predict Jersey validation  
GBLUP  +Poly^{a}  0.59  0.93  0.65  0.91  0.54  0.71  0.15  1.05 
Poly^{b}  0.58  1.05  0.64  1.09  0.54  0.77  0.14  1.08  
BR  +Poly^{a}  0.64  0.94  0.68  0.93  0.71  0.87  0.15  1.02 
Poly^{b}  0.63  0.98  0.68  1.01  0.69  0.93  0.14  1.04  
EM  +Poly^{a}  0.64  0.87  0.68  0.92  0.69  0.75  0.15  1.09 
Poly^{b}  0.64  0.98  0.66  1.01  0.67  0.79  0.15  1.09  
HB  +Poly^{a}  0.64  0.97  0.68  0.90  0.71  0.89  0.15  1.02 
Poly^{b}  0.64  1.06  0.66  0.96  0.69  0.87  0.15  1.02 
The accuracy and bias of multibreeds prediction of GBLUP, BayesR(BR), emBayesR (EM), and HyB_BR (HB)
Milk yield  Protein yield  Fat%  Fertility  

Acc.  Bias  Acc.  Bias  Acc.  Bias  Acc.  Bias  
Holstein and Jersey reference to predict Holstein validation  
GBLUP  +Poly^{a}  0.63  0.83  0.65  0.85  0.74  0.85  0.44  1.66 
Poly^{b}  0.62  0.90  0.57  0.88  0.72  0.90  0.42  1.66  
BR  +Poly^{a}  0.68  0.84  0.68  0.88  0.81  0.90  0.44  1.53 
Poly^{b}  0.67  0.91  0.67  1.03  0.79  0.98  0.42  1.53  
EM  +Poly^{a}  0.68  0.90  0.68  0.79  0.77  0.83  0.44  1.27 
Poly^{b}  0.65  0.91  0.66  0.85  0.75  0.87  0.44  1.27  
HB  +Poly^{a}  0.68  0.82  0.67  0.88  0.81  0.94  0.44  1.33 
Poly^{b}  0.67  0.89  0.67  0.95  0.80  1.08  0.44  1.33  
Holstein and Jersey reference to predict Jersey validation  
GBLUP  +Poly^{a}  0.64  0.78  0.68  0.85  0.66  0.73  0.24  1.12 
Poly^{b}  0.64  0.90  0.69  1.02  0.64  0.80  0.24  1.12  
BR  +Poly^{a}  0.69  0.85  0.71  0.99  0.76  0.88  0.26  1.23 
Poly^{b}  0.68  0.95  0.71  1.09  0.74  0.94  0.25  1.24  
EM  +Poly^{a}  0.66  0.84  0.69  0.71  0.75  0.76  0.23  1.13 
Poly^{b}  0.63  0.86  0.68  0.73  0.70  0.82  0.23  1.13  
HB  +Poly^{a}  0.71  0.89  0.74  0.94  0.77  0.89  0.26  1.02 
Poly^{b}  0.69  0.98  0.73  1.02  0.73  0.97  0.26  1.02 
The accuracy and bias of acrossbreeds prediction of BayesR, GBLUP, and HyB_BR
Milk yield  Protein yield  Fat%  Fertility  

Acc.  Bias  Acc.  Bias  Acc.  Bias  Acc.  Bias  
Across breeds prediction on Australian red bulls  
GBLUP  0.16  0.54  0.11  0.51  0.32  0.90  0.29  0.97 
BayesR  0.22  0.60  0.12  0.49  0.45  0.92  0.27  1.03 
EmBayesR  0.24  0.70  0.12  0.42  0.41  0.89  0.29  1.10 
HyB_BR  0.23  0.74  0.17  0.49  0.50  0.90  0.30  0.98 
Across breeds prediction on Australian red cows  
GBLUP  0.15  0.71  0.08  0.13  0.50  1.19  0.08  0.79 
BayesR  0.26  0.80  0.17  0.51  0.54  0.94  0.08  0.68 
EmBayesR  0.24  0.79  0.16  0.53  0.51  0.89  0.08  0.74 
HyB_BR  0.26  0.81  0.16  0.57  0.55  0.91  0.08  0.70 
These timings were recorded on a server with Intel E52680 2.7GHz processors and 384GB of 1333 MHz RAM.
The accuracy and bias of withinbreeds, multibreeds and acrossbreeds prediction for four complex dairy traits
Genomic prediction with a single breed reference
For the withinbreed prediction (that is, when a Holstein reference was used to estimate SNP effects used for calculating GEBV in a Holstein validation set, and likewise for Jersey) in Table 5, HyB_BR performed as well as BayesR for all traits, including fat%. Both BayesR and HyB_BR had a 1 % ~ 6 % superiority of accuracy over GBLUP for Milk yield, Protein yield and Fat%, but had no advantage for fertility. Similarly, for the prediction of Jersey validation with Jersey reference, BayesR and HyB_BR had a consistent advantage over GBLUP for milk production traits, but not for fertility. Especially, for the trait Fat%, BayesR and HyB_BR gave very similar results, with a 17 % increase in accuracy (0.79 vs 0.73 in Holstein and 0.71 vs 0.54 in Jersey) of genomic prediction over GBLUP, as well as a 5 % increase in accuracy over emBayesR. HyB_BR and BayesR also gave regression coefficients closer to one than GBLUP for most traits.
Genomic prediction with a multibreed reference
When predicting the Holstein or Jersey validation with the combined Holstein and Jersey reference, HyB_BR had the same accuracy as BayesR, Table 5. Compared with GBLUP, BayesR and HyB_BR gave consistently higher accuracy increase for the milk production traits, though this was not observed for fertility. And for the prediction of Jersey validation set, BayesR and HyB_BR improved accuracy for the milk production traits by 11 % compared with GBLUP. The results show that there are small but consistent accuracy improvements as a result of using the multibreed reference (compare Tables 5 and 6), consistent with the results of Kemper et al. [15] and Hoze et al. [46].
Also, including polygenic effects (estimated using the pedigree) in the model can improve the prediction accuracy by 1 ~ 2 %, at least for milk production traits, Tables 5 and 6. However, for fertility the introduction of polygenic effects for all the prediction methods did not impact the accuracy at all.
Compared with GBLUP and emBayesR, BayesR and HyB_BR gave less biased predictions for milk production traits. However for fertility the regression values far from one indicate bias, from all methods – the low accuracy of fertility phenotypes, including in the validation set, likely contributes to this.
Genomic prediction across breeds
For predicting Australian Red validation bulls (an additional breed to those in the reference set), BayesR and HyB_BR gave higher accuracy than GBLUP for all traits (Table 7).
Across all the prediction results shown in Tables 5, 6 and 7, emBayesR had a 2 % ~ 5 % reduction in accuracy compared with BayesR and HyB_BR for fat%, while BayesR and HyB_BR gave almost identical accuracies in all cases.
Inferred genetic architecture and QTL mapping for dairy production and fertility traits
The number of SNPs in each of four distributions
Traits  The proportion (Pr)  BayesR  emBayesR  HyB_BR 

Milk yield  A. 0.01 * σ _{ g } ^{2}  8  6  8 
B. 0.001 * σ _{ g } ^{2}  47  17  327  
C. 0.0001 * σ _{ g } ^{2}  3886  1523  4039  
D. 0  628,062  630,457  627,629  
Protein yield  A. 0.01 * σ _{ g } ^{2}  5  4  6 
B. 0.001 * σ _{ g } ^{2}  32  37  297  
C. 0.0001 * σ _{ g } ^{2}  4431  1842  6604  
D. 0  627,535  630,120  625,096  
Fat%  A. 0.01 * σ _{ g } ^{2}  23  19  20 
B. 0.001 * σ _{ g } ^{2}  46  206  119  
C. 0.0001 * σ _{ g } ^{2}  2882  1206  1852  
D. 0  629,052  630,572  630,012  
Fertility  A. 0.01 * σ _{ g } ^{2} .  10  8  12 
B. 0.001 * σ _{ g } ^{2}  147  114  202  
C. 0.0001 * σ _{ g } ^{2}  3949  8572  7597  
D. 0  627,897  623,309  624,192 
QTL mapping for dairy production and fertility traits
Both BayesR and HyB_BR estimate the posterior probability that every SNP has a nonzero effect on the trait. This is useful for QTL mapping – SNP with very high posterior probabilities of having a nonzero effect should be strongly associated with causal mutations (e.g. Moser et al. [8], Kemper et al. [15]). Then, QTL mapping from BayesR and HyB_BR can be conducted by plotting the posterior probability of each SNPs having a nonzero effect on the trait by genome position, and then comparing the genome location of the effects with a high posterior probability of being in the largest distribution for each method.
The list of identified causal mutations by both BayesR and HyB_BR
Traits  Loci  Information (known genes)  Range (bp) [Start points ~ End points] 

Milk yield  Chr5:75786153  CSF2RB impacting milk yield [47].  [75724620 ~ 75745819] 
Chr6:88741491  GC, encoding the vitamin D binding protein, positively impacting the milk yield [48].  [88695940 ~ 88749180]  
Chr20:30116920  In association with CCL28/GHR impacting milk production [18].  [31890736 ~ 32199996]  
Protein yield  Chr6:87180731  CSN1S1 positively impacting protein yield [48].  [87141556 ~ 87159096] 
Chr11:103302351  PAEP impacting protein yield [19].  [103301664 ~ 103306381]  
Fat%  Chr5:93945655  MGST1 for Fat percent [49].  [93926791 ~ 3950162] 
Fertility  Chr18:57548213  In association with the gene CEACAM18, Detected by Pryce et al. [50], Cole et al. [51].  ~57MBP 
Chr21:53500339   Control the percentage of unassisted births in first calf heifers [52].  ~53MBP  
Chr23:51131682  In the linkage with the known gene GMDS [53].  ~51MBP  
All the traits  Chr14:1801116  DGAT1 impacting Fat percent [16].  [1795351 ~ 1804562] 
The application of HyB_BR to predict the risk of Human disease traits and infer genetic architecture for these traits
The prediction performance evaluated by the Area under curve (AUC) of GBLUP, BayesR and HyB_BR on seven diseases
Diseases  GBLUP  BayesR  HyB_BR  

AUC  h ^{2}  AUC  h ^{2}  AUC  h ^{2}  
BD  0.63(0.0135)  0.71  0.63(0.0131)  0.63  0.64(0.0174)  0.63 
CAD  0.58(0.0116)  0.38  0.59(0.0118)  0.38  0.58(0.0131)  0.38 
CD  0.60(0.0134)  0.69  0.65(0.0159)  0.61  0.65(0.0158)  0.61 
HT  0.58(0.0125)  0.53  0.58(0.0131)  0.52  0.58(0.0140)  0.51 
RA  0.58(0.0109)  0.50  0.70(0.0104)  0.45  0.70(0.0107)  0.45 
T1D  0.64(0.0133)  0.66  0.86(0.0099)  0.63  0.86(0.0102)  0.63 
T2D  0.59(0.0139)  0.59  0.60(0.0117)  0.52  0.60(0.0122)  0.52 
The genetic architecture of human disease traits
The number of SNPs in each proportion of four distributions estimated by BayesR, and HyB_BR on seven human diseases
Compared with BayesR, HyB_BR detected the same number of SNPs with moderate variance (the variance 0.01 * σ _{ g } ^{2} ) but appeared to systematically detect more SNPs in the proportion of small variance (the variance 0.0001 * σ _{ g } ^{2} ), similar to the results observed for the comparison of BayesR and HyB_BR in in dairy cattle data (Table 8).
Discussion
We have presented a novel and computationally efficient algorithm termed HyB_BR for simultaneous genomic prediction and QTL mapping. A pure EM algorithm was less accurate for some traits, while pure MCMC requires very long computation times. Therefore, HyB_BR implements the EM algorithm followed by a limited number of MCMC iterations. In this way, the algorithm takes advantage of the features of an EM algorithm (rapid convergence) and the higher accuracy from MCMC implementations in a hybrid scheme. Our accuracies of genomic prediction for complex traits in human and cattle from HyB_BR are almost identical to those from the full MCMC implementation of the Bayesian mixture model, with a 10 fold or greater reduction in computing time required.
For the pure MCMC algorithm, the burnin stage can account for up to 50 % of the total running time. One of the key advantages of HyB_BR is that the EM module effectively replaces the burnin cycles that are usually required for MCMC. Based on the starting point from EM (with very limited number of iterations; less than 500 iterations), the running time of HyB_BR can be much reduced.
The pure EM algorithm, EmBayesR [30] has been demonstrated to be much faster than BayesR, but had lower accuracy for some traits, particularly those with mutations of moderate to large effect. For example, when implemented on the trait fat% in dairy cattle, emBayesR had a decreased accuracy of 5 % ~ 7 % compared to BayesR. One possible explanation is that emBayesR shrinks SNP effects too much (shown in Table 8). This could be because the PEV that is used to account for the error of the effects of all the other SNPs while estimating the effect of the current SNP is only an approximation. The introduction of PEV correction is based on one observation: previous fast algorithm studies (especially Iterative conditional expectation algorithms) assumed the effects of the other SNPs were estimated perfectly while estimating the effect of the current SNP, leading to poor performance [30]. Therefore, EmBayesR and the EM part of HyB_BR allow for the errors in the effect of other SNPs and other location parameters by using the PEV. The calculation of the PEV from GBLUP is carried out before the iterations to estimate the effects of each SNP. And since the normal priors from GBLUP model do not allow for SNPs of moderate to large effects, such PEV calculation is an approximation and this may be one reason for loss of accuracy in the EM. To deal with this, HyB_BR further implements a small number of MCMC iterations to improve the outcome of pure EM steps.
HyB_BR has three advantages. First, as the size of genomic data increases, the computational efficiency of HyB_BR without burnin stage (a small number of O(mn) iterations), is greater than BayesR by full MCMC. And when implemented with the speedup scheme described in the methods, computational time can be reduced even further, by sampling a reduced set of SNPs in the MCMC module, apparently with no loss of accuracy (but critically the information from the SNPs that are not sampled remains in the posterior proportions of SNPs in each distribution). Second, the prediction accuracy of HyB_BR is comparable to BayesR in all cases including dairy cattle and human disease prediction shown in Tables 5, 6 and 7, and Table 10. Third, HyB_BR, like BayesR, is flexible with respect to the genetic architecture of complex traits. As shown in Tables 5, 6 and 7, HyB_BR performs well on four different complex traits, with architecture ranging from highly polygenic architecture to genetic architecture controlled by major genes. In addition to the prediction on the continuous quantitative traits of dairy cattle, the investigation on the risk prediction of seven case/control human diseases with binary 0/1 phenotypes shows HyB_BR and BayesR perform on this type of data, Table 10. Finally, the posterior probabilities of SNP having a nonzero effect from HyB_BR can be used for QTL mapping, Fig. 6.
The predicted computational time (in hours) of HyB_BR and BayesR on high density data with different number of variants and the same number of individuals (16,214)
Different number of markers  

800 K SNP panel  1 million  2 million  30 millions  
BayesR  720 h  1139 h  2278 h  34,170 h 
HyB_BR  42 h  67 h  134 h  2010 h 
While HyB_BR performs well with computational efficiency and robust prediction accuracy, there are at least still two strategies that could be used to further improve efficiency. There is one key part of EM module that consumes running time and memory: the calculation of tr(E ^{− 1} Z _{ i } Z _{ i } ^{'} E ^{− 1} PEV) for each SNP in front of EM iterations. In detail, the calculation of tr(E ^{− 1} Z _{ i } Z _{ i } ^{'} E ^{− 1} PEV) requires the time complexity of \( O\left(\frac{1}{2}m{n}^2\right) \), which accounts for almost 2/3 of the computational time required for the EM module even though the calculation is made in front of the EM iterations. Therefore, a future task is to implement a multithreaded version to improve speed. The threshold of limiting the number of SNPs to be updated requires further study. Currently we define the threshold as T: if(P(i, 1) > 0.9), which is applicable for the current data. However, it’s uncertain whether or not such a threshold is suitable for other types of data.
HyB_BR has some features in common with other mixture methods such as BSLMM [6], and BOLTLMM [25]. All of these methods declared the merit of computational efficiency with time complexity O(mn) but under different mixture models. In detail, BSLMM assumed a large proportion of SNPs with small effects (under BLUP models), while others had large effects (under Bayesian sparse regression models; the mixture of two normal priors). Due to limited number of SNPs implemented for MCMC sampling (large proportion of SNPs are under GBLUP models), BSLMM could be computationally efficient. However, compared with the mixture of four normal distributions by BayesR which provided great flexibility with respect genetic architecture, the flexibility of BSLMM with respect to different genetic architectures required further investigation. Another algorithm is BOLTLMM, which has been developed mainly for the association studies. BOLTLMM incorporated Bayesian mixture models to improve the power of GWAS with appealing outcomes. Instead of MCMC sampling, BOLTLMM implemented iterative conditional expectation (ICE) algorithm on a mixture of two normal distributions to improve the computational speed with the approximated computational complexity O(mn). There could be three limitations with this method: 1) ICE algorithms did not account for the PEVs from all other SNP effects during the estimation of current SNP effect. On practical data sets, ICE could lead to the loss of prediction accuracy. BOLTLMM introduced LD score regression technique to calibrate the prediction errors. However, since the calibrating factor was constant across all the SNPs (the prediction error variance regarding each SNP differs according to our equation (E ^{− 1} Z _{ i } Z _{ i } ^{'} E ^{− 1} PEV)), such calibration scheme seem not to be effective to solve the problem. 2) The leaveonechromosomeout scheme implemented in BOLTLMM might perform well for GWAS but not be suitable for simultaneous genomic prediction. 3) BOLTLMM treated each SNP effect as a fixed effect for the association statistics. This combined with the stringent significance threshold for multiple testing, leaded to the overestimation for SNP effects. Another efficient method for genomic prediction termed MultiBLUP [7] introduced SNPs clusters into BLUP models according to its adaptive algorithm. For each SNP class, the linear combination models (using genomic relationship matrix) similar to GBLUP were implemented. MultiBLUP has been demonstrated to be computationally efficient with robust prediction accuracy in the human data sets. However, when moved to dairy cattle genomic data sets, there is long Linkage disequilibrium (LD) between markers, which might be easily broken up by multiBLUP models.
Conclusions
In summary, HyB_BR is a computationally efficient method for simultaneous genomic prediction, QTL mapping and inference of genetic architecture. The hybrid scheme of MCMC and EM decreases computational time by a factor of at least 10 fold with no reduction in prediction accuracy. The HyB_BR algorithm makes simultaneous genomic prediction, QTL mapping and inference of genetic architecture feasible in extremely large genomic data sets including whole genome sequence data.
Declarations
Acknowledgements
The authors acknowledge the support from Dairy Futures CRC project.
Availability of data and materials
The WTCCC data are available to researchers by application to the Welcome Trust Case Control Consortium Data Access Committee (http://www.wtccc.org.uk/info/access_to_data_samples.html), or contact egaadmin@ebi.ac.uk. Application is required to ensure proper protection of confidentiality of the participants.
For dairy cattle data, we can provide metaanalysis data related to our paper which can be easily used to conduct the analysis by other researchers.
The HyB_BR compiled program is available for request for noncommercial research.
Authors’ contributions
BJH and YPPC supervised this project; TW developed HyB_BR algorithm, analysed the data and drafted the manuscript; BJH gave important instructions on organizing and revising the manuscript. MEG contributed the valuable idea about hybrid scheme; PJB provided help with C++ programming. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001;157(4):1819–29.PubMedPubMed CentralGoogle Scholar
 Goddard ME, Hayes BJ. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet. 2009;10(6):381–91.View ArticlePubMedGoogle Scholar
 Meuwissen T, Hayes B, Goddard M. Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci. 2013;1(1):221–37.View ArticlePubMedGoogle Scholar
 Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–9.View ArticlePubMedPubMed CentralGoogle Scholar
 de los Campos G, Gianola D, Allison DB. Predicting genetic predisposition in humans: the promise of wholegenome markers. Nat Rev Genet. 2010;11(12):880–6.View ArticlePubMedGoogle Scholar
 Zhou X, Carbonetto P, Stephens M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet. 2013;9(2):e1003264.View ArticlePubMedPubMed CentralGoogle Scholar
 Speed D, Balding DJ. MultiBLUP: improved SNPbased prediction for complex traits. Genome Res. 2014;24(9):1550–7.View ArticlePubMedPubMed CentralGoogle Scholar
 Moser G, Lee SH, Hayes BJ, Goddard ME, Wray NR, Visscher PM. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 2015;11(4):e1004969.View ArticlePubMedPubMed CentralGoogle Scholar
 VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414–23.View ArticlePubMedGoogle Scholar
 VanRaden PM, Null DJ, Sargolzaei M, Wiggans GR, Tooker ME, Cole JB, Sonstegard TS, Connor EE, Winters M, van Kaam JBCHM, et al. Genomic imputation and evaluation using highdensity holstein genotypes. J Dairy Sci. 2013;96(1):668–78.View ArticlePubMedGoogle Scholar
 Wolc A, Zhao HH, Arango J, Settar P, Fulton JE, O’Sullivan NP, Preisinger R, Stricker C, Habier D, Fernando RL, et al. Response and inbreeding from a genomic selection experiment in layer chickens. Genet Sel Evol. 2015;47:59.View ArticlePubMedPubMed CentralGoogle Scholar
 Gianola D. Priors in wholegenome regression: the Bayesian alphabet returns. Genetics. 2013;194(3):573–96.View ArticlePubMedPubMed CentralGoogle Scholar
 Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the bayesian alphabet for genomic selection. BMC Bioinf. 2011;12(1):1–12.View ArticleGoogle Scholar
 Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed highdensity single nucleotide polymorphism panels. J Dairy Sci. 2012;95(7):4114–29.View ArticlePubMedGoogle Scholar
 Kemper KE, Reich CM, Bowman PJ, vander Jagt CJ, Chamberlain AJ, Mason BA, Hayes BJ, Goddard ME. Improved precision of QTL mapping using a nonlinear Bayesian method in a multibreed population leads to greater accuracy of acrossbreed genomic predictions. Genet Sel Evol. 2015;47:29.View ArticlePubMedPubMed CentralGoogle Scholar
 Grisart B, Coppieters W, Farnir F, Karim LCF, Berzi P, Cambisano N, Mni M, Reid S, Simon P, et al. Positional candidate cloning of a QTL in dairy cattle: identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition. Genome Res. 2002;12(2):222–31.View ArticlePubMedGoogle Scholar
 NgKwaiHang K. A review of the relationship between milk protein polymorphism and milk composition/milk production. In: Proceedings of the international dairy federation seminar: 25–27 febuary, 1997 1997; palmerston north, New Zealand. 1997. p. 22–37.Google Scholar
 Blott S, Kim JJ, Moisio S, SchmidtKüntzel A, Cornet A, Berzi P, Cambisano N, Ford C, Grisart B, Johnson D, et al. Molecular dissection of a quantitative trait locus: a phenylalaninetotyrosine substitution in the transmembrane domain of the bovine growth hormone receptor is associated with a major effect on milk yield and composition. Genetics. 2003;163(1):253–66.PubMedPubMed CentralGoogle Scholar
 Wang X, Wurmser C, Pausch H, Jung S, Reinhardt F, Tetens J, Thaller G, Fries R. Identification and dissection of four major QTL affecting milk Fat content in the German holsteinfriesian population. PLoS One. 2012;7(7):e40711.View ArticlePubMedPubMed CentralGoogle Scholar
 Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM, et al. Mixed linear model approach adapted for genomewide association studies. Nat Genet. 2010;42(4):355–60.View ArticlePubMedPubMed CentralGoogle Scholar
 Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genomewide association studies. Nat Meth. 2011;8(10):833–5.View ArticleGoogle Scholar
 Zhou X, Stephens M. Genomewide efficient mixedmodel analysis for association studies. Nat Genet. 2012;44(7):821–4.View ArticlePubMedPubMed CentralGoogle Scholar
 Listgarten J, Lippert C, Kadie CM, Davidson RI, Eskin E, Heckerman D. Improved linear mixed models for genomewide association studies. Nat Meth. 2012;9(6):525–6.View ArticleGoogle Scholar
 Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL. Advantages and pitfalls in the application of mixedmodel association methods. Nat Genet. 2014;46(2):100–6.View ArticlePubMedPubMed CentralGoogle Scholar
 Loh PR, Tucker G, BulikSullivan BK, Vilhjalmsson BJ, Finucane HK, Salem RM, Chasman DI, Ridker PM, Neale BM, Berger B, et al. Efficient Bayesian mixedmodel analysis increases association power in large cohorts. Nat Genet. 2015;47(3):284–90.View ArticlePubMedPubMed CentralGoogle Scholar
 Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, Chu AY, Estrada K, Luan J, Kutalik Z, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014;46(11):1173–86.View ArticlePubMedPubMed CentralGoogle Scholar
 The Wellcome Trust Case Control Consortium. Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–78.View ArticlePubMed CentralGoogle Scholar
 Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brondum RF, Liao X, Djari A, Rodriguez SC, Grohs C, et al. Wholegenome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat Genet. 2014;46(8):858–65.View ArticlePubMedGoogle Scholar
 Calus MPL. Righthandside updating for fast computing of genomic breeding values. Genet Sel Evol. 2014;46:24.View ArticlePubMedPubMed CentralGoogle Scholar
 Wang T, Chen YPP, Goddard ME, Meuwissen THE, Kemper KE, Hayes BJ. A computationally efficient algorithm for genomic prediction using a Bayesian model. Genet Sel Evol. 2015;47:34.View ArticlePubMedPubMed CentralGoogle Scholar
 Meuwissen THE, Solberg TR, Shepherd R, Woolliams JA. A fast algorithm for BayesB type of prediction of genomewide estimates of genetic value. Genet Sel Evol. 2009;41:2.View ArticlePubMedPubMed CentralGoogle Scholar
 Yu X, Meuwissen THE. Using the pareto principle in genomewide breeding value estimation. Genet Sel Evol. 2011;43:35.View ArticlePubMedPubMed CentralGoogle Scholar
 Shepherd RK, Meuwissen THE, Woolliams JA. Genomic selection and complex trait prediction using a fast EM algorithm applied to genomewide markers. BMC Bioinf. 2010;11(1):1–12.View ArticleGoogle Scholar
 Hayashi T, Iwata H. EM algorithm for Bayesian estimation of genomic breeding values. BMC Genet. 2010;11(1):1–9.View ArticleGoogle Scholar
 Sun X, Qu L, Garrick DJ, Dekkers JCM, Fernando RL. A fast EM algorithm for BayesAlike prediction of genomic breeding values. PLoS One. 2012;7(11):e49157.View ArticlePubMedPubMed CentralGoogle Scholar
 Garrick D, Taylor J, Fernando R. Deregressing estimated breeding values and weighting information for genomic regression analyses. Genet Sel Evol. 2009;41(1):55.View ArticlePubMedPubMed CentralGoogle Scholar
 Henderson C. Application of linear models in animal breeding. Canada: University of Guelph; 1984.Google Scholar
 Browning BL, Browning SR. A unified approach to genotype imputation and haplotypephase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84(2):210–23.View ArticlePubMedPubMed CentralGoogle Scholar
 HaileMariam M, Bowman PJ, Pryce JE. Genetic analyses of fertility and predictor traits in Holstein herds with low and high mean calving intervals and in Jersey herds. J Dairy Sci. 2013;96(1):655–67.View ArticlePubMedGoogle Scholar
 HaileMariam M, Pryce JE, Schrooten C, Hayes BJ. Including overseas performance information in genomic evaluations of Australian dairy cattle. J Dairy Sci. 2015;98(5):3443–59.View ArticlePubMedGoogle Scholar
 Gilmour A, Cullis B, Welham S, Thompson R. ASReml reference manual 2nd edition, NSW agriculture biometrical bulletin 3. 2002.Google Scholar
 Lee Sang H, Wray Naomi R, Goddard Michael E, Visscher Peter M. Estimating missing heritability for disease from genomewide association studies. Am J Human Gen. 2011;88(3):294–305.View ArticleGoogle Scholar
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al. PLINK: a tool Set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007;81(3):559–75.View ArticlePubMedPubMed CentralGoogle Scholar
 Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genomewide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82.View ArticlePubMedPubMed CentralGoogle Scholar
 Wray NR, Yang J, Goddard ME, Visscher PM. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet. 2010;6(2):e1000864.View ArticlePubMedPubMed CentralGoogle Scholar
 Hozé C, Fritz S, Phocas F, Boichard D, Ducrocq V, Croiseau P. Efficiency of multibreed genomic selection for dairy cattle breeds with different sizes of reference population. J Dairy Sci. 2014;97(6):3918–29.View ArticlePubMedGoogle Scholar
 Chamberlain AJ, Vander Jagt CJ, Hayes BJ, Khansefid M, Marett LC, Millen CA, Nguyen TTT, Goddard ME. Extensive variation between tissues in allele specific expression in an outbred mammal. BMC Genomics. 2015;16(1):1–20.View ArticleGoogle Scholar
 Sanders K, Bennewitz J, Reinsch N, Thaller G, Prinzenberg EM, Kühn C, Kalm E. Characterization of the DGAT1 mutations and the CSN1S1 promoter in the German angeln dairy cattle population. J Dairy Sci. 2006;89(8):3164–74.View ArticlePubMedGoogle Scholar
 Raven LA, Cocks BG, Kemper KE, Chamberlain AJ, Jagt CJ, Goddard ME, Hayes BJ. Targeted imputation of sequence variants and gene expression profiling identifies twelve candidate genes associated with lactation volume, composition and calving interval in dairy cattle. Mamm Genome. 2015;27(1):81–97.PubMedGoogle Scholar
 Pryce JE, Bolormaa S, Chamberlain AJ, Bowman PJ, Savin K, Goddard ME, Hayes BJ. A validated genomewide association study in 2 dairy cattle breeds for milk production and fertility traits using variable length haplotypes. J Dairy Sci. 2010;93(7):3331–45.View ArticlePubMedGoogle Scholar
 Cole JB, Wiggans GR, Ma L, Sonstegard TS, Lawlor TJ, Crooker BA, Van Tassell CP, Yang J, Wang S, Matukumalli LK, et al. Genomewide association analysis of thirty one production, health, reproduction and body conformation traits in contemporary U.S. Holstein cows. BMC Genomics. 2011;12(1):1–17.View ArticleGoogle Scholar
 McClure MC, Morsci NS, Schnabel RD, Kim JW, Yao P, Rolf MM, McKay SD, Gregg SJ, Chapple RH, Northcutt SL, et al. A genome scan for quantitative trait loci influencing carcass, postnatal growth and reproductive traits in commercial Angus cattle. Anim Genet. 2010;41(6):597–607.View ArticlePubMedGoogle Scholar
 Wickramasinghe S, Hua S, Rincon G, IslasTrejo A, German JB, Lebrilla CB, Medrano JF. Transcriptome profiling of bovine milk oligosaccharide metabolism genes using RNAsequencing. PLoS One. 2011;6(4):e18895.View ArticlePubMedPubMed CentralGoogle Scholar
 Consortium TGP. A map of human genome variation from populationscale sequencing. Nature. 2010;467(7319):1061–73.View ArticleGoogle Scholar