Application of a Bayesian non-linear model hybrid scheme to sequence data for genomic prediction and QTL mapping

Background Using whole genome sequence data might improve genomic prediction accuracy, when compared with high-density SNP arrays, and could lead to identification of casual mutations affecting complex traits. For some traits, the most accurate genomic predictions are achieved with non-linear Bayesian methods. However, as the number of variants and the size of the reference population increase, the computational time required to implement these Bayesian methods (typically with Monte Carlo Markov Chain sampling) becomes unfeasibly long. Results Here, we applied a new method, HyB_BR (for Hybrid BayesR), which implements a mixture model of normal distributions and hybridizes an Expectation-Maximization (EM) algorithm followed by Markov Chain Monte Carlo (MCMC) sampling, to genomic prediction in a large dairy cattle population with imputed whole genome sequence data. The imputed whole genome sequence data included 994,019 variant genotypes of 16,214 Holstein and Jersey bulls and cows. Traits included fat yield, milk volume, protein kg, fat% and protein% in milk, as well as fertility and heat tolerance. HyB_BR achieved genomic prediction accuracies as high as the full MCMC implementation of BayesR, both for predicting a validation set of Holstein and Jersey bulls (multi-breed prediction) and a validation set of Australian Red bulls (across-breed prediction). HyB_BR had a ten fold reduction in compute time, compared with the MCMC implementation of BayesR (48 hours versus 594 hours). We also demonstrate that in many cases HyB_BR identified sequence variants with a high posterior probability of affecting the milk production or fertility traits that were similar to those identified in BayesR. For heat tolerance, both HyB_BR and BayesR found variants in or close to promising candidate genes associated with this trait and not detected by previous studies. Conclusions The results demonstrate that HyB_BR is a feasible method for simultaneous genomic prediction and QTL mapping with whole genome sequence in large reference populations.


Background
Whole genome sequence data is available for an increasing number of species. In some cases enough individuals have been sequenced to serve as a reference panel for imputation of individuals that have been genotyped with SNP arrays to whole genome sequence variant genotypes. A good example of such a reference set is the 1000 bull genomes project which includes 234 bulls with wholegenome sequencing data and 28.3 million genotyped sequence variants [1]. Compared with dense SNP arrays, the advantage of using whole genome sequence data might potentially include more accurate genomic predictions within and across breeds [2][3][4][5], better persistence of accuracy of genomic predictions across generations, and more precise QTL mapping [5], all as a result of including the causal mutation genotypes in the data set.
As the resulting data sets will be extremely large (thousands of individuals with millions of imputed genotypes), the algorithms used to derive genomic predictions must be computationally efficient. Ideally, they should also implement a non-linear model at the level of the SNP effects, including the possibility of excluding some SNPs from the model, as such models have been demonstrated to give higher accuracies of genomic predictions for some traits with high-density genotype data [5,6]. Although computationally efficient, GBLUP and BLUP do not satisfy the second criteria (they implement a linear model and all SNPs are always in the genomic prediction model). BayesR [7] is a flexible non-linear model, which assumes that SNP effects follow a mixture of four normal distributions (with zero variance, very small variance, small variance, and moderate variance). Compared with GBLUP, BayesR results in superior accuracy of genomic prediction for some traits [6,[8][9][10][11][12]. However, as Bayesian models are typically implemented with MCMC (Markov Chain Monte Carlo) sampling, application of BayesR with sequence data is currently not feasible.
Another advantage of non-linear models such as BayesR, is the application of QTL mapping [5,6,8,13,14]. Loh et al. [14] pointed out that Bayesian mixed-models with speed-up schemes (termed fastBayesB [15]) could improve the power of detecting genes associated with human diseases. There are several modified versions of Bayesian model implemented for the identification of causal mutations. Speed and Balding [13] developed an efficient approach termed multiBLUP (a mixture model of SNP effects, similar to nonlinear models), which was applied on the Welcome Trust Case Control Consortium (WTCCC) human disease data. Later, Kemper et al. [6] implemented a nonlinear model (BayesR) for mapping QTL to 250 kb windows in dairy cattle. Then, Moser et al. [8] applied a modified version of BayesR (updating the additive genetic variance in the MCMC chain instead of fixing it, as in the original BayesR) to WTCCC human disease data. Furthermore, MacLeod et al. [5] proposed the algorithm referred to as BayesRC, which is a modified version of BayesR incorporating biological prior information. All these studies have demonstrated that nonlinear models, which might exclude SNPs from the models with the assumptions of Bayesian mixture priors for SNP effects, could actually help to improve the precision of QTL mapping or association studies in human or dairy cattle.
To take advantage of the accuracy superiority of MCMC nonlinear models but improve their timeefficiency, a hybrid scheme (termed HyB_BR) was proposed by Wang et al. [16]. This scheme has three steps: 1) Implement the mixture model of BayesR, which had been demonstrated to be quite flexible for genomic prediction; 2) run an expectation-maximisation algorithm that estimates the parameters in the mixture model; 3) Using the solutions from the EM as starting points, run a limited number of MCMC iterations to improve the parameter estimates. The results of the Hybrid algorithm on 600 K SNP data in dairy cattle data and 300 K SNP data in human disease data from Welcome Trust Case Control Consortium (WTCCC) have demonstrated that the Hybrid algorithm performed as well as BayesR while requiring half of the running time demanded by MCMC iterations [16].
With the aim of investigating whether HyB_BR gave comparable accuracies to BayesR with MCMC for genomic prediction and precision of QTL mapping with whole genome sequence data, we implemented HyB_BR on a large subset of imputed whole-genome sequence data with 994,019 variants in 16,214 cattle. The genotype data came from the imputed sequence variants in or close to gene coding regions and some SNP from the 600 K Bovine HD SNP genotypes. The HyB_BR algorithm was evaluated on this data set with three criteria: 1) computational performance (speed) compared to a full MCMC implementation, 2) prediction accuracy for a range of complex traits with different genetic architecture. The traits included fat yield, milk yield, protein yield, fat percent, protein percent, fertility and heat tolerance and 3) the precision of HyB_BR for QTL mapping of milk production traits, fertility and heat tolerance.

High density and sequence genotypes
Two types of genomic data, 600 K Bovine HD SNP array, and imputed sequence data were used in this study. As described by Kemper et al. (2015) [6], 10,311 Holstein, 4738 Jersey and 249 Australian Red bulls and cows were genotyped with the Bovine SNP50 Array (Illumina, San Diego, CA). In addition, 1620 Holstein bulls and cows, 125 Jersey bulls, and 114 Australian Red bulls were genotyped with the 777 K bovine HD SNP panel. After quality control steps described by Erbe et al. (2012) [7], all genotypes were imputed to 632,003 SNP using Beagle 3.0 [17].
For the Sequence data set (termed SEQ), the sequences of 136 Holstein and 27 Jersey bulls from the 1000 Bulls Genome Project [1] were used as a reference set for imputation. All the animals described above with real or imputed 600 K SNP genotypes were imputed to whole genome sequence data using Beagle 3.0 software [18]. In total there were 2.785 million sequence variants imputed, including both SNPs and indels in either coding regions or putative regulatory regions flanking genes [5]. After quality control including minor allele frequency filtering and LD pruning by PLINK [19], there were 994,019 variants remaining including 370,259 markers from the 600 K SNP panel, and 623,760 sequence variants in gene coding regions or 5000 bp upand down-stream of the gene start stop positions as detailed by MacLeod et al. (2016) [5].

Phenotypes
Protein, fat and milk yields and, fat and protein percent are key traits in dairy cattle breeding. Phenotypes that were pre-corrected for fixed effects (herd-year-season, and lactation number) were used in this study, these are known as trait deviations (TDs) and daughter trait deviations (DTDs) for cows and bulls respectively. TDs and DTDs are provided by DataGene (and its predecessor, the Australian Dairy Herd Improvement Scheme), which is the organisation responsible for providing genetic evaluations to the Australian dairy industry. (e.g. Data-Gene; http://datagene.com.au/index.php). A summary of the phenotypes is shown in Table 1. For milk production traits, there were 16,214 bulls and cows from Holstein and Jersey breeds as the reference set. Then, for the validation sets, Holstein and Jersey bulls were used to assess the accuracy of within-breed prediction. These bulls were the youngest cohorts (born after 2005) in the data set. As mentioned in [6], all the bulls of the validation set have more than 20 effective daughters. In addition, Australian Red bulls (a third breed; not included in the reference set) were included for the validation set to evaluate the performance of across-breed prediction. We implemented the calculation of Garrick et al. (2009) [20] to appropriately weight the phenotypes of bulls and cows as follows: where, h 2 is the heritability of the trait; t is the repeatability of the traits; d is the number of the daughter of each bulls; r is the number of records; c is the proportion of additive genetic variance not accounted for by the SNP [20]. To compare the prediction accuracy of GBLUP, BayesR and HyB_BR for multi-breeds and across-breed, the weight calculation is included in all three models. In addition to milk production traits, fertility is another important complex trait. The DTD and TD that DataGene calculate and that was available to this study was calving interval (CI) which is the number of days between consecutive calving, For fertility, the number of bulls and cows in the reference set, i.e. with genotypes and fertility phenotypes was around 15,190. The validation set includes Holstein bulls (youngest cohort born after 2004) and Jersey bulls (youngest cohort born after 2005).
As weather becomes warmer and less predictable, there is growing interest in developing genomic breeding values for heat tolerance [21]. In Australian dairy genetics studies, heat tolerance is defined as the rate of the decline in production traits (e.g. fat, milk and protein yield) with increasing heat stress [21]. The rate of decline for each trait was estimated for each cow in the data set with a linear random regression of yield on daily temperature-humidity index (THI), when THI was above a threshold of 60 units [21][22][23]. The total number of animals with phenotypes for heat tolerance was 5657 and included Holstein and Jersey cows and bulls. The validation set for heat tolerance was a set of Holstein bulls and a set of Jersey bulls, Table 1. In contrast to the milk production and fertility phenotypes, heat tolerance is still under development and is not yet officially released as a breeding value in Australia.
The input parameters for HyB_BR were estimated from the data with ASReml4 [24] and included additive genetic variance, error variance, and additive polygenic variance ( Table 2). Using the variances, the heritability is calculated based on the "narrow-sense" definition [25] as the ratio of additive genetic variance and the sum of additive genetic variance, error variance and additive  [6]. Compared with milk production traits, heritabilities for heat tolerance traits and fertility were lower. Across all the traits, the prediction accuracy is evaluated using the correlations between genomic estimated breeding value (GEBV) and DTD in the validation sets. The regression of DTD on GEBV in the validation sets was used to investigate if any of the methods resulted in biased predictions.

Genomic prediction methods GBLUP
GBLUP assumes all marker effects follow a normal distribution with the same additive genetic variance. The overall model of GBLUP is: Where, y = vector of n phenotypes. β = vector of b fixed effects, following uninformative priors. u = vector of q random genetic effects (q = number of animals) captured by the SNP, with N 0; Gσ 2 g . G is the q x q genomic similarity matrix between pairs of individuals constructed as described by [26]; σ 2 g is the additive genetic variance. v = vector of q additive polygenic effects (q = number of animals), with veN 0; Aσ 2 a À ). A is the q × q pedigree-based relationship matrix, and σ 2 a is the additive polygenic variance. e= vector of n residual errors. For cattle data, eeN 0; Eσ 2 e À Á , the n × n diagonal matrix E is especially designed to evaluate the different contributions of the phenotype records from different sex to the error variance, de-regressing predicted breeding values and weighting information for genomic regression analyses [20]. X = n × b design matrix, allocating phenotypes y to fixed effects β. b is the number of fixed effects W= n × q design matrix, which aims at allocating the q × 1 vector of polygenic effects to y. S = n × q design matrix, allocating the q × 1 vector of genetic values to y.

BayesR
Compared with the common prior distributions of GBLUP, BayesR [7] assumes SNP effects are drawn from the mixture of four normal distributions. BayesR aims at estimating each SNP effects instead of estimating breeding values directly for each animal. Therefore, the genetic value u in the model (1) is substituted with Zg in the BayesR model. Briefly, the data model of BayesR can be written as: Where, Therefore, each SNP have four possible normal distributions: and N 0; 0:01 Ã σ 2 g ). Related to such mixture priors, there are two other parameters including b(i, k) and Pr. b(i, k) = {0, 1}, which defines whether or nor SNP i follows normal distribution k (k = 1 , 2 , 3 , 4). Therefore, the prior distribution of each SNP i conditional on b(i, k) can be written as: Pr= the vector of proportion parameter, which defines the proportion SNPs in each of four normal distributions. The prior of Pr is drawn from Dirichlet distribution Pr~Dirichlet(α), with α = [1, 1, 1, 1]. The conditional distribution of SNP effect on the proportion parameter Pr is: Z is the standardised (for mean and variance) n × m genotype matrix.
To implement the BayesR model, and arrive at posterior estimates of parameters, Gibbs sampling has been used, as described by Kemper et al. (2015) [6]. On the sequence data, we use five independent replicate chains of the Gibbs sampling, and for each independent chain, there are 40,000 iterations, with the first 20,000 iterations discarded as burn in, as described by Kemper et al. (2015) (for 630 K SNP data).

HyB_BR
The HyB_BR model [16] incorporates the same assumption for SNP effects as BayesR, but serially hybridizes the expectation-maximization (EM) and MCMC to reduce large number of iterations required by MCMC. That is, HyB_BR first implements an EM algorithm to perform the Maximum A Posterior (MAP) estimation until converged. Then, to improve accuracy, a limited number of MCMC iterations are performed to improve parameter estimates [16].
As described in Wang et al. (2016) [16], the HyB_BR model for a SNP effect is: Assumptions in the model are 1) each SNP effect g i follows the same prior assumption as BayesR with Z i being the standardized genotype for SNP i. 2) to correct the prediction errors generated by all other SNPs, HyB_BR introduces the genetic values u, whereby a correction based on the prediction error variance (PEV) is introduced to account for the effects of all the other SNP with a GBLUP model as detailed by Wang et al. [16]. Then under the model (3), the posterior distribution for all related parameter sets {g i , Pr , β , u , v, σ 2 e } are derived according to the theory: p(θ| y) ∝ f(y| θ)p(θ), where f(y| θ) is the likelihood function based on model (3) and p(θ) is the prior density function for the parameter sets θ. Based on the derived marginal posterior distribution p(θ| y), the expectation-maximization steps are implemented to estimate each parameter while "integrating out" the other parameters detailed by Wang et al. (2016). The process of the EM module is presented in pseudo code in Fig. 1.
As shown in Fig. 1, the EM module begins by initializing all the input parameters including SNP effects (g), Proportion parameter (Pr), the variance for each SNP (σ 2 i ), the fixed matrix (X), the pedigree based relationship matrix (A), the genomic relationship matrix (G), the error matrix (E), and index matrix for polygenic effects (W). Similar to emBayesR [27], the starting values of g and Pr are set as g = 0.01 and Pr = {0.5, 0.487, 0.01, 0.003}, The additive genetic variance σ 2 g , error variance σ 2 e , and polygenic variance σ 2 a are obtained from ASReml. Later, HyB_BR fixes the value of the additive genetic variance and additive polygenic variance (not updating them in later MCMC and EM iterations). The n×3 matrix X is a design matrix, allocating the phenotypes to fixed effects. In our case, matrix X is set up with first column being the mean, the second and third columns defining the breeds (Holstein and Jersey) and sex (bulls and cows) of the cattle. The pedigree relationship matrix, A, is built using the lower symmetrical matrix Ped detailed by Henderson [28]; while the genomic relationship matrix G is constructed using . The diagonal error matrix E is constructed according to the equation defined by Garrick et al. [20] and described above for the phenotypes used in this study.
The EM steps require the time complexity O(mn). For the calculation of tr E −1 Z i Z ′ i E −1 PEV u e ð Þ À Á which is calculated prior to the EM steps, the required time is O(m 2 n). This calculation accounts for 40% of the total computational time of EM module. Since the calculation is independent for each SNP, we parallelize the operations by chromosomes, which reduce total running time by approximately 30%.
Once the EM has converged using the criterion  [16] suggested a speed-up scheme to improve computational efficiency. The scheme is as follows. After 500 MCMC iterations, the SNPs with high probability in the distribution with zero variance will be excluded from the model. In other words, when P(i, 1) is greater than 0.90, the SNP effects will be set as zero.
Previous investigation showed that 4000 MCMC iterations were required by HyB_BR for both 600 K SNP panel and imputed sequence data to maximize accuracy of genomic prediction across all the traits [16].
To compare the computational cost between BayesR and HyB_BR and how this changed with an increasing number of individuals in the reference set, we divided the data ( In all three reference sets, the speed advantage of HyB_BR compared with BayesR was investigated. Then the accuracy of genomic prediction from BayesR, HyB_BR and GBLUP was compared in the full data (including the sequence variants).
In addition, the precision of mapping QTL from the three methods was compared.

Computational time comparison between GBLUP, BayesR and HyB_BR
For both 600 K and SEQ data sets, HyB_BR was more than 10 times faster than BayesR, Fig. 2. As the size of the data set increased (from Ref 1 to Ref3 or from 600 K to SEQ data), the computational time required for HyB_BR could be reduced by a greater and greater margin relative to BayesR. On 600 K data, HyB_BR had a similar compute time to GBLUP. For the SEQ data, HyB_BR was up to four fold faster than GBLUP.
These timings were recorded on a server with Intel E5-2680 2.7GHz processors and 384GB of 1333 MHz RAM.
Accuracy of genomic prediction for GBLUP, BayesR, and hybrid with sequence data Prediction accuracy for milk production traits and fertility For the milk production and fertility traits, the combined Holstein and Jersey reference sets were used to predict three validation sets including Holstein bulls   The bulls and cows from two breeds of Holstein and Jersey are used as the reference set to predict Holstein bulls and Jersey bulls separately. a The prediction accuracy when adding the polygenic term in the model; while b is the prediction accuracy when leaving out the polygenic term from the model (Table 3), Jersey bulls (Table 3), and Australian Red bulls & cows (Table 4). When predicting the Holstein validation bull data, BayesR and HyB_BR performed equally well. Compared with GBLUP, BayesR and HyB_BR had a small but consistent accuracy improvement for the milk production traits except protein%. For fat% trait, BayesR and HyB_BR gave a 5% improvement in accuracy compared with GBLUP. However, for protein% and fertility there was no difference between the methods. With the Jersey validation set, the accuracy superiority of HyB_BR and BayesR over GBLUP was greater; for example for fat percent, BayesR and HyB_BR gave a 10% higher accuracy than GBLUP. HyB_BR and BayesR also gave regression coefficients (DTD on GEBV) closer to one than GBLUP for most traits.
In addition, when incorporating the polygenic effects into the prediction model, a small but consistent accuracy improvement was observed for milk production traits, Table 3. However, for fertility, including the polygenic effects did not affect the prediction accuracy at all.
When predicting Australian red bulls and cows using the combined Holstein and Jersey reference set (across breed prediction), both HyB_BR and BayesR had a considerable accuracy advantage (up to 12% increase) over GBLUP for all the traits (Table 4). Compared with BayesR, HyB_BR performed equally, or better, in terms of accuracy for all traits except fat yield.

Accuracy of genomic prediction for heat tolerance
The accuracy of genomic prediction for the heat tolerance traits was similar for GBLUP, BayesR, and HyB_BR, Table 5. There were two exceptions when predicting the validation set of Jersey bulls: 1) for the fat yield trait associated heat tolerance, there was a 6% accuracy reduction for BayesR and HyB_BR in comparison with GBLUP; 2) For milk yield, a 9% increase in accuracy from BayesR and HyB_BR over that of GBLUP was observed. Given the small size of the validation populations, these differences were not statistically significantly different. HyB_BR and BayesR did give regression coefficients closer to one compared with GBLUP for all the traits.
Compared with 600 K SNP panels, the impact of sequence data (SEQ) on the prediction accuracy of GBLUP, BayesR, and HyB_BR was dependent on the trait and validation population (Fig. 3). For the prediction of the validation sets of Holstein or Jersey bulls (which were closely related to the reference set), only a very small accuracy gain (1%~2%) was observed from using sequence data compared to using the 600 K panel. However, when the validation set comprised of Australian Red bulls and cows, there was greater advantage of using the sequence data, provided BayesR or HyB_BR was used. For example, the accuracy using BayesR and HyB_BR with the sequence data was up to 13% higher than when the 600 K SNP panel was used. When using sequence data, GBLUP gave only a very limited increase (or even a reduction for Fat Yield trait).

Inference of genetic architecture
To compare the genetic architecture of the traits using whole genome sequence data, the number of SNPs in each of four distributions (with the variance 0 Ã σ 2 g , 0:0001 Ã σ 2 g , 0:001 Ã σ 2 g , or 0:01 Ã σ 2 g ) was investigated ( Table 6). Across all the traits, BayesR and HyB_BR gave a similar proportion of SNP in the distribution with the largest variance 0:01σ 2 g . However, there was a difference in the proportion of SNPs in each of the four distributions, in that is HyB_BR systematically estimated more variants in the distributions with non-zero variances than BayesR.

QTL mapping for fertility
For fertility, a putative candidate gene located on Chromosome 18 including (around genes CTU1 and CEACAM18) was detected by BayesR and HyB_BR. These genes haveb previously been reported to be associated with calving traits [39,40].

QTL mapping for heat tolerance traits
As there is a significant unfavourable correlation between milk production and heat tolerance, at least for the traits we have used for heat tolerance (decline in milk production with increasing heat stress) [21], mutations that affect milk production are also likely to affect heat tolerance. To avoid detecting just QTL with large effects on milk production, QTL mapping for heat tolerance traits was performed fitting fixed effects of the mutations in DGAT1, ROBO1, PAEP, and MGST1 (the mutations with largest effects on milk production, to ensure these mutations were not picked up again in the heat tolerance mapping) in the BayesR and HyB_BR models. The posterior possibilities of all the variants estimated by HyB_BR and BayesR were plotted across the whole genome sequence in Figs. 9, 10, and 11. Compared with BayesR, HyB_BR systematically detected more SNPs with small effects (0:001 Ã σ 2 g ) while identifying fewer SNP with zero effects.
In total, we found fourteen novel variants (Table 8) in our study which have previously been associated with heat tolerance in humans or other species. YBEY [41,42], located at BTA1 with the position 147,710,807 bp, has been reported to be important in the response of infection of Escherichia coli of human or other animals under heatshock response. Variants in SERPINE2 and CACNA1D (close to the variants detected in our study, BTA2:112,901,035 and BTA22:47,737,890 respectively) have been reported to impact the sweating rate and respiration rate of dairy cattle [43]. DYRK3 (The dual specificity tyrosine-phosphorylation-regulated kinase 3), has been reported to affect respiration rate (breaths per minute) in dairy cattle [43]. HSF1, heat shock factor protein 1, coordinates stress-induced transcription in Human [44]. One single nucleotide polymorphism (SNP) in the 3′-untranslated region (g.4693G > T) of HSF1 has been reported to be in association with thermo tolerance in Chinese Holstein cattle [45]. STIP1, stress inducible protein 1, has been reported to be homologous to hsc70/ hsp90 in human [46]. In mice, STIP1 could play a key role on in the ability of germ cells to survive in stress conditions including high temperatures [47]. Further investigation of the effect of these genes on heat tolerance is required.

Discussion
In this paper, we have demonstrated that HyB_BR [16] could be efficiently implemented for simultaneous prediction of genomic estimated breeding values, inference     Table 7 Known genes (impacting milk production traits and fertility) identified by HyB_BR using the variants with the largest variances 0:01 Ã σ 2 g [49][50][51][52][53][54][55] The blue bar highlights the genes that were not detected by BayesR in the proportion with the largest variances of genetic architecture, and potential causal mutation discovery using whole-genome sequence data. As mentioned by Wang et al. (2016), HyB_BR was developed to overcome two challenges: 1) Long compute times are the main limitation of traditional MCMC Bayesian models applied to whole genome sequence data with very large data size. Therefore, an Expectation-Maximisation scheme was introduced to reduce number of iterations of MCMC.

2) Fast schemes (mainly including Iterative Conditional
Expectation, and Expectation-Maximisation algorithms) implemented for Bayesian models have tended to reduce the accuracy compared with MCMC.
HyB_BR implements an EM algorithm to quickly converge for estimates of SNP effects and other parameters, followed by a limited number of MCMC iterations to optimise the posterior estimation for SNP effects. When applied to whole genome sequence data, our results indicated HyB_BR had similar accuracy of genomic prediction and precision of QTL mapping to BayesR implemented with full MCMC, but with 10 fold less computational time required. Furthermore, compared with the prediction accuracy on 600 K SNP panels, we have demonstrated that using sequence data improved the accuracy of genomic prediction for some of the traits, and particularly in multi-breed evaluations, if a breed was not included in the reference population.
The key improvement for computational efficiency was that HyB_BR reduced the iteration times. BayesR required a huge number of MCMC iterations, which was dependent on the size of the data. For example, on the whole genome sequence data with 16,214 animals and almost 1 million variants, 40,000 iterations with first 20,000 as burn-in were required. For each MCMC iteration, the basis operation times were O(mn 2 ). In comparison with BayesR, HyB_BR has the same number of basic operations. But after the EM converges (with very small number of iterations as demonstrated by Wang et al. (2015) [27]), HyB_BR implemented MCMC iterations with speed-up schemes, which could reduce All the listed genes are identified by HyB_BR using the variants with the largest variances 0:01 Ã σ 2 g the iteration number to 4000 iterations. The results from Fig. 2 provided the evidence that HyB_BR was up to 10 times faster than BayesR in the whole genome sequence data set. In addition to the computational time, the prediction accuracy of HyB_BR for multi-breed prediction and across-breed prediction was very similar to BayesR for a range of traits with various genetic architectures, shown in Tables 3, 4 and 5. The accuracy advantage of HyB_BR and BayesR over GBLUP for across-breed prediction demonstrated the benefit of the non-linear Bayesian models. Also, the increase in accuracy using whole genome sequence data for across-breed prediction in comparison with using 600 K data, confirmed the results from [5].
For the genetic architecture identification of milk production traits, there was one notable difference between BayesR and HyB_BR: In comparison with BayesR, HyB_BR does not shrink variants with small effects ( 0:001 Ã σ 2 g ) as strongly, the same is true for very small effects ( 0:0001 Ã σ 2 g ), ( Table 6). The same is true for the identification of causal mutations for heat tolerance, Figs. 9, 10 and 11. One explanation is that EM steps do not have enough power to shrink SNPs with small effects [27], which limits the following MCMC steps.
For the heat tolerance traits, there is relatively little literature reporting QTL for heat tolerance in cattle. Only one of the additive genetic variants (located at Chromosome 29 with the position 48,329,079 base pairs; close to FGF4) [48], later suggested to be SHANK2 by [43] has previously been reported. In Table 7, the gene SHANK2 was detected but not in the list of top causal mutations. However, both BayesR and HyB_BR did pick up mutations in or close to seven genes (e.g. YEBY, HSF1, MED17, ME3, STIP1, SERPINE2 and CAC-NA1D), which have been reported by previous studies to be involved in response to heat stress events in cattle (e.g. [45]), human, mice, or other species. In addition, HyB_BR also detected two other unknown variants. All these variants required the further investigation in regards to their function interacting between milk productions and heat tolerance.
The computational advantage of HyB_BR makes it attractive for implementation of genomic prediction in many applications. However, there are still two limitations: 1) the speed-up scheme of HyB_BR defines the fixed threshold for different traits and various densities of genomic data, which could hinder its flexibility for practical applications; 2) when the size of the data increases dramatically to 30 million variants on millions of animals, which is possible in the near future, HyB_BR is still not computationally efficient enough. Therefore, a flexible and more efficient speed-up scheme will play an important role to further improve the computational performance of HyB_BR.

Conclusion
A hybrid scheme of Expectation-Maximisation algorithm and MCMC sampling was implemented on wholegenome sequence data for simultaneous genomic prediction, inference of genetic architecture inference and causal mutation identification. The accuracy of HyB_BR for multi-breed and across breed prediction for all traits was very similar to the results from BayesR (implemented with full MCMC) while requiring only 1/10 of the total running time of BayesR. HyB_BR could identify some well-known mutations (e.g. DGAT1) with the highest posterior probability, which demonstrated the value of the method for QTL mapping of complex traits. The advantage of using sequence data and HyB_BR was greatest for multi-breed and across breed predictions.