The most complete model used describes phenotypes y (BW, FI, or FE) by an intercept μ, environmental effects of batch and sex b, a polygenic effect based on pedigree u, regressions on SNP covariates a, regressions on gene expression covariates g, and a model residual e, as:
(1)
where X is the design matrix for batch and sex effects, Z is a design matrix that links polygenic effects to the observed records, W is a matrix with 1806 SNP covariates, and Q is a matrix with 23,698 gene expression covariates. The SNP and gene expression covariates were centered and scaled to unit variance.
Based on work of[19–22], the Bayesian mixed model version assigns normal prior to the vectors u, a, g, and e in (1), i.e.,, where is the polygenic variance and A is the numerator relationship matrix based on pedigree information, is the per-SNP explained variance, is the per-gene expression explained variance, and is the residual or environmental variance. These four variances are estimated in the model using flat prior distributions, i.e.,. The remaining parameters in (1), μ and b, are assigned flat prior distributions, which is the Bayesian analog of fitting “fixed effects” (unshrunken) estimates. A Markov chain Monte Carlo (MCMC) algorithm was applied in the software bayz[23] to obtain samples from the posterior distribution of the model parameters. MCMC algorithms for sampling effects and variances in mixed models have been extensively described, for a general overview see[24]. The Monte Carlo accuracy of the MCMC algorithm was evaluated by correlating repeated estimates for the parameter vectors u, a and g, requiring a correlation >0.999 from repeated MCMC runs, and by computing the effective sample sizes for the variance components using the R Coda package[25].
The explained variance in y from (1) is var(Zu) + var(Wa) + var(Qg) + var(e). To obtain posterior means (PMs) and posterior standard deviations (PSDs) on the explained variances for SNPs and gene expressions, var(Wa) and var(Qg) were evaluated based on the posterior samples for a and g from the MCMC, i.e., as the PM and PSD of var(Wat) values over MCMC cycles, where at is the posterior sample for a from MCMC cycle t. This procedure is not required for the polygenic variance, because Z is a design matrix, unlike W and Q, which are covariate matrices.
The second model used was a Bayesian variable selection model, where the approach of George and McCulloch[26] was followed to fit mixture distributions with small and large variances as the prior distribution for regression coefficients. In model (1), such a mixture prior was applied to SNPs as well as gene expression regression coefficients, with independent parameters and mixture indicators for SNPs and for gene expressions. The basic model of George and McCulloch[26] was further extended to incorporate the variances in the mixture distribution as unknown model parameters, which allows the model to learn the relative importance of SNPs and gene expressions from the data. This variable selection model thus takes the prior distributions for a and g as follows:
(2)
(3)
where and are the “large” and “small” variances in the mixture distribution for a, and are the “large” and “small” variances in the mixture distribution for g, and and are vectors of 0/1 indicator variables for a and g, respectively, indicating whether the i th element in a or g, respectively, comes from the distribution with large or small variance. The variances were all estimated from the data using unbounded flat prior distributions. The constraints and were applied using a rejection sampler, so that “large” and “small” effects remained identifiable. The priors for the indicator variables were taken as and, wheremeans a Bernoulli distribution for a 0/1 indicator with a probability π for a 1. The parameterswere taken as known. The MCMC implementation of this model is relatively straightforward, because conditional on the indicator variables the model remains a mixed model. The updating of the mixture indicators is described in[26]. This model is also run in the software bayz[23], and the Monte Carlo accuracy was evaluated in the same way as the mixed model version.
From the posterior samples for a and g in the variable selection model, explained variances were computed and partitioned by chromosome and by genome section. The variable selection model is more suited to make such a partitioning, because unlike the mixed model version, it allows for different variance contributions per SNP. The explained variances were evaluated in the same way as for the mixed model, by evaluating var(Wat) and var(Qgt) over MCMC cycles t, except that the a and g samples are obtained under the mixture model prior assumptions. The same expressions can be straightforwardly evaluated for parts of the SNPs or gene expressions to obtain explained variances per chromosome and for small windows of SNPs within chromosomes. Variance within a chromosome was computed using a 5-SNP sliding window to obtain a genomic variance profile.
It is difficult to choose an optimal windows size as it depends on extend of LD, marker density and an arbitrary cut-off for what is considered important LD. In the data analyzed here, average R2 between adjacent SNPs was 0.55, and average R2 between SNPs two apart was 0.39, which we considered sufficiently high to warrant computation of variances in a 5-SNP window. To study the relative importance of family structure, SNPs, and gene expressions, six sub models and the complete model (1) were used. These were models that use only pedigree information (PED), only SNP data (SNP), only gene expression data (GEX), SNP + GEX, PED + GEX, PED + SNP, and the complete model PED + SNP + GEX. These models always included sex and batch effects.
The predictive ability of the models was evaluated using an 11-fold cross-validation. For body weight, 440 records were divided randomly in 11 groups, each with 40 individuals. Feed intake and feed efficiency, with 337 records in total, were randomly divided in 10 groups of 30 records and one group of 37 records. The complete model, including all variance parameters, was re-estimated on each set of 10 folds and predictions were computed for the phenotypes in the remaining 11th fold. All predictions from the 11-fold cross validation were collected to compute correlations between predicted and actual phenotypes, and regressions of predicted phenotypes on actual phenotypes, using the whole data set. The slope of the regression lines of predicted phenotypes on actual phenotypes are expected to be 1 if the model produces unbiased predictions, which would validate the estimates of explained variances. The University of Nebraska Institutional Animal Care and Use Committee approved all procedures and protocols.