Bias in estimates of variance components in populations undergoing genomic selection: a simulation study

Background After the extensive implementation of genomic selection (GS), the choice of the statistical model and data used to estimate variance components (VCs) remains unclear. A primary concern is that VCs estimated from a traditional pedigree-based animal model (P-AM) will be biased due to ignoring the impact of GS. The objectives of this study were to examine the effects of GS on estimates of VC in the analysis of different sets of phenotypes and to investigate VC estimation using different methods. Data were simulated to resemble the Danish Jersey population. The simulation included three phases: (1) a historical phase; (2) 20 years of conventional breeding; and (3) 15 years of GS. The three scenarios based on different sets of phenotypes for VC estimation were as follows: (1) Pheno1: phenotypes from only the conventional phase (1–20 years); (2) Pheno1 + 2: phenotypes from both the conventional phase and GS phase (1–35 years); (3) Pheno2: phenotypes from only the GS phase (21–35 years). Single-step genomic BLUP (ssGBLUP), a single-step Bayesian regression model (ssBR), and P-AM were applied. Two base populations were defined: the first was the founder population referred to by the pedigree-based relationship (P-base); the second was the base population referred to by the current genotyped population (G-base). Results In general, both the ssGBLUP and ssBR models with all the phenotypic and genotypic information (Pheno1 + 2) yielded biased estimates of additive genetic variance compared to the P-base model. When the phenotypes from the conventional breeding phase were excluded (Pheno2), P-AM led to underestimation of the genetic variance of P-base. Compared to the VCs of G-base, when phenotypes from the conventional breeding phase (Pheno2) were ignored, the ssBR model yielded unbiased estimates of the total genetic variance and marker-based genetic variance, whereas the residual variance was overestimated. Conclusions The results show that neither of the single-step models (ssGBLUP and ssBR) can precisely estimate the VCs for populations undergoing GS. Overall, the best solution for obtaining unbiased estimates of VCs is to use P-AM with phenotypes from the conventional phase or phenotypes from both the conventional and GS phases.


Background
In animal breeding, the prediction of breeding values requires accurate and unbiased estimates of variance components (VCs). There are several approaches available for VC estimation depending on the practical conditions (for instance, population structure) [1]. Among these methods, restricted maximum likelihood (REML) [2] is a well-known method that has been widely applied under animal models to obtain genetic parameters for traits of interest [3][4][5][6][7][8]. Alternatively, Bayesian methods with Markov chain Monte-Carlo (MCMC) procedures such as Gibbs sampling can be adopted to obtain marginal inferences for VCs [9][10][11][12].
Before genomic information was available, Sorensen and Kennedy [13] showed that unbiased estimation of VCs in a base population could be obtained by an animal model when all data leading to selection decisions and complete relationships were incorporated, while ignoring data from selected ancestors led to biased estimation due to not accounting for gametic disequilibrium. Moreover, in small populations with few generations, an animal model with pedigree information can account for bias due to inbreeding and the Bulmer effect [14]. However, these results were obtained assuming the infinitesimal model. To date, although there are currently several genomic prediction models available, the choice of statistical model and which data should be used to estimate VCs in the genomics era remain unclear.
The use of genomic selection (GS) has greatly enhanced genetic gains due to improved selection accuracy and shortened generation intervals (e.g., dairy cattle breeding); this has led to faster changes in gene frequencies compared to traditional phenotype and pedigreebased genetic improvement programmes. In such situations, the extra selection acting on genomic information increases the challenges of obtaining unbiased VC estimates. Concerns have been raised about biased VC estimates caused by ignoring genomic information and selective genotyping [15]. Hence, it can be expected that after performing GS for several generations, the estimates of VC from a pedigree-based animal model (P-AM) would be biased due to not accounting for the impact of GS.
Single-step genomic BLUP (ssGBLUP) derived by [16][17][18] has provided a way to predict breeding values using phenotype, pedigree and genomic information simultaneously for both genotyped and non-genotyped individuals via a combined relationship matrix (H). Thus, in contrast to typical genomic BLUP (GBLUP) [19,20] using phenotypes only from genotyped individuals, a large number of historical phenotypes from non-genotyped individuals can also be used for analyses. Alternatively, a single-step Bayesian regression model (ssBR) was proposed by [21,22]. In contrast to ssGBLUP, in which markers for non-genotyped individuals are implicitly imputed, ssBR requires the explicit imputation of the markers for nongenotyped individuals, followed by fitting of the marker effects in the model. Previous studies have reported that ssGBLUP and the original ssBR model perform equally in terms of accuracy of prediction when assuming that VCs are known [23,24]. However, an attractive feature of ssBR that has been neglected is that ssBR offers separate estimation of marker variance (σ 2 α ) and total genetic variance ( σ 2 g , denoted as σ 2 ε in ssBR). Therefore, ssBR could be employed as a variance component model for VC estimation that includes two additive genetic components.
The compatibility between pedigree-based relationships and marker-based relationships is a crucial factor for ensuring unbiased estimation when using single-step methods [19,25,26]. This concern is due to the different base populations to which these two relationships refer. More specifically, the pedigree-based relationship refers to a founder population in which individuals are assumed to be unrelated (denoted as P-base hereafter). The base population of marker-based relationships can be defined as the population from which the allele frequencies were obtained to compute the relationships (denoted as G-base hereafter) [27,28]. Therefore, the VCs estimated using single-step methods with phenotypes from a population undergoing GS may not be directly comparable with the VCs estimated using pedigree-based methods [29]. The objectives of this study were (1) to examine the effects of genomic selection on estimates of VC based on the scenario of choosing phenotypes from different phases of breeding programme and (2) to investigate the estimation of VCs using different methods in a population undergoing GS.

Results
To determine the effects of GS on the estimation of VCs, we analysed data from three scenarios reflecting no application of GS (Pheno 1 ), application of GS (Pheno 1 + 2 ), and a lack of information from the previous conventional breeding scheme (Pheno 2 ). Table 2 presents the means and standard deviations (SDs) of the estimated VCs and heritabilities over replicates. In general, the use of single-step methods with all the phenotypes and genotypes (Pheno 1 + 2 ) yielded biased estimates of the total genetic variance of P-base. Using P-AM with all the phenotypes (Pheno 1 + 2 ) produced unbiased VC estimates and heritability of P-base. When the phenotypes from the conventional breeding phase were excluded (Pheno 2 ), P-AM led to the underestimation of genetic variance (P < 0.01) of P-base. As expected, we obtained unbiased VC estimates and heritability when using P-AM with phenotypes only from the conventional phase (Pheno 1 ).
For ssGBLUP, in general, the estimates of genetic variance were all biased (P < 0.005) across all the scenarios ( Table 2). In contrast to the VCs from P-base, when using phenotypes only from the conventional phase (Pheno 1 ), the genetic variance was significantly underestimated, and the residual variance was significantly overestimated. Furthermore, when including phenotypes from the GS phase in the model (Pheno 1 + 2 ), the genetic variance was significantly underestimated, although unbiased estimated residual variance was observed. In contrast to the VCs from G-base, when ignoring data from the conventional breeding phase (Pheno 2 ), ssGBLUP significantly overestimated genetic variance, although an unbiased estimate of residual variance was obtained.
For ssBR, the convergence of the Gibbs sampler was assessed by estimating Monte Carlo error (MCE) (via batch means). The MCEs for the estimates of the total genetic variance and residual variance were at the level of 10 − 3 , and for the estimates of marker variances, they were at the level of 10 − 6 . Two estimated genetic variances are reported ( Table 2). In contrast to VCs from Pbase, when using phenotypes only from the conventional phase (Pheno 1 ), unbiased total genetic variance and residual variance were obtained. This led to an unbiased estimate of heritability. However, the marker-based genetic variance was significantly overestimated (P < 0.001) compared to the variance in G-base, resulting in a biased estimate of heritability. Conversely, when including phenotypes from the GS phase in the model (Pheno 1 + 2 ), the marker-based genetic variance was unbiased, but the total genetic variance was significantly underestimated (P < 0.001). In contrast to the VCs from G-base, when ignoring data from the previous conventional breeding phase (Pheno 2 ), ssBR yielded unbiased estimates of the total genetic variance and marker-based genetic variance, whereas the residual variance was overestimated.

Discussion
In this study, we addressed the effects of GS on estimates of VCs. Two single-step methods (ssGBLUP and ssBR), together with the traditional pedigree-based animal model (P-AM), were used and compared based on simulated datasets that mimic dairy cattle populations. The first question addressed in this study was aimed at determining the impact of the choice of phenotypes from different phases of a breeding program on the estimation of VCs. The results showed that both the ssGBLUP and ssBR models led to biased VC estimates across all scenarios.

Selection of data to be included in the estimation of genetic variance
We showed that P-AM yielded unbiased estimates of VC when including the phenotypes from the GS phase (Pheno 1 + 2 ). We also demonstrated that when using phenotypes only from the conventional selection phase (Pheno 1 ), P-AM produced unbiased VC estimates. The scenarios of Pheno 2 , which reflects current breeding programmes using GS, led to biased estimates of VC by using P-AM. This was mainly caused by ignoring the information from the selection decisions. Another possible cause of this bias might be explained by the negative LD between QTLs across the genome [30,31]. In particular, this negative LD is stronger in a population selected according to GS information than in a population selected based on only the pedigree relationships [the mean (SD) variance of TBV for animals from generation 35 decreased to 2.46 (0.04)].
Reductions in genotyping costs have contributed to the comprehensive implementation of GS in many genetic selection programmes. This offers the opportunity to select for new traits that are difficult to measure or not yet among the breeding goals. In this situation, a new phenotyping strategy might be used, or new phenotypes may be more likely to be collected from populations under intense GS. In the present study, based on the scenario of Pheno 2 with P-AM, our results confirmed the reduction of genetic variance due to ignoring the information from the conventional breeding phase; i.e., there are no phenotypes to account for the selection conducted in the previous period, and previous selection cannot be properly handled in the current model. Therefore, the use of P-AM including only the phenotypes from the GS phase (Pheno 2 ) resulted in biased estimates of VC.

Changed base population
With respect to the genomic relationship, an arbitrary base population may exist close to the current population undergoing GS. It can be expected that using such a base population (more recent) would lead to smaller estimates of genetic variance than using P-base. Therefore, in this study, the G-base corresponding to the current GS phase was defined separately from P-base. Consequently, it may be improper to compare VCs estimated in the GS phase (e.g., Pheno2) with the VCs in P-base, which is generally referred to by the pedigree. This concern was initially raised by Powell et al. [25]; that is, VCs estimated using genetic markers can be erroneous since an inconsistent base population is defined compared to the base population with common founders. A similar study was conducted by Veerkamp et al. [32] based on a small dataset from Holsteins, where VCs for milk yield, dry matter intake, and body weight were estimated with P-AM, GBLUP, and ssGBLUP. Their results showed that ssGBLUP produced the most precise estimates of VCs; however, they questioned whether the VCs obtained using genomic information might not be comparable with the VCs obtained using only pedigree information since these relationship matrices might refer to different base populations. Our results confirmed that when phenotypes only from the GS phase (Pheno 2 ) were used, both ssGBLUP and ssBR yielded smaller genetic variance estimates compared with the VCs in P-base (Table 2).
In the present simulation study, we directly used the allele frequencies calculated from P-base to avoid the compatibility issue between the G and A 22 matrices in ssGBLUP [26,27,33]. In practice, the allele frequencies of P-base are unknown, but they can be estimated using the approach proposed by Gengler et al. [34] by regressing the gene contents of ancestors on the genotypes of the progenies.

Two estimates of genetic variance in ssBR
As originally introduced by Fernando et al. [21,22], the ssBR model is essentially a marker effect model with all markers fitted in the model. Consequently, this feature results in a model with two estimates of additive genetic variances, i.e., the total genetic variance (σ 2 ε ) and the markerbased genetic variance, which can be obtained by multiplying P m j¼1 2p j ð1−p j Þ by the estimated marker variance ( σ 2 α ). When using phenotypes only from the conventional phase (Pheno 1 ), the estimated total genetic variance was unbiased; however, the marker-based genetic variance was biased upwards, although the allele frequencies from Pbase were used. This result can be explained by the fact that a small proportion (1.2%) of animals (only progenytested bulls) in the pedigree were genotyped, resulting in poor imputation for non-genotyped animals and biased estimation of marker variance.
Apart from the estimated marker variance, the total genetic variance shows a relationship with the conditional variance of the breeding values of non-genotyped individuals (g 1 ) given the breeding values of genotyped individuals (g 2 ), i.e., Varðg 1 jg 2 Þ ¼ ðA 11 . With the assumption of multivariate normality, the markers of non-genotyped individuals were imputed via a pedigree-based linear system. A residual vector (ε) accounting for the remaining portion of the breeding values that could not be modelled by the imputed markers was added to the marker-based breeding values to obtain the final g 1 . The accuracy of this imputation quality depends on the genetic relationships between genotyped and non-genotyped animals; i.e., it is expected that less precise imputation will be achieved for old ancestors than for younger ones. More specifically, the imputation is based on a linear relationship between genotyped and non-genotyped individuals, which may not approximate the distribution of marker genotypes very well [21]. Thus, better methods (e.g., an imputation method based on a peeling algorithm [35]) for imputing the genotypes of non-genotyped individuals conditional on the genotyped individuals would be expected to yield more accurate results. In addition, as pointed out by [21], in the single-step method, we do not observe g 2 , but M 2 ; this indicates that the conditioning is on the observed marker information, and the conditional genetic variance estimated in ssBR is therefore actually only an approximation of the genetic variance.

Conclusions
This study contributes to a better understanding of the effects of GS on VC estimation. The results show that neither of the single-step models (ssGBLUP and ssBR) can precisely estimate the VCs for populations undergoing GS. Furthermore, this study has demonstrated that when the complete data are analysed with both pre-GS data and data from the GS phase, the classic P-AM can yield unbiased estimates of VC. Therefore, an implication of these findings is that the best solution for obtaining unbiased estimates of VC is to use P-AM with phenotypes from the conventional phase or phenotypes from both conventional and GS phases.

Simulation of data
Populations that were similar to the Danish Jersey dairy cattle population in terms of the breeding scheme and population structure were simulated over a 35-year period with 5 replicates for each scenario. The simulation was conducted with the following three phases: (1) The historical phase, covering 3000 non-overlapping generations, was run to generate an initial linkage disequilibrium (LD) structure. The simulated genome consisted of 30 chromosomes of 100 centiMorgans (cM) each with 100 QTLs and 10,000 biallelic SNP markers. The QTLs and markers were uniformly distributed within each chromosome. This resulted in a total of 3000 QTLs and 300,000 markers across the whole genome. The offspring inherited alleles at these loci from their parents following Mendel's rules allowing for mutation (assumed only to happen in the historical phase) and recombination. A recurrent mutation rate of 2.5 × 10 − 5 for both markers and QTLs was set to establish mutation-drift equilibrium in the historical generations. The number of recombination per chromosome (per Morgan) was sampled from a Poisson distribution with a mean equal to the length of the chromosome, and crossovers were uniformly located along the chromosome. This part of the simulation was implemented with QMSim software [36]. Generation 3000 was used as the base population, in which 40,000 SNPs were randomly chosen from the pool of 300,000 markers, and 2000 QTLs were randomly chosen from the pool of 3000 QTLs. (2) In the next phase, 20 years of conventional breeding were simulated. Each year, 50 young bulls were selected based on their parent average (PA) and the progeny tested, and 10,000 cows in different age groups were maintained. Only cows in the first lactation, however, were assumed to have phenotypes. At the end of phase (2), all proven bulls were genotyped. (3) In the last phase, 15 years of genomic selection were simulated.
Each year, 500 bulls and 2000 heifers were selected for genotyping at one year of age based on their PA genomic estimated breeding values (GEBV) computed by ssGBLUP. After genomic evaluation, 50 of these 500 bulls were selected for breeding. The simulations of phase (2) and (3) were performed with ADAM software [37]. For the herd-year-season (HYS) effect (contemporary group effect), individuals were allocated to 100 herds, 35 years and 4 seasons.

Scenarios
Three scenarios based on the use of phenotypic information from different phases of breeding programs for the estimation of VCs were explored: (1) Pheno 1 : phenotypes only from conventional phase (1-20 years) were used; (2) Pheno 1 + 2 : phenotypes from both the conventional phase and genomic selection phase (1-35 years) were used; (3) Pheno 2 : phenotypes from only the genomic selection phase were used (21-35 years). Within each scenario, the three single-trait models (P-AM, ssGBLUP, and ssBR) were applied. An overview of the subsets used and the average number of individuals in the pedigree, phenotypes, and genotypes for each scenario over 5 replicates are shown in Table 1. For validation, VCs and heritabilities from two different base populations (P-base and G-base) are presented in Table 2. The genetic variance of P-base was calculated from the variance of "true" breeding values (TBVs) based on animals from the founder population, while the genetic variance of G-base was calculated from the variance of TBVs based on animals from years 18, 19, and 20, i.e., the last three years before the start of GS (animals in G-base were related). Hence, when genomic information was included and phenotypes were only collected from GS phase (Pheno 2 ), the VC estimates needed to be compared with the VCs from G-base.

Statistical models for VC estimation P-AM
The classical animal model [38] using pedigree-based relationships can be written as follows: where y represents the vector of phenotypes; β is a vector of fixed effects, i.e., herd-year-season effects: a is a vector of additive genetic effects following Nð0; Aσ 2 g Þ , where A is the numerator relationship matrix, and σ 2 g is the additive genetic variance; X and Z are design matrices relating phenotypes to fixed effects and random animal effects, respectively; and e is a vector of residuals following Nð0; Iσ 2 e Þ, where σ 2 e is the residual variance. ssGBLUP The model equation of the regular ssGBLUP model [16,17] was the same as model (1) but used an H matrix that combines the marker-based (G) and pedigree-based (A) relationship matrices to replace the numerator relationship matrix (A) in the classical animal model. Therefore, the vector of GEBVs is assumed to be distributed as Nð0 ; Hσ 2 g Þ. The mixed model equations (MME) require the inverse of the H matrix [18]: where A −1 22 is the inverse of the pedigree-based relationship matrix for the genotyped individuals, and G is constructed according to [19]: where Z is a centred marker covariate matrix containing 0 − 2p j , 1 − 2p j , and 2 − 2p j for genotypes of AA, AB, and BB, respectively; p j is the allele frequency at locus j, and m is the total number of markers. For the scenarios of Pheno 1 and Pheno 1 + 2 , the allele frequencies were estimated from the animals in P-base, whereas for the scenario of Pheno 2 , the allele frequencies were estimated from the animals in G-base. The variance components of P-AM and ssGBLUP were estimated with REML [2] using the average information algorithm (AIREML) [8] as implemented in the DMU package [39].
ssBR An ssBR model [21] based on BayesC [40,41] with the assumption of all markers having non-zero effects (i.e., π = 0) and a common variance for all markers can be specified as follows: where subscript 1 denotes non-genotyped individuals, and subscript 2 denotes genotyped individuals. Thus, y represents the vector of phenotypes; β is a vector including elements of herd-year-season effects; X and Z are design matrices; M 2 contains observed marker covariates for genotyped individuals; M 1 contains imputed marker covariates for non-genotyped individuals; the imputation is conducted from the following linear relationship: is the variance of the marker effects under the assumption that all markers exhibit common genetic variance and can explain all additive genetic variance; σ 2 ε is the imputation residual variance; and e is a vector of residuals, e Nð0; Iσ 2 e Þ . For location parameters, a flat prior is assigned for β, and normal priors are specified for α, ε, and e, with null mean and variance of σ 2 α , σ 2 ε , and σ 2 e , respectively. For dispersion parameters, scaled inverse chi-squared distributions with scale factors, S 2 α ¼ 0:0002, S 2 ε ¼ 2, S 2 e ¼ 2:5, and degree of freedom, ν α(ε, e) = 4, are assumed to be the prior distributions for σ 2 α , σ 2 ε , and σ 2 e , respectively. The final GEBV for individuals can be written as follows: where g 1 and g 2 represent the vectors of GEBV for nongenotyped and genotyped individuals, respectively. The use of ssBR allows the inference of the additive genetic variance from two sources of information: first, the total genetic variance, approximated by the estimated imputation residual variance, σ 2 ε ; second, the marker-based genetic variance, computed as the estimated marker variance, σ 2 α , multiplied by P m j¼1 2p j ð1−p j Þ, where p j is the allele frequency at locus j and m is the total number of markers. Similar to ssGBLUP, for the Table 2 Mean (SD) of the "true" variance components and heritability in the base (founder) population (P-base) and the base population for the genomic phase a and estimates of variance components and heritabilities from P-AM, ssGBLUP, and ssBR based on three scenarios of phenotyping b a P-AM: traditional pedigree-based animal model; ssGBLUP: single-step genomic BLUP; ssBR: single-step Bayesian regression. σ 2 g is the genetic variance used in P-AM, σ 2 ε is the total genetic variance used in ssBR; σ 2 e is the residual variance; σ 2 α is the marker variance; σ 2 α P m j¼1 2p j ð1−p j Þ is used to calculate genetic variance via the estimated marker variance in ssBR, where p j is the observed allele frequency at locus j, and m is the total number of markers b Pheno 1 : phenotypes from only the conventional phase (1-20 years) were used; Pheno 1 + 2 : phenotypes from both the conventional phase and genomic selection phase (1-35 years) were used; Pheno 2 : phenotypes from only the genomic selection phase were used (21-35 years) c Heritabilities (h 2 ) from P-AM, ssGBLUP, and ssBR calculated as The significance test was performed to determine whether the estimated parameter differs from the simulated parameter. * significant at P < 0.01; ** significant at P < 0.005; *** significant at P < 0.001 scenarios of Pheno 1 and Pheno 1 + 2 , allele frequencies were calculated based on the stored genotypes in the base population (P-base). For the of Pheno 2 scenario, the allele frequencies were calculated based on the current genotyped population. In addition, an extra fixed effect (μ g ) was fitted in the model to account for the unknown expectation for genotyped individuals [21]. The ssBR program was written in Fortran 95. A Gibbs sampler was used to draw inferences for all model parameters from their posterior distributions. The length of the chain was set to 50,000, with a burn-in of 20,000 iterations. The convergence of the posterior distribution for each parameter investigated was assessed using the boa and coda packages [42,43].