 Methodology Article
 Open Access
 Published:
Robust estimation of heritability and predictive accuracy in plant breeding: evaluation using simulation and empirical data
BMC Genomics volume 21, Article number: 43 (2020)
Abstract
Background
Genomic prediction (GP) is used in animal and plant breeding to help identify the best genotypes for selection. One of the most important measures of the effectiveness and reliability of GP in plant breeding is predictive accuracy. An accurate estimate of this measure is thus central to GP. Moreover, regression models are the models of choice for analyzing field trial data in plant breeding. However, models that use the classical likelihood typically perform poorly, often resulting in biased parameter estimates, when their underlying assumptions are violated. This typically happens when data are contaminated with outliers. These biases often translate into inaccurate estimates of heritability and predictive accuracy, compromising the performance of GP. Since phenotypic data are susceptible to contamination, improving the methods for estimating heritability and predictive accuracy can enhance the performance of GP. Robust statistical methods provide an intuitively appealing and a theoretically well justified framework for overcoming some of the drawbacks of classical regression, most notably the departure from the normality assumption. We compare the performance of robust and classical approaches to two recently published methods for estimating heritability and predictive accuracy of GP using simulation of several plausible scenarios of random and block data contamination with outliers and commercial maize and rye breeding datasets.
Results
The robust approach generally performed as good as or better than the classical approach in phenotypic data analysis and in estimating the predictive accuracy of heritability and genomic prediction under both the random and block contamination scenarios. Notably, it consistently outperformed the classical approach under the random contamination scenario. Analyses of the empirical maize and rye datasets further reinforce the stability and reliability of the robust approach in the presence of outliers or missing data.
Conclusions
The proposed robust approach enhances the predictive accuracy of heritability and genomic prediction by minimizing the deleterious effects of outliers for a broad range of simulation scenarios and empirical breeding datasets. Accordingly, plant breeders should seriously consider regularly using the robust alongside the classical approach and increasing the number of replicates to three or more, to further enhance the accuracy of the robust approach.
Background
Genomic studies, whether from an association, prediction or selection perspective, constitute a field of research with increasing statistical methodological challenges given the growing complexity (population structure, coancestry, etc), dimension of datasets, measurement errors and atypical observations (outliers). Outliers often arise from atypical environments, years, field pests or other phenomena. Here, regression models are the tool of choice whether in studies involving human, animal or plant applications. However, it is well known that the performance of these models is poor when their underlying assumptions are violated and their unknown parameters are estimated by the classical likelihood [49]. For example, violation of the normality assumption – depending on its severity – may lead to both biased parameter estimates and coefficients of determination [7] and strongly interfere with variable selection [5]. In the case of the linear mixed model, such violation can tamper with the estimation of variance components [24], which itself can be very challenging even when data are normally distributed but the sample size is small. Violation of model assumptions due to contamination of data with outliers can have several other deleterious effects on regression models. In genomic association studies, for example, departure from normality can induce power loss in the detection of true associations and inflate the number of detected spurious associations [22]. In plant genomics such violations of model assumptions and the associated biases often translate into inaccurate estimates of heritability and predictive accuracy [10]. This can have significant practical consenquences because predictive accuracy is the single most important measure of the performance of genomic prediction (GP). The reduction of these adverse effects through the use of more robust methods is thus of considerable practical importance [48].
Recently, [9] proposed a method for estimating heritability and predictive accuracy simultaneously (Method 5) and compared its performance with several contending methods from the literature including a popular method in animal breeding (Method 7). More details on Methods 5 and 7 can be found in the “Genomic prediction” section. The authors concluded from these comparisons that Methods 5 and 7 consistently gave the least biased, most precise and stable estimates of predictive accuracy across all the scenarios they considered. Additionally, Method 5 gave the most accurate estimates of heritability [9]. Both methods are founded on the linear mixed effects model as well as on ridge regression best linear unbiased prediction (RRBLUP) through a twostage approach [34–36]. The first stage of this twostage approach involves phenotypic analysis and thus is likely to be adversely affected by contaminated phenotypic plot data. In particular, contamination can undermine the accuracy with which the adjusted means are estimated in the first stage and thus negatively impact estimation of both heritability (only Method 5) and predictive accuracy in the subsequent second stage where RRBLUP is used [15]. Estaghvirou et al. [10] later examined the performance of the same seven methods in the presence of one outlying observation under 10 simulated contamination scenarios. These simulations reaffirmed that Methods 5 and 7 performed the best overall and produced the best estimates of both heritability (only Method 5) and predictive accuracy across all the contamination scenarios they considered. However, one outlying observation for their dataset with a sample size of 698 genotypes corresponds to a level of contamination of merely 0.1%. As stated by [10], outliers may arise in plant breeding studies from measurement errors, inherent characteristics of the studied genotypes, enviroments or even years. As the process generating the outliers may vary across locations and/or trials, it is conceivable that a nonneglegible percentage of phenotypic observations may be typically contaminated when large field trial datasets are considered. As a result, the composite effects of such substantial levels of contamination on the accuracy of methods for estimating heritability and accuracy of GP can be potentially considerable. Such outliers may not always be easy to detect and eliminate prior to phenotypic data analysis. Therefore, using robust statistical procedures for phenotypic data analysis of field trial datasets can help ameliorate the adverse effects of outliers.
Robust statistical methods have been around for a long time and are designed to be resistant to influential factors such as outlying observations, nonnormality and other problems associated with model misspecification [17]. Therefore, the use of robust methods has been advocated for inference in the linear and linear mixed model setups [6, 25], as well as in ridge regression [1, 15, 26, 27, 45, 52]. As a result of such considerations and the recent advances in computing power, it is not surprising that there has been a strong, renewed interest in exploring these techniques to robustify existing methods or develop new procedures robust to moderate deviations from model specifications [24, 41].
Consequently, to tackle the problem of biased estimation of heritability and predictive accuracy due to contamination of phenotypic data with outliers, we aim to robustify the first phase of the twostage analysis used in GP. We use a MonteCarlo simulation study encompassing several contamination scenarios to assess the performance of the proposed robust approach relative to: (i) the approach used by [35], and (ii) simulated underlying true breeding values taken as the gold standard. These assessments are carried out at each of the two stages involved in predicting breeding values by comparing the accuracy with which the two approaches estimate true genotypic values in phenotypic analysis. In a third stage, we compare the heritabilities (H^{2}) and predictive accuracies (PA) estimated by the two competing approaches using Method 5 (H^{2} and PA) and Method 7 (PA only). In addition, we compare the heritability estimated by Method 5 with the generalized heritability estimated by Oakey’s method [29]. The latter method was not evaluated by [9].
Also, an application of the methodology to real commercial maize (Zea mays) and rye (Secale sereale) datasets is presented and used to empirically assess the usefulness of the proposed robust approach. Lastly, we discuss how to effectively apply the proposed robust approach to phenotypic data analysis and the estimation of heritability and predictive accuracy of GP in plant breeding.
The robust and the classical approaches are implemented in the R software using the code in the supplementary materials (Additional file 5). The ASREMLR package is used to fit the models at the second stage.
Materials and methods
Datasets
Rye dataset: The Rye data were obtained from the KWSLOCHOW project and is described in more detail elsewhere [2, 3]. These data consist of 150 genotypes tested between 2009 and 2011 at several locations in Germany and Poland, using α designs with two replicates and four checks (replicated two times in the two replicates). Each trial was randomized independently of the others. The field layout of some trials was not perfectly rectangular. Trials at some locations and for some years had fewer blocks but larger size, i.e., two different sizes were used for a few trials. Blocks were nested within rows in the field layout. The dataset has 16 anomalous observations pertaining to distinct genotypes, that the breeders identified as outliers. Moreover, yield was not observed for one genotype. For this example we consider two complete datasets (320 observations): the first is the original dataset without any corrections, which we call the ’raw’ dataset, and the second is the original dataset with the 16 yield observations replaced with missing values, which we refer to as the ’processed’ dataset. In addition, we consider a cleaned version of the raw dataset (288 observations; called cleaned dataset) obtained by removing from the raw data the 16 outlying genotypes (32 observations) identified by both the breeders and the criterion used for outlier detection described in the “Example application” section. We note that because the empirical rye dataset has only two replicates, a single outlier will automatically generate an outlier with the same absolute value of opposite sign for the other replicate of the same genotype. Consequently, we removed a testcross genotype entirely from the cleaned dataset even if only one of its two replicate observations was outlying. The raw, processed and cleaned datasets comprise only 148, 148 and 132 genotypes with genomic information, respectively.
Maize dataset: The maize dataset was produced by KWS in 2010 for the Synbreed Project. The data set has 1800 yield observations on 900 doubled haploid maize lines and 11,646 SNP markers. Out of the 900 test crosses 698 were genotyped whereas 202 were not. The test crosses were planted in a single location (labelled RET) on nine 10 by 10 lattices each with two replicates. Six hybrid and five line checks connected the lattices (398 observations in total). The lines were crossed with four testers. After performing quality control, the breeder recommended replacement of 38 yield observations with missing values. A more elaborate description of this maize dataset is provided in [9, 11].
For this example we consider two datasets each with 1800 yield observations: the first is the original dataset without any corrections, which we call the ’raw’ dataset, and the second one is the original dataset with the 38 yield observations replaced with missing values, which we refer to as the ’processed’ dataset. Furthermore, we consider a third dataset (called cleaned raw dataset) obtained by removing 46 outliers from the raw dataset. The fourth dataset (called the cleaned and processed dataset) is obtained by removing seven outliers from the processed dataset. All the outliers satisfied the criterion for outliers described in the “Example application” section. As with the rye dataset, we removed a testcross genotype entirely from the raw dataset if at least one of the two replicate observations was outlying. Thus, the raw, processed, cleaned raw and cleaned and processed datasets have 1800, 1754, 1800 and 1793 yield observations and 698, 687, 698 and 697 genotypes with genomic information, respectively.
Genomic prediction
True correlation
The correlation between the true (g) and the predicted (\(\mathbf {\widehat {g}}\)) breeding values (true correlation or true predictive accuracy) can be calculated from simulated data as
where \(s_{g,\widehat {g}}\) is the sample covariance between the true and predicted breeding values, \(s^{2}_{g}\) and \(s^{2}_{\widehat {g}}\) are the sample variances of the true and predicted genetic breeding values, respectively. This correlation is often the quantity of primary interest in breeding studies. The simulation study therefore assesses the accuracy with which \(r_{g,\widehat {g}}\) is estimated by Methods 5 and 7, whose details are described below.
Twostage approach for predicting breeding values
Estaghvirou et al. [9] use the twostage approach of [35] to predict true breeding values (g) that are then used to estimate heritability and predictive accuracy. This approach is quite appealing because it greatly alleviates the computational burden of the singlestage approach [47], without compromising the accuracy of the results.
The singlestage model can be written as
where y is the vector of the observed phenotypic plot values, ϕ is the general mean, f is a vector that combines all the fixed, random design and error effects (replicates, blocks, etc.). For the simulated data f has four random effects only, namely, f=Z_{g}g+Z_{r}u_{r}+Z_{b}u_{b}+e where (i) Z_{g} is the design matrix for the genotypes with \(\mathbf {g}\sim N\left (0,\mathbf {Z}_{s}\mathbf {Z}_{s}^{T}\sigma ^{2}_{s}=\tilde {\mathbf {G}}\right)\), Z_{s} is the matrix of biallelic markers of the single nucleotide polymorphisms (SNPs), coded as −1 for genotypes AA, 1 for BB and 0 for AB or missing values and \(\sigma ^{2}_{s}\) is the variance of the marker effects; (ii) Z_{r} is the design matrix for the replicate effects with \(\mathbf {u}_{r}\sim N\left (0,\sigma ^{2}_{r}\mathbf {I}\right)\) and \(\sigma ^{2}_{r}\) is the variance of the replicate effects; (iii) Z_{b} is the design matrix for the block effects with \(\mathbf {u}_{b}\sim N\left (0,\sigma ^{2}_{r:b}\mathbf {I}\right)\) and \(\sigma ^{2}_{r:b}\) is the variance of the block effects; and (iv) e∼N(0,R) are the residual errors and R is the variancecovariance matrix of the residuals. In our model \(\mathbf {R}=\sigma ^{2}_{e}\mathbf {I}\) where \(\sigma ^{2}_{e}\) is the residual plot error variance.
The twostage approach basically breaks this model into two models. In the first stage, which we seek to robustify, we use the model
where y is defined as before, X=Z_{g} is the design matrix for the genotype means, μ=ϕ1+g is the vector of unknown genotypic means with g denoting the genetic effects or breeding values, and \(\tilde {\mathbf {f}} = \mathbf {Z}_{r}\mathbf {u}_{r} + \mathbf {Z}_{b}\mathbf {u}_{b} + \mathbf {e}\). Note that in this first stage the genomic information regarding the SNP markers \(\left (\boldsymbol {\Gamma }=\mathbf {Z}_{s}\mathbf {Z}_{s}^{T}\right)\) is excluded from this analysis because genotype means μ are modelled as fixed. This is usually the case when stagewise approaches are considered, in which case the genomic information is included only in the last stage [35].
In the second stage, the genotype means \(\widehat {\boldsymbol {\mu }}\) estimated at the first stage are used as a response variable in a model for estimating the true breeding values g specified as
where ϕ is the general mean and \(\tilde {\mathbf {e}}\sim N(0,\tilde {\mathbf {R}})\) with \(\tilde {\mathbf {R}}=var(\hat {\boldsymbol {\mu }}\mid \phi, \mathbf {g})\).
Note that any standard varieties or checks are dropped from the dataset before the adjusted means (\(\hat {\boldsymbol {\mu }}\)) from the first stage are submitted to the second stage. The mixed model equations for (4) can be solved to obtain the best linear unbiased prediction for g, \(\text {BLUP}(\mathbf {g})=\widehat {\mathbf {g}}\), using a ridgeregression formulation of BLUP, i.e., RRBLUP.
In case weights are used when fitting the secondstage model, then \(\tilde {\mathbf {R}}\) should be replaced by W^{−1}, with W being a weight matrix computed from the estimated firststage variancecovariance matrix \(\tilde {\mathbf {R}}\). In our case we used Smith’s [46] and standard (ordinary) [35] weights. Specifically, \(\mathbf {W}_{sm}=diag(\tilde {\mathbf {R}}^{1})\) for Smith’s and \(\mathbf {W}_{st}=(diag(\tilde {\mathbf {R}}))^{1}\) for standard weights, respectively.
More details on the twostage approach can be found in [9, 35, 36].
Method 5
This method (M5) calculates predictive accuracy as
where \(\mathbf {V}=\tilde {\mathbf {G}}+\tilde {\mathbf {R}}\) with V, \(\tilde {\mathbf {G}}\) and \(\tilde {\mathbf {R}}\) being the variancecovariance matrices for the phenotypes, genotypes and residual errors of the adjusted genotypes, respectively; \(\mathbf {P}_{u}=\frac {1}{n1}\left (\mathbf {I}\frac {1}{n}\mathbf {J}_{n}\right)\), with J_{n} a n×n matrix of ones; \(\mathbf {C}=\tilde {\mathbf {G}}\mathbf {V}^{1}\mathbf {Q}\), with Q=I−1(1^{T}V^{−1}1)^{−1}1^{T}V^{−1}, and 1 denoting a vector of ones. Under this formulation, which provides a direct estimate of the correlation between the true (g) and the predicted (\({\mathbf {\widehat {g}}}\)) breeding values, the RRBLUP of g is now given by \(\widehat {\mathbf {g}}=\tilde {\mathbf {G}}\mathbf {V}^{1}\mathbf {Q}\widehat {\boldsymbol {\mu }}\) [34].
Heritability can then be computed from (5) as
Method 7
This method (M7) is commonly used by animal breeders to directly compute predictive accuracy (ρ) from the mixed model equations (MME, [12, 28, 51]) by firstly computing the squared correlation between the true (g) and predicted breeding values (\(\widehat {\mathbf {g}}\)), i.e., reliability (ρ^{2}).
Since the MME for the secondstage model (4) are given by
with the variancecovariance matrix of \((\hat {\phi }\phi,\hat {\mathbf {g}}\mathbf {g})\) given by
and the variancecovariance matrix of g and \(\widehat {\mathbf {g}}\) given by
the reliability for each genotype is computed as
where only the diagonal elements of the matrices \(var(\mathbf {g})=\tilde {\mathbf {G}}\), \(var(\widehat {\mathbf {g}})=\tilde {\mathbf {G}}\mathbf {C_{22}}=cov(\mathbf {g},\widehat {\mathbf {g}})\) are extracted. The average reliability across the genotypes in each dataset is then estimated by
where n is the total number of genotypes in the dataset. Predictive accuracy (\(\widehat {\rho \ }_{m_{7}}\)) is then computed as the square root of \(\widehat {\rho \ }^{2}_{m_{7}}\). Alternatively, predictive accuracy can be computed as
Further details on this derivation can be found in [36].
Oakey’s method
Oakey et al. [29] propose a generalized heritability measure that was recently reexpressed by [40] as
where \(\mathbf {D}=\mathbf {I}_{n}{\tilde {G}}^{1}\mathbf {C_{22}}\), s is the number of zero eigenvalues and n−s is the effective dimension of D. We also use this method to estimate heritability and compare this estimate with the estimate obtained by method M5.
Robust estimation
Robust estimation of the linear mixed model for phenotypic data analysis
In this section we briefly review the robust approach of [19] to linear mixed effects models that we use in an attempt to robustify the first stage of the twostage approach to genomic prediction in plant breeding. This approach is implemented in the R software package robustlmm via the function rlmer() [20, 21].
We consider the general linear mixed model
where y is a vector of observations, X is the design matrix for the fixed effects (intercept included), μ is the vector of unknown fixed effects, H is the design matrix for the random effects, u∼N(0,U) is the vector of unknown random effects and e∼N(0,R) is the vector of random plot errors. Note that for our firststage model Hu=Z_{r}u_{r}+Z_{b}u_{b} and μ=ϕ1+g.
Model (13) also assumes that cov(u,e)=0 and as such we have that
We henceforward assume for simplicity that \(\mathbf {e}\sim N \left (\mathbf {0},\sigma ^{2}_{e}\mathbf {I}\right)\) and \(\mathbf {u}\sim N\left (\mathbf {0},\sigma ^{2}_{e} \mathbf {A}(\boldsymbol {\theta })\right)\) where the variance matrix A of the random effects depends on the vector of unknown variance parameters θ (this assumption can be relaxed to obtain more general formulations, see e.g., [19]). The variance of y now simplifies to
with Φ=HA(θ)H^{′}+I.
Because A(θ) is a positivedefinite symmetric matrix and assuming that θ is known, one can obtain its Cholesky decomposition as chol(A(θ))=B(θ), set u=B(θ)b and rewrite model (13) as
where \(\mathbf {b}\sim N(\mathbf {0},\sigma _{e}^{2}\mathbf {I})\) so that we again have \(\mathbf {y}\sim N\left (\mathbf {X}\boldsymbol {\mu },\sigma ^{2}_{e}\mathbf {\Phi }\right)\).
The classical loglikelihood for (14) can be written as
Furthermore, for a given set of θ,μ and σ_{e} ([44], Chapter 7)
From (15) and (16), an objective function that incorporates the observationlevel residuals and the random effects as separate additive terms can be derived and expressed as
where
This particular trick is crucial in order to independently control contamination at the levels of the residual and random effects.
Assuming θ and σ_{e} are known and taking the partial derivatives of (17) with respect to μ and b^{∗}, we get the following estimating equations for these effects,
where
If B(θ) is diagonal, as in our case, these equations are robustified by replacing \(\widehat {\mathbf {e}}^{*}\) and \(\widehat {\mathbf {b}}^{*}\) by bounded functions \(\psi _{e}(\widehat {\mathbf {e}}^{*})\) and \(\psi _{b}\left (\widehat {\mathbf {b}}^{*}\right)\), where the ψ_{e} and ψ_{b} functions need not be the same:
where \(\lambda _{\bullet }=\mathbb {E}_{0}[\psi '_{\bullet }]\) is required to balance the \(\widehat {\mathbf {e}}^{*}\) and \(\widehat {\mathbf {b}}^{*}\) terms in case different ψ functions are used; 1/λ_{e} and 1/λ_{b} are scaling factors (as in Mregression [17]) and cancel out in the special case where ψ_{e}≡ψ_{b}.
If we let
Λ_{b}=λ_{e}/λ_{b}, \(\mathbf {W}_{e}=\mathbf {Diag}(w_{e}(e^{*}_{i}/\sigma _{e}))\) and \(\mathbf {W}_{b}=\mathbf {Diag}(w_{b}(b^{*}_{i}/\sigma _{e}))\), and after some simplification, Eq. (20) can be written as
which, after expanding \(\widehat {\mathbf {e}}^{*}\) with (19), yields the following system of linear equations:
The algorithm for estimating parameters of (21) begins with a predefined set of weights. It then alternates between computing \(\widehat {\boldsymbol {\mu }}\) and \(\widehat {\mathbf {b}}^{*}\) for a given set of weights and updating the weights for a given set of estimates. Koller and Stahel [18] and Koller [19] provide more details on the estimation of the scale and covariance parameters and the estimation procedure for the nondiagonal case.
If replicate and block (nested within replicates) are the only random effects apart from the residual error in the firststage model (this is the case for the simulation study for our firststage model and for the firststage model for the rye dataset) then \(\boldsymbol {\theta }=\left (\frac {\sigma ^{2}_{r}}{\sigma ^{2}_{e}},\frac {\sigma ^{2}_{r:b}}{\sigma ^{2}_{e}}\right)\), where \(\sigma ^{2}_{r}\) and \(\sigma ^{2}_{r:b}\) are the variances for the replicate and block random effects, respectively. Also here, A(θ) is a twoblock diagonal matrix (k=2 blocks). Furthermore, because we assume \(\mathbf {u}_{r}\sim N(0,\sigma ^{2}_{r}\mathbf {I})\) and \(\mathbf {u}_{b}\sim N(0,\sigma ^{2}_{r:b}\mathbf {I})\) for the firststage model, B(θ)=[A(θ)]^{1/2} is a diagonal matrix.
In particular, for the simulated data consisting of 698 observations of maize yield from 2 replicates each having 39 blocks (more details in the “Simulation” section), we compute 2+39=41 weights (W_{b}) for the observations at the level of the random effects and 2×698=1396 weights (W_{e}) for the observations at the level of the fixed effects (i.e., for the residuals).
Robust approach to phenotypic analysis
Phenotypic data derived from field trials are prone to several types of contamination that may range from measurement errors, inherent characteristics of the genotypes and the environments to the years in which the trials were conducted. As such, if contaminated observations are present in the vector of phenotypes y in the first stage of phenotypic data analysis, then they can unduly influence the estimation of the means for the testcross genotypes (μ) in model (3), resulting in inaccurate estimates of adjusted phenotypic means \(\widehat {\boldsymbol \mu }\). In turn, these possibly inaccurate estimates of μ are passed on to the second stage of the procedure (model (4); adjusted RRBLUP) from which the breeding values g are estimated. The possibly biased estimates of (g) may undermine the accuracy of the estimated heritability and predictive accuracy.
To minimize bias in the estimation of heritability and predictive accuracy, we propose using the preceding robust model for the first stage of phenotypic data analysis. The second stage then proceeds in the same way as the classical method except that, now, the robust estimates \(\widehat {\boldsymbol \mu }_{R}\) from the first stage are used in (4).
Simulation
Simulated datasets
We consider a real maize dataset from the Synbreed Project (2009−2014). This dataset was extracted for one location from a larger dataset and consists of 900 doubled haploid maize lines, of which only 698 testcrosses were genotyped, and 11,646 SNP markers. Six hybrid checks and five line checks were considered and genotypes were crossed with four testers as explained in more detail in [9]. Variance components estimated from this dataset (\(\sigma ^{2}_{r}=0\), \(\sigma ^{2}_{r:b}=6.27\), \(\sigma ^{2}_{e}=53.8715\) and \(\sigma _{s}^{2}=0.005892\)) were used to simulate the block and plot effects based on an αdesign [31] with two replicates and the model
where y_{ijk} is the yield of the ith genotype in the jth block nested within the kth complete replicate, ϕ is the general mean, r_{k} is the fixed effect of the kth complete replicate, b_{jk} is the random effect of the jth block nested within the kth complete replicate, g_{i} is the random effect of the ith genotype, and e_{ijk} is the residual plot error associated with y_{ijk}. More details on (22) can be found in Table S3 in the supplementary materials of [10].
Our simulations consider 1000 simulated Maize datasets described as follows: each dataset consists of 698 observations of yield in 2 replicates, with the 698 genotypes distributed over 39 blocks as in Table 1. Four out of the 39 blocks have 17 observations, whereas the remaining 35 have 18 observations.
Simulation of outliers
The type of outliers we consider, commonly known in the literature as shiftoutliers (or location outliers), are typically the hardest type to detect in multivariate settings because they have the same shape (the same covariance structure but shifted mean) as the overall data [39]. The shiftoutliers can arise from various contamination sources, including the following: errors, inherent characteristics of the genotype(s) in a particular spatial location or replicate, or, occurrence of a specific phenomenon that negatively or positively impacts the genotype(s). Although our simulations focus on these particular cases, other types of outliers that we do not consider here are certainly conceivable (see [39] for more details).
In order to simulate outliers, a percentage of phenotypic observations in the dataset is chosen and contaminated by replacing the observed value of each selected observation by that value plus 5, 8 or 10 times the standard deviation of the residual error (σ) used to simulate the phenotypic datasets. Additionally, we also consider two distinct scenarios of data contamination:

(i)
Random contamination: 1, 3, 5, 7 and 10% of the phenotypic data in only one of the two replicates are randomly contaminated, amounting to an overall data contamination rate of 0.5, 1.5, 2.5, 3.5 and 5%, respectively.

(ii)
Block contamination: phenotypic data in 1, 2, 3, 4 and 5 whole blocks in only one of the two replicates are contaminated, amounting approximately to 1.3, 2.6, 3.9, 5.2 and 6.5% overall rate of data contamination, respectively.
We use the notation “% cont" to denote a particular percentage (%) of data contamination with outliers, “sd” to denote the size of the outliers and “No.blocks" to refer to the number of contaminated blocks.
First and secondstage models
In the first stage (Eq. 3), we consider yield as the response variable, the genotypes as the fixed effects and the replicates and blocks nested within replicates as the random effects. In the second stage (Eq. 4), we consider the adjusted genotypic means estimated in the first stage as the response variable, the intercept as the fixed effect and the genotypes as the random effects with a variancecovariance structure given by the genomic relationship matrix.
Comparing performance of the classical and robust approaches
The performance of the classical and robust approaches is evaluated in three steps, labelled L1, L2 and L3. L1 involves a comparison of results from the first stage; L2 entails a comparison of results from the second stage and L3 focuses on a comparison of the estimated heritability and predictive accuracy, which can be viewed as constituting the third stage. For each of the three levels, we consider the null scenario (uncontaminated datasets), random and block contamination scenarios.
Additionally, the influence of the Smith’s and standard weighting schemes used in the second stage of the twostage approach are considered in L2.
The following quantities are computed and used to compare the performance of the classical and robust approaches at levels L1–L3.
L1: The mean squared deviation (MSD) of the estimated from the true genotypic means is computed for both the classical and robust approaches as
where μ_{il} is the true mean of the ith genotype in the lth simulation run and \(\widehat {\mu }_{il}\) is its estimate.
The estimates of MSD\(_{\widehat {\mu }}\) for the classical (C) and robust (R) approaches are compared for each scenario using
and are expected a priori to agree for the null scenario.
It is also instructive to compute and plot
for each genotype i=1,...,698 for both approaches. Furthermore, the overall estimated genotypic mean (across genotypes and simulations) is also computed and compared to the corresponding true genotypic mean. Moreover, since the rank order of genotypes is also of great importance in plant breeding studies, the Pearson correlation coefficient (r_{p}) between the true and estimated genotypic means (predictive accuracy) is also computed and compared between the two approaches. This yields an estimate of the predictive accuracy for the genomic means.
L2: At this level, we compute the MSDs for the genomic breeding values g analogously to Eqs. (23)–(25). The r_{p} between the true and estimated breeding values is again computed and used to compare the two methods and assess any improvement in the estimation of g when genomic information is included in the analysis. This provides an estimate of the accuracy of genomic prediction.
L3: Here, the methods are compared by computing the following MSDs,
where \(r_{g, \widehat {g}}\) is the Pearson correlation computed between the true and the estimated breeding values and averaged across the 1000 simulations, \(\widehat {H}^{2}_{l}\) and \(\widehat {r}_{g, \widehat {g},l}\) are, respectively, the heritability and predictive accuracy estimated in the sth simulation via the methods described earlier. These MSDs quantify the deviation of the estimated from the true heritability (H^{2}) or predictive accuracy (PA). In addition, we provide boxplots of the estimated heritablity and predictive accuracy for the 1000 simulation runs for each scenario.
Simulation results
Null scenario
L1: The following computed MSDs
show, as expected, that both methods perform similarly when the data are not contaminated (\(\text {MSD}_{\widehat {\mu }}\simeq 0\)). However, the classical method performs slightly better than the robust one (\({\text {MSD}_{\mu }^{C}}\lesssim {\text {MSD}_{\mu }^{R}}\)). Even so, both MSD values are not particularly close to zero. Still, as these MSD values are squared deviations averaged across all the 1000 simulation runs and 698 genotypes, they seem reasonable.
The slightly better performance of the classical relative to the robust method is also apparent in the pergenotype MSDs (Additional file 2: Figure S1). The two approaches produce virtually identical estimates for the overall mean of \(\widehat {\boldsymbol \mu }\) (i.e., mean{μ_{il}}, i=1,...,698, l=1,...,1000) and r_{p} (Table 2).
The two methods estimate the variances of both the random effects and the residual errors equally well (Additional file 2: Figure S2).
The Smith’s and standard weights obtained in the first stage for both the classical and robust approaches are very small (Additional file 2: Figure S3). Precisely, the MSD between the two different types of weights is approximately 0.6×10^{−6} and the MSD between the values of each type of weight computed by the two approaches is about 0.6×10^{−5}.
L2: There were no major differences between the estimated breeding values obtained using either the standard or Smith’s weighting schemes at the second stage (MSD _{g}≃25 for both cases). For this reason we only present results produced using Smith’s weights.
The MSDs for the second stage
show a modest improvement over the corresponding estimated genotypic means at the first stage and that the methods continue to perform similarly as in the first stage. Relative to the estimates for the first stage, the pergenotype MSDs (Additional file 2: Figure S13) increase for about 22% but decreases for about 47% of the genotypes. This trend is similar for both the classical and robust approaches. Additionally, for the second stage, the mean r_{p}=0.903 for both approaches. This increase in r_{p} relative to the first stage (≃18.2%) shows that using genomic information at the second stage improves genomic prediction and hence the ranking of genotypes. For the overall mean of the EBVs (\(\widehat {\mathbf {g}}\)), it drops to ≃5 from ≃9 for both approaches (first row, Additional file 1: Tables S2 & S4).
Quite interestingly, in terms of the estimation of the genetic variance, the robust approach performs slightly better than the classical (Additional file 2: Figure S14).
L3: Both the classical and robust approaches produce the following MSDs for heritability (Method M5 only) and predictive accuracy (Methods M5 and M7):
showing the estimates of heritability and predictive accuracy to be quite accurate. We note that estimates of heritability and predictive accuracy were computed by fixing the residual variance from the first stage to one as described in the “Genomic prediction” section. In general this produced more accurate estimates than the alternative for which the residual variance estimated in the first stage is used. Therefore all the results displayed here for the third stage use the former implementation.
Boxplots for the estimated PA (methods M5 and M7) and H^{2} (method M5 only) across the 1000 simulations for the null scenario are shown together with the ones for the random and block contamination scenarios (Additional file 2: Figures S19S20). These suggest that method M5 produces more accurate estimates of PA than method M7.
Relative to method M5, Oakey’s heritability estimates are unacceptably lower than the simulated true values (Fig. 1). To further explore why these estimates were remarkably smaller than those produced by M5 or the simulated true values, we took 100 out of the 1000 simulation replicates and refitted the entire twostage model by setting \(\tilde {\mathbf {G}}=\sigma ^{2}_{g}\mathbf {I}\) at the second stage. The heritability estimates for M5 and Oakey’s were virtually identical (MSD ≃0.1×10^{−27}). This strongly suggests that Oakey’s method works fine with independent genotypes but performs poorly when the model used to estimate heritability has a kinship matrix. Consequently, we do not consider Oakey’s heritability estimates further except in a few comparisons in the “Example application” section.
Random contamination scenarios
L1: The MSD_{μ} for these scenarios are similar between approaches for each level of contamination and size of outlier (Tables 3 and Additional file 1: Table S1). Hence for the random contamination scenarios, the robust and classical approaches produce comparable estimates for the genotypic means. The pergenotype MSDs also reaffirm the similar performance of the two approaches (Additional file 2: Figure S4). Nevertheless, it is noteworthy that even for the least extreme scenario, 1% contamination with an outlier of size 5 sd, the increase in the pergenotype \(\text {MSD}_{\mu }^{i}\) is nonnegligible relative to the corresponding values computed by the classical method under the null scenario and used as a benchmark (Fig. 2). The pergenotype MSDs increase greatly with increase in the percentage contamination and size of outliers (Additional file 2: Figure S4).
The estimated overall mean of the estimated genotypic means (\(\widehat {\boldsymbol {\mu }}\)) for both approaches departs increasingly from the true overall mean of μ as both the level of contamination and size of outliers increase (Additional file 1: Table S2). The same trend is evident for the r_{p}, implying deterioration of the ranking of genotypes (Additional file 1: Table S2).
The two methods also differ with respect to how well they estimate particular variance components. More precisely, the classical method estimates the variance for blocks nested within replicates (\(\sigma ^{2}_{r:b}\)) somewhat better than the robust method does from 5% contamination upwards. However, the robust method estimates the variances for replicates (\(\sigma ^{2}_{r}\)) and residual errors\(\left (\sigma ^{2}_{e}\right)\) far better than the classical method does (Additional file 2: Figures S6–S8).
The Standard and Smith’s weights computed for both the classical and robust approaches across the random contamination scenarios are shown in Additional file 2: Figures S9 and S10.
As the percentage of contamination and size of the outliers increase, the degree of overlap of the empirical frequency distributions of the classical and robust weights evidently reduces. In particular, the distributions do not overlap at all from the 3% contamination level upwards for the 8− and 10−sd shiftoutliers. Also, the weights show an overall decreasing trend, which is more evident for the classical approach and the Standard weights.
L2: The mean squared deviations (MSD_{g}) between the EBVs and the TBVs for the classical approach are displayed in Additional file 1: Table S5. The MSD_{g}’s for the Standard and Smith’s weights show some minor differences in favour of the latter from 7% contamination and 8sd shiftoutliers onwards. Thus, only the secondstage results obtained using the Smith’s weights are presented in the remainder of this section.
The robust approach tends to produce smaller MSDs between the EBVs and the TBVs as the percentage of random contamination and the size of the shiftoutliers increase (Tables 4 & Additional file 1: Table S1). The secondstage pergenotype MSDs do not show the increasing trend observed for the pergenotype MSDs from the first stage with values ranging between 0 and 100 (Additional file 2: Figure S16). In addition, the robust method always produces higher estimated predictive accuracy, expressed as the averaged Pearson correlation coefficient r_{p}, than the classical method, implying better ranking of the genotypes (Additional file 1: Table S2). The overall mean EBVs (\(\widehat {\mathbf {g}}\)) is similar for both approaches but drops steadily as the percentage contamination and size of outliers increase, implying underestimation.
The robust approach also produces more accurate estimates of the markereffect variance (\(\sigma ^{2}_{s}\)) up to 10% contamination and 5sd shiftoutliers (Additional file 2: Figure S15). For the 10% contamination scenarios with 8 and 10sd shiftoutliers the robust estimates of \(\sigma ^{2}_{s}\) no longer overlap with the true markereffect variance but their boxplots show a smaller interquartile range (IQR) and lower dispersion than those for the classical approach.
L3: The robust approach produced generally more accurate estimates for both H^{2} and PA for the random contamination scenarios (Additional file 2: Figures S18S20). However, both approaches tend to underestimate both parameters as the percentage contamination and size of outliers increase. The \(\text {MSD}_{\texttt {H}}^{\texttt {M}5}\) ranged between approximately 0.00−0.09 and 0.00−0.07 for the classical and robust methods, respectively. The corresponding, \(\text {MSD}_{\texttt {PA}}^{\texttt {M}5}\) ranged between approximately 0.00−0.02 and 0.00−0.01 whereas \(\text {MSD}_{\texttt {PA}}^{\texttt {M}7}\) ranged between approximately 0.01−0.04 and 0.01−0.07. Overall, method M5 performs somewhat better than method M7 in estimating predictive accuracy (Additional file 1: Table S6 and Additional file 2: Figures S18S20).
Block contamination scenarios
L1: Although the MSD_{μ} for the block contamination scenarios is relatively stable for the robust approach (between 29.08 and 30.11), it increases with increasing level of contamination and size of the outliers for the classical approach (Tables 5 & Additional file 1: Table S3). In the worst block contamination scenario (5_10) the MSD for the classical approach is about 1.7 times larger than that for the robust approach. The pergenotype MSDs show even poorer performance for the classical method in estimating each of the 698 genotypic means (Additional file 2: Figure S5). By contrast, the robust approach maintains the errors at roughly the same level across the contamination scenarios; a level that is close to the one estimated for the null scenario. This is an attractive property of this method.
Block contamination had generally less debilitating effect on the accuracy of the estimated genotypic means than random contamination (Additional file 1: Tables S1 and S3). For example, the block contamination scenario 1_5, which corresponds to an overall 1.3% data contamination, produces smaller MSDs than the random contamination scenario 1_5, which corresponds to only 0.5% overall contamination (Figs. 2 and 3; Table 5).
The performance of the two methods also differed noticeably with respect to the estimation of the overall mean of \(\widehat {\boldsymbol {\mu }}\). For example, for the worst case scenario (block 5_10) the overall mean of \(\widehat {\boldsymbol {\mu }}\) deviated from the true mean by merely 5.5% for the robust approach but by 50.2% for the classical approach (Additional file 1: Table S4), indicating superior performance of the robust approach. Nevertheless, the poor predictive performance of the classical approach at the first stage does not necessarily translate to a reduced predictive accuracy r_{p} because it does not alter the relative ranking of the genotypes (Additional file 1: Table S4). Accordingly, the ranking of the genotypes does not differ much between the two approaches (estimated r_{p}≃0.76 for both approaches across all scenarios).
An overall superior performance of the robust compared to the classical approach is also evident for the accuracy of the estimated variance components (Additional file 2: Figures S6S8).
L2: In this case, the MSD _{g} obtained in the second stage differ depending on whether the Smith’s or the standard weights are used. In particular, using the Smith’s weights produces more stable MSD _{g} estimates across all the block contamination scenarios than using the standard weights, which tend to increase with increasing number of contaminated blocks and size of outliers (Additional file 1: Table S5). For this reason, only results obtained using the Smith’s weights are presented in the remainder of this section.
For all levels of contamination and size of outliers, the robust overall MSDs between the EBVs and the TBVs did not differ much and fluctuated around ≃25 (Additional file 1: Table S5), a value that is similar to the corresponding value for the null scenario (Table 6).
The pergenotype MSD _{g} values vary little with increasing size of outliers but suggest that the classical method performs slightly better than the robust method (Additional file 2: Figure S17).
The average estimated predictive accuracy (r_{p}) across all scenarios was approximately 0.90 for both approaches (Additional file 1: Table S4). Predictive accuracy thus increased from the first to the second stage for the classical (by 17%) and robust (by 18%) approaches, an increase comparable to that observed under the null scenario.
Finally, the robust method estimates the markereffect variance \(\sigma ^{2}_{s}\) more accurately than the classical method throughout all the block contamination scenarios (Additional file 1: Table S15).
L3: MSD\(_{\texttt {H}}^{\texttt {M5}}\) and MSD\(_{\texttt {PA}}^{\texttt {M5}}\) were both ≃0.00 for both the classical and robust approaches across all the block contamination scenarios, with the classical producing marginally better results than the robust approach (Additional file 2: Figures S18 and S19). MSD\(_{\texttt {PA}}^{\texttt {M7}}\simeq 0.01\) for both approaches with the robust estimates of PA obtained via M7 showing slightly greater dispersion (Additional file 2: Figure S20). It is noteworthy that estimates of H^{2} and PA are rather stable across block contamination scenarios (Additional file 2: Figures S18S20), consistent with the estimated markereffect variances (Additional file 2: Figure S15).
Example application
In this section, we comparatively evaluate differences in the performances of the classical and robust approaches on raw empirical rye and maize datasets prior to quality control. Substantial differences in results between the two approaches would imply problems with the data that require closer inspection by the breeder or data analyst. Such inspection can be followed by data cleaning, which can be a very challenging and timeconsuming task. For the two example datasets in this section, we perform data cleaning based on a simple rule of thumb that relies on the weight given to each observation by the robust method. Specifically, observations assigned weights smaller than 0.5 are flagged as outliers. More sophisticated outlier detection techniques are outside the scope of this paper [3, 23]. We apply the classical and robust approaches to the cleaned dataset and compare the results with each other and with the results for the raw dataset. We note that cleaning the data does not necessarily make it conform to model assumptions such as the normality of the errors. We note further that because empirical datasets for both examples each have only two replicates, the robust method usually assigns the same, or very similar, weights to both replicates. This is the reason that a testcross genotype is removed entirely from the cleaned dataset even if only one of its two replicate observations is outlying. This problem would be eliminated by replicating each genotype three or more times.
We similarly analyze the empirical datasets after taking into account the recommendations of the breeder based on quality control to demonstrate that quality control alone will not always detect and eliminate all sources of data contamination and hence does not preclude the use of robust statistical methods.
Rye dataset
In this example we consider only one trial from the Rye dataset described in the “Materials and methods” section, which otherwise has the same structure as the simulated maize data set shown in Table 1. The first and secondstage models fitted to the Rye data set are the same as those described in the “Simulation” section.
The classical and robust approaches produced strikingly different estimates for the residual and blocks variances at the first stage as well as for heritability and predictive accuracy at the third stage (Table 7; CLS ^{r} and ROB ^{r} results). The robust weights assigned to each of the 320 observations in the first stage identified 32 observations for the exact same 16 genotypes identified as outliers by the breeders. When the 32 observations are removed from the data, which amounts to around a 10% reduction in the size of the dataset, then the classical and robust approaches produce very similar estimates, as is expected when data conform to the model’s assumptions (Table 7; CLS ^{r}* and ROB ^{r}* results). In particular, the distributions of the residuals from the classical first and secondstage models fit to the cleaned dataset satisfy the normality assumption (ShapiroWilk normality test: p=0.9771 and p=0.6974, respectively), but the distribution of the residuals from the raw dataset do not (ShapiroWilk normality test: p<10^{−9}). Inspection of the QQ plots of the residuals (not shown) further reinforced the results of the normality tests. This example clearly demonstrates how the robust approach ameliorates most of the devastating influences of outliers on the classical method. Thus, contamination with outliers inflates the estimated residual variance ∼20 times for the classical method but only ∼3.6 times for the robust method. By contrast, contamination reduces the estimated block variance from ∼11.5 to zero for the classical method but from ∼11.9 to ∼8.2 for the robust method. Lastly, contamination reduces the estimated heritability and predictive accuracy far more strikingly for the classical than for the robust approach. However, contamination inflates the markereffect variance equally by a factor of two for both approaches. Although far lower than the simulated true values, Oakey’s heritability estimates are also shown and compared between the full and the cleaned datasets for completeness (Table 7).
Results for the processed dataset are similar for the two approaches (Table 7 ; CLS ^{qc} and ROB ^{qc}). They are also quite similar to the results for the cleaned dataset, except for the estimated markereffect variances, which are smaller for the cleaned dataset perhaps due to the decrease in the sample size at the second stage. Interestingly, for this particular case, the residuals from the first stage of the classical model fit satisfy the normality assumption (ShapiroWilk normality test: p=0.6974) but those from the second stage only marginally pass the normality test (ShapiroWilk normality test: p=0.0437). A quick look at the QQ plot of these residuals reveals two residuals that deviate substantially from the equality line (plot not shown). The robust method did not, however, assign any observation a weight smaller than 0.5 and and hence we did not analyse the cleaned and processed datasets. Nevertheless, if a less conservative threshold of 0.7, say, were used instead of 0.5, then, the robust method would have flagged one observation for a check genotype and two for one testcross genotype.
Maize dataset
In the first stage (Eq. 3), we consider yield as the response variable, the genotypes as the fixed effects and the trials, the replicates nested within trials and the blocks nested within replicates nested within trials as the random effects. In the second stage (Eq. 4), we consider the adjusted genotypic means estimated in the first stage as the response variable, the intercept as the fixed effect and the genotypes as the random effects with a variancecovariance structure given by the genomic relationship matrix. Note that only the 698 genotypes with available genomic information are submitted to the second stage. In addition, 46 observations of yield (amounting to around 2.6% overall contamination) were identified as outliers by the robust weights computed from the robust firststage model using the raw dataset. Here, all the observations assigned weights of 0.5 or less by the robust model were classified as outliers. Among these 46 outliers, 22 corresponded to 11 genotypes with genomic information, meaning that the second stage for the cleaned raw dataset comprised only 687 (698−11) genotypes. Of the remaining 24 outliers, 18 correspond to 9 genotypes with no genomic information and 6 to 3 hybrid checks and 3 line checks. Overall, 11+9=20 test crosses and 4 of the 6 checks are a subset of the 38 yield observations recommended for removal (deletion) by the breeder during quality control. The robust method identified only 7 observations as outliers from the processed maize dataset (i.e., with 38 missing yield observations). Two of the 7 observations came from one genotyped test cross, 2 hybrid and 3 line checks. Furthermore, two out of these 7 outliers were also identified when the raw dataset was analysed with the robust method. A detailed treatment of outlier detection strategies is beyond the scope of this paper and can be found elsewhere [4, 23, 39, 48].
As with the Rye dataset, the classical and robust approaches produced different results for the full dataset (Table 8; CLS ^{r} and ROB ^{r} results). The differences between the two methods at the first and second stages translate into major differences in the estimated heritability and predictive accuracy. For the cleaned dataset (Table 8; CLS ^{r}* and ROB ^{r}* results) the methods produce more similar estimates of variance components, heritability and predictive accuracy, although estimates are not as close as the ones observed in the case of the rye dataset. In addition, the robust results for the full dataset are close to those obtained via the classical method applied to the cleaned dataset. Note that removal of the outliers was sufficient for the residuals from the first stage but not the second stage of the classical model fit to conform to the normality assumption, in contrast to the results for the rye dataset.
The results for the full dataset after quality control (Table 8; CLS ^{qc} and ROB ^{qc}) are similar to those from the robust method (ROB ^{r}; Table 8) However, the residuals from the classical firststage model fit violate the normality assumption. After removing the 7 outliers from the processed dataset (CLS ^{qc}* and ROB ^{qc}*) the classical and robust approaches produced even more similar results (Table 8) but the residuals still depart from the normality assumption.
As before, Oakey’s heritability estimates are also provided and compared between the full and the cleaned datasets (Table 8).
Discussion
The simulation results showed that the classical and robust approaches perform similarly when datasets are not contaminated and thus conform to the linear model assumptions. This is a desirable property for any method that seeks to be an alternative to the classical method. Since datasets do not usually conform to all model assumptions, we assessed the relative performance of both methods in estimating genetic breeding values, heritability and predictive accuracy, across a range of contamination scenarios with outliers, which tamper mostly with the assumptions of normality and variance homogeneity of the residuals. All the scenarios involved either random or block contamination (mimicking plausible field conditions), and for each contamination type, differed only in the percentage of the observations that were contaminated and the size of the outliers. Also, two weighting schemes were used with each dataset in the second stage of the twostage approach.
The simulations revealed that block contamination has a lesser impact on the estimated parameters than random contamination. Also, the estimated true breeding values improve from the first to the second stage, based on the Pearson correlation coefficient, reaffirming the value of using genomic information in the analyses. In addition, the use of the Smith’s weights produces more consistent parameter estimates from the second stage onwards and is therefore recommended for the twostage approach.
A comparison of the performance of the classical and robust twostage approaches is summarized in Additional file 1: Table S7. In general, the proposed robust method shows a superior performance to the classical approach. In terms of the accuracy of heritability and genomic prediction, the robust approach clearly outperforms the classical for the random contamination scenarios but performs similarly to the classical approach for the block contamination scenarios. Also, method M5 produces more accurate estimates of predictive accuracy of genomic prediction than method M7. Quite surprisingly, the simulations suggest that Oakey’s method is unsuitable for estimating heritability when using a model with a kinship matrix.
Interestingly for the block contamination scenarios, the robust method generally outperformed the classical in both the first and second stages, but this did not translate into higher predictive accuracies. This is likely because the block effect (i.e., effect of blocks within replicates) is completely confounded with the effect of contamination within blocks. As a result, if the block effect is included in the model at the first stage it captures the effect of contamination within the block, yielding an inflated block variance for the classical but not the robust approach. This explains why the performance of the classical approach improves from the first to the second stage. It also emphasizes the need to include a random block effect in the first stage to account for intrablock variance especially when using the classical approach.
A noteworthy observation from the simulations is that if a study design has only two replicates, then the robust or the classical methods cannot identify only one of the two replicate observations as an outlier. Hence, when using an automated cleaning process, one has to discard twice as many observations as the actual number of outliers. This is because given only two replicates, a single outlier results in two large residuals with the same absolute value but opposite signs. This makes it hard to determine which of the two replicates is actually the outlier.
The robust method can also be useful to breeders doing variety testing for which only the firststage model is required. Here, the robust approach had clearly superior performance for the block contamination scenarios. For the random contamination scenarios, except for the blocks within replicates variance, the robust method produced more accurate estimates of the variance components than the classical method did. Moreover, because lategeneration breeding trials typically use only two replicates as breeders aim to maximize the number of different environments, the robust method will merely downweight but not require deleting both replicate observations if it identifies either one or both of them as outliers. This property of the robust method is highly desirable because it enables the plant breeder to obtain genomic predictions for all the target genotypes. By contrast, using the classical method only would result in discarding all the genotypes for which at least one of the two replicate observations is an outlier. This is because it is impossible to determine which of the two replicate observations is the true outlier.
The analysis of the real datasets also furnished some insights into the performance of the methods. For the rye dataset, for example, the 16 outliers identified by the breeders, were also detected by the robust model. Each of the 16 outliers belonged to a distinct block, thus mimicking a random contamination scenario. Also, 8 outliers were observed in the first replicate and the remaining 8 in the second replicate hence resulting in a balanced distribution of outliers at the replicate level. The differences between the results obtained by the classical and robust approaches for the complete dataset are consistent with the ones observed for the simulated random contamination scenarios. Removing the outliers from the data produced a closer agreement between the classical and the robust results but some slight difference from the results produced by the robust approach for the complete dataset still remained. A plausible explanation for this is that removing outliers from the data may: (i) substantially reduce sample size; (ii) alter the distribution of the data and (iii) potentially lead to the underestimation of variances for the cleaned data. This last point precisely matches what is observed for the estimated residuals and markereffect variances in the two empirical data analyses.
The firststage results from the analysis of the full empirical raw maize dataset showed a huge discrepancy in the estimated residual variance component, moderate disagreement in the estimated block variances and similar estimates of replicate and trial variances between the classical and robust approaches. This result is surprising and deviates from expectation based on the results of the analyses of the simulated data sets. A possible explanation for this unexpected result may relate to the difference in the models fitted to the simulated data and the empirical maize data set and the nature of the outliers. In this case, the 46 observations removed were unevenly spread across 17 out of the 20 blocks (of size 90) and amount to 3% of all the data. Two of these 17 blocks had approximately 8−9% contamination. The criterion used to identify the outliers in the maize data set was the robust weights computed from the robust model fit and was somewhat conservative as only the observations assigned weights equal to 0.5 or less were flagged. This criterion, when applied to the rye dataset, correctly identified the 16 outliers that had already been identified by the breeders. However, for the maize dataset this approach to outlier identification is probably too restrictive because the distribution of the residuals from the classical firststage model fit to the cleaned dataset satisfied the normality assumption but the residuals from the classical secondstage model fit did not. This observation reinforces the view that successfully cleaning the data to eliminate outliers prior to analysis, plus satisfactorily addressing the drawbacks listed above can be exceedingly challenging. Of the 38 yield observations replaced with missing values as recommended by the breeder based on quality control, 24 were identified by the robust method as outliers based on the analysis of the raw dataset and consisted of either negative or zero yield values, which are evidently anomalous. The other 14 of the 38 deleted observations were plausible and were not identified as outliers by the robust method. Results of the analysis of the processed maize dataset with 38 missing yield observations set to missing were very similar between the two approaches. In particular, the results are also quite similar to those from the analysis of the raw dataset using the robust method. This finding emphasizes the stability and reliability of the robust approach both in the presence of outliers and missing observations.
Conclusion
We have proposed a robust framework for enhancing the accuracy of genomic prediction in plant breeding applicable to fully replicated designs or replicated genotypes in partially replicated designs. The approach does not apply to fully unreplicated designs or to unreplicated genotypes in partially replicated designs for which it is impossible to detect outliers using linear models. Hence, fully or partially unreplicated designs tradeoff the ability to detect outliers for costeffectiveness of field trials. We emphasize that the ideas underlying the robustness of the proposed approach can be extended to multistage and singlestage approaches. But, because the singlestage approach is characteristically computationally expensive for large or complex data sets, robustifying it would make it even more computationally demanding.
In conclusion, we show not only the advantages of a robust approach to phenotypic data analysis and genomic prediction but also provide new insights into the potential problems associated with using the classical approach to phenotypic data analysis and genotypic prediction in plant breeding. Accordingly, in addition to performing standard data quality controls, plant breeders would do well to seriously consider using these robust methods regularly alongside the classical approach to better minimize the influence of outliers on genomic prediction.
Availability of data and materials
The datasets generated and analyzed during the current study are not publicly available due to being proprietary materials of KWS. These can only be shared upon reasonable request and after KWS express consent. Notwithstanding, an R code to simulate a synthetic dataset, which we call toydata, that includes the simulation of phenotypic data as well as SNP marker data and kinship is provided in the Additional file 4. After generating the toydata the user can perform both the classical and robust twostage approaches by running the R code provided in the Additional file 5.
Abbreviations
 CLS:

Classical
 EBVs:

Estimated breeding values
 GP:

Genomic prediction
 MME:

Mixed model equations
 MSD:

Mean squared deviation
 PA:

Predictive accuracy
 ROB:

Robust
 RRBLUP:

Ridge regression best linear unbiased prediction
 SNP:

Singlenucleotide polymorphism
 TBVs:

True breeding values
References
Arslan O, Billor N. Robust Liu estimator for regression based on an Mestimator. J Appl Stat. 2000; 27(1):39–47.
BernalVasquez AM, Utz F, Piepho HP. Outlier detection methods for generalized lattices: a case study on the transition from ANOVA to REML. Theor Appl Genet. 2014; 129(4):787–804.
BernalVasquez AM, Utz HF, et al.Outlier detection methods for generalized lattices: a case study on the transition from ANOVA to REML. Theor Appl Genet. 2016; 129(4):787–804.
Cerioli A, Farcomeni A, Riani M. Robust distances for outlierfree goodnessoffit testing. Comput Statist Data Anal. 2013; 65:29–45.
Chi EC, Scott DW. Robust Parametric Classification and Variable Selection by a Minimum Distance Criterion. J Comput Graph Stat. 2014; 23(1):111–128.
Copt S, Feser V. Highbreakdown inference for mixed linear models. J Am Stat Assoc. 2006; 101:292–300.
Croux C, Dehon C. Estimators of the multiple correlation coefficient: local robustness and confidence intervals. Stat Pap. 2003; 44(3):315–334.
Demidenko E. Mixed Models: Theory and Applications. Hoboken: John Wiley & Sons; 2004.
Estaghvirou SBO, Ogutu JO, SchulzStreeck T, Knaak C, Ouzunova M, Gordillo A, Piepho HP. Evaluation of approaches for estimating the accuracy of genomic prediction in plant breeding. BMC Genomics. 2013; 14:860.
Estaghvirou SBO, Ogutu JO, Piepho HP. Influence of outliers on accuracy and robustness of methods for genomic prediction in plant breeding. G3. 2014; 4:2317–28.
Estaghvirou SBO, Ogutu JO, Piepho HP. How genetic variance, number of genotypes and markers influence estimates of genomic prediction accuracy in plant breeding. Crop Sci. 2015; 55(5):1911–24.
Henderson CR. Comparison of alternative sire evaluation methods. J Anim Sci. 1975; 41:760–70.
Hoerl AE, Kennard RW. Ridge Regression: Biased estimation for nonorthogonal problems. Technometrics. 1970; 8:27–51.
Hoerl AE, Kennard RW, Baldwin KF. Ridge Regression: Some Simulations. Commun Stat Theory Methods. 1975; 4:105–23.
Holland PW. Weighted Ridge Regression: Combining Ridge and Robust Regression Methods. NBER Working Paper Series. 1973. Working Paper No.11.
Huber PJ. Robust estimation of a location parameter. Ann Math Stat. 1964; 35:73–101.
Huber PJ. Robust statistics: a review. Ann Math Stat. 1972; 43:1041–67.
Koller M, Stahel WA. Sharpening WaldType Inference in Robust Regression for Small Samples. Comput Stat Data Anal. 2011; 55(8):2504–15.
Koller M. Robust estimation of Linear Mixed Models. PhD Thesis. 2013. http://ecollection.library.ethz.ch/eserv/eth:6670/eth667002.pdf.
Koller M. robustlmm: Robust Linear Mixed Effects Models. R package version 2.1. 2015. http://CRAN.Rproject.org/package=robustlmm.
Koller M. robustlmm: An R Package for Robust Estimation of Linear MixedEffects Models. J Stat Softw. 2016; 75(6):1–24.
Lourenço VM, Pires AM, Kirst M. Robust linear regression methods in association studies. Bioinformatics. 2011; 27(6):815–21.
Lourenço VM, Pires AM. Mregression, false discovery rates and outlier detection with application to genetic association studies. Comput Stat Data Anal. 2014; 78:33–42.
Lourenço VM, Rodrigues PC, Pires AM, Piepho HP. A robust DFREML framework for variance components estimation in genetic studies. Bioinformatics. 2017; 33(22):3584–94.
Maronna RA, Martin DR, Yohai VJ. Robust Statistics. Chichester: Wiley; 2006.
Maronna RA. Robust Ridge Regression for HighDimensional Data. Technometrics. 2011; 53(1):44–53.
Midi H, Zahari M. Estimators in the Presence of Outliers and Multicollinearity. Jurnal Teknologi. 2007; 47(C):59–74.
Mrode RA, Thompson R. Linear Models for the Prediction of Animal Breeding Values, 2nd Edition. Wallingford; 2005.
Oakey H, Verbyla A, Pitchford W, et al.Joint modelling of additive and nonadditive genetic line effects in single field trials. Theor Appl Genet. 2006; 113:809–19.
Pen̋a D, Yohai VJ. A fast procedure for outlier diagnostics in large regression problems. J Am Stat Assoc. 1999; 94:434–45.
Petersen RG. Agricultural field experiments/design and analysis. New York: Marcel Dekker; 1994.
Piepho HP, Möhring J. Computing heritability and selection response from unbalanced plant trials. Genetics. 2007; 177:1881–8.
Piepho HP, Möhring J, Melchinger AE, Büchse A. BLUP for phenotypic selection in plant breeding and variety testing. Euphytica. 2008; 161:209–28.
Piepho HP. Ridge regression and extensions for genomewide selection in maize. 2009; 49:1165–76.
Piepho HP, Möhring J, SchultzStreeck T, Ogutu JO. A stagewise approach for the analysis of multienvironment trials. Biom J. 2012a; 54:844–60.
Piepho HP, Ogutu JO, SchultzStreeck T, Estaghvirou B, Gordillo A, Tchenow F. Efficient computation of RidgeRegression Best Linear Unbiased Prediction in genomic selection in plant breeding. Crop Sci. 2012b; 52:1093–104.
Pinheiro JC, Bates DM. MixedEffects Models in S and SPLUS. 2000.
Pinheiro JC, Bates DM, DebRoy S, Sarkar D, R Core Team. nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1–117. 2014. http://CRAN.Rproject.org/package=nlme.
Rocke DM, Woodruff DL. Identification of Outliers in Multivariate Data. J Am Stat Assoc. 1996; 91:1047–61.
RodríguezÁlvarez MX, Boer MP, Van Eeuwijk FA, Eilers PH. Correcting for spatial heterogeneity in plant breeding experiments with Psplines. Spat Stat. 2018; 23:52–71.
Rodrigues PC, Monteiro AF, Lourenço VM. A Robust AMMI model for the analysis of genotypebyenvironment data. Bioinformatics. 2016; 32(1):58–66.
SchulzStreeck T, Ogutu JO, Piepho HP. Comparisons of singlestage and twostage approaches to genomic selection. J Theor Appl Genet. 1996; 126:69–82.
Searle SR. Linear models. New York: Wiley; 1971.
Searle SR, Casella G, McCulloch CE. Variance Components. 1992.
Silvapulle MJ. Robust ridge regression based on an Mestimator. Aust J Stat. 1991; 33(3):319–33.
Smith A, Cullis BR, Gilmour A. The analysis of crop variety evaluation data in Australia. Aust N Z J Stat. 2001; 43:129–45.
Smith AB, Cullis BR, Thompson R. The analysis of crop cultivar breeding and evaluation trials: an overview of current mixed model approaches. J Agric Sci. 2005; 143(6):449–62.
Tanaka E. Simple robust genomic prediction and outlier detection for a multienvironmental field trial. arXiv preprint arXiv:1807.07268. 2018.
Tukey JW. A survey of sampling from contaminated distributions In: Olkin I, et al., editors. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford: Stanford University Press: 1960. p. 448–85.
Utz HF. PlabStat: A computer program for statistical analysis of plant breeding experiments. Institute of Plant Breeding, Seed Science and Population Genetics. 2011. University of Hohenheim, D70593 Stuttgart, Germany.
VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008; 91:4414–23.
Zahari SM, Zainol MS, AlBanna MI, Ismail B. Weighted Ridge MMEstimator in Robust Ridge Regression with Multicollinearity. In: Proceedings of Mathematical Models and Methods in Modern Science. World Scientific and Engineering Academy and Society: 2012.
Acknowledgements
We thank KWS LOCHOW GMBH for providing the rye dataset and KWS for providing the maize dataset. We also thank Paul Schmidt for the helpful discussion regarding Oakey’s heritability results.
Funding
This work received partial financial support from Portuguese National Funds by the Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) through Projects PTDC/MATSTA/0568/2012, UID/MAT/00297/2019 and UIDB/00297/2020 (CMA: Centro de Matemática e Aplicações) and Sabbatical Grants SFRH/BSAB/105935/2014 and SFRH/BSAB/142919/2018. VML, HPP & JOO acknowledge partial support by the Conselho de Reitores das Universidades Portuguesas (CRUP) and the Deutscher Akademischer Austauschdienst (DAAD) through the LusoGerman Integrated Actions Project PT/A13/17DE/57339863. HPP acknowledges support by the German Research Foundation (DFG, Grant PI 377/181). JOO was supported by the German Research Foundation (DFG, Grant OG 83/11 & OG 83/12).
Author information
Authors and Affiliations
Contributions
VML and HPP conceived the project. VML and JOO wrote the R code, performed the simulations and the analyses. VML wrote the manuscript with input from JOO and HPP. HPP supervised the project. All authors read and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
This file contains additional Tables S1,...,S7. First and secondstage results are reported together.
Additional file 2
This file contains additional Figures S1,...,S20. Results reported in these figures are organized from the first stage to the third stage.
Additional file 3
This file contains a discussion of the differences between the secondstage results obtained using either the robust Smith’s weights or the doubly robust Smith’s weights.
Additional file 4
This file contains the R code needed to generate phenotypic data as well as SNP marker data and kinship, computed from an alphadesign as the one of the maize dataset used in the simulations.
Additional file 5
This file contains the R code needed to run both the classical and robust twostage approaches.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lourenço, V.M., Ogutu, J.O. & Piepho, HP. Robust estimation of heritability and predictive accuracy in plant breeding: evaluation using simulation and empirical data. BMC Genomics 21, 43 (2020). https://doi.org/10.1186/s128640196429z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128640196429z
Keywords
 Genomic prediction
 Predictive accuracy
 Heritability
 SNPs
 Robust estimation