The importance of phenotypic data analysis for genomic prediction - a case study comparing different spatial models in rye

Background Genomic prediction is becoming a daily tool for plant breeders. It makes use of genotypic information to make predictions used for selection decisions. The accuracy of the predictions depends on the number of genotypes used in the calibration; hence, there is a need of combining data across years. A proper phenotypic analysis is a crucial prerequisite for accurate calibration of genomic prediction procedures. We compared stage-wise approaches to analyse a real dataset of a multi-environment trial (MET) in rye, which was connected between years only through one check, and used different spatial models to obtain better estimates, and thus, improved predictive abilities for genomic prediction. The aims of this study were to assess the advantage of using spatial models for the predictive abilities of genomic prediction, to identify suitable procedures to analyse a MET weakly connected across years using different stage-wise approaches, and to explore genomic prediction as a tool for selection of models for phenotypic data analysis. Results Using complex spatial models did not significantly improve the predictive ability of genomic prediction, but using row and column effects yielded the highest predictive abilities of all models. In the case of MET poorly connected between years, analysing each year separately and fitting year as a fixed effect in the genomic prediction stage yielded the most realistic predictive abilities. Predictive abilities can also be used to select models for phenotypic data analysis. The trend of the predictive abilities was not the same as the traditionally used Akaike information criterion, but favoured in the end the same models. Conclusions Making predictions using weakly linked datasets is of utmost interest for plant breeders. We provide an example with suggestions on how to handle such cases. Rather than relying on checks we show how to use year means across all entries for integrating data across years. It is further shown that fitting of row and column effects captures most of the heterogeneity in the field trials analysed. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-646) contains supplementary material, which is available to authorized users.


Background
Genomic prediction (GP) was first introduced in 2001 [1] as a method that allows the prediction of genomic estimated breeding values (GEBV) for plants and animals by using information of genetic markers. In plant breeding, GP has been adopted as another stage of the breeding scheme [2], not diminishing the importance of the phenotypic analysis usually carried out in several environments. Merging the phenotype and the genotype analyses has been addressed through the so-called stage-wise analysis [3]. In the first stage environments are analysed separately and genotype means are computed and then submitted in the GP stage to predict GEBV based on dense genetic markers such as single nucleotide polymorphisms (SNPs).
In plant breeding, assessing genotypic adaptability and stability, and predicting breeding values of the genotypes http://www.biomedcentral.com/1471-2164 /15/646 in other environments and other years, makes use of multi-environment trials (METs), which aim to evaluate as many genotypes as possible in as many as possible locations [4][5][6][7]. These METs are typically laid out as generalised lattice designs testing a large number of different genotypes per trial. The number of tested genotypes is limited by factors such as seed production, production cycle length and availability of physical resources, e.g. land and budget [8].
Within years, genotypes are tested in series of trials, which are connected by checks. Checks are lines grown in every trial as controls because their performance is known and/or they are already commercial material. Checks can be also used to connect years. In the rye breeding program considered in this paper, a completely different set of genotypes is tested in each year, but these genotypes are from the same breeding population. The accuracy of a genomic prediction model depends on the number of genotypes used for calibration. So there is definitely a need to combine data across years. Low connectivity across years is a challenge when trying to combine data across years, and this is one main motivation for this paper. Furthermore, the unbalancedness due to the design layout and the different and large number of evaluated genotypes increases the heterogeneity introducing high complexity to the variance-covariance structure among adjusted genotype means [3].
Analysis of METs could be done as single-stage analysis, modelling the complete observed data at the level of individual plots, or using a stage-wise approach, where experiments are analysed first at the level of environments (or trials), obtaining adjusted means per genotype, which are then summarised across environments (or trials) in the next stage [3]. A single-stage analysis accounts entirely for the variance-covariance structure of the recorded observations [6], therefore it is regarded as the gold standard. However, it has been shown that in a stage-wise analysis, a loss of information occurring in the transition through stages can be minimized by an appropriate weighting scheme [9].
If feasible, a single-stage approach is preferable to a stage-wise analysis [10]. Nevertheless, the latter is acceptable for GP, since it is simple, computationally more efficient and also allows to easily account for any specifics of randomisation layout and error modelling for each environment [3]. It should be stressed, however, that in a stage-wise analysis the weights are chosen to approximate the variance-covariance matrix of adjusted means from previous stages. We used here a three-stage approach and compared different spatial correlation structures in the first stage to correct field heterogeneity at the trial level.
Spatial error models may provide more accurate estimates of genotype effects than models not accounting for spatial adjustment [11,12] but they are computationally more demanding and convergence may be difficult to reach. Any effort in terms of improving the genomic predictions would include checking if these improved estimates have an effect on the predictive ability when markers are added to the model. The performance of alternative spatial models can be assessed by k-fold cross validation (CV).
Similarly, the merits of different spatial models used to compute adjusted means in the first stage can be compared by the same CV procedure, if the same GP procedure is used for each analysis. This suggests that genomic prediction-cross validation (GP-CV) can be used to identify the best-fitting mixed model in stage one. The common method of model selection makes use of information criteria based on the log likelihood, e.g. the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [13]. When the restricted maximum likelihood (REML) method is used, models can only be compared by information criteria if they have the same fixed effects; otherwise, the maximum likelihood (ML) method should be used [13]. CV is, in this sense, not used to tune parameters as in many penalization methods (e.g. adaptive Lasso, SCAD (Smoothly Clipped Absolute Deviation), machine learning methods) but only as a tool to compare models that use REML. REML is considered the best available method of variance parameter estimation, preferable to ML [14]. Consequently, it is of interest to devise model selection procedures that can use REML and also can compare models with different fixed effects. GP-CV has already been used to judge environments in order to optimise the accuracy in GP [15]. We used this tool here as model selection method in comparison to the traditional use of AIC.
The aims of this work were: i) to assess the advantage for the predictive ability when using a spatial model for phenotypic analysis, ii) to compare stage-wise approaches for GP when the data are weakly connected across years, and iii) to compare AIC and GP-CV as methods of selection of models for phenotypic data analysis towards GP in rye.

Field layout and data set
A commercial rye breeding program by KWS-LOCHOW established in Poland and Germany aims to develop superior hybrid varieties for the seed market. The implementation of GP within the breeding program makes use of the measurements of hybrid performance of the first cycles of phenotypic evaluation of the material (Cycle1). Selections made in Cycle1 are intensively evaluated in further cycles, aiming to double-check the selection decisions. For our purposes, these additional cycles do not add much useful information. Hence, we used only the first cycles of the program. The populations tested in each year consist of http://www.biomedcentral.com/1471-2164/15/646 S 2 genotypes, which display genetic relatedness and population stratification due to complex genealogical history [16].
Besides the phenotypic data, a 16K Infinium iSelect HD Custom BeadChip was used to characterise 1610 individuals from Cycle1-2009 and Cycle1-2010 and 6 checks. Several traits were evaluated during this project: grain dry matter yield, plant height and thousand kernel weight, as well as ordinal scores of rust, mildew and lodging among others. In this work we used grain dry matter yield measurements of the phases of selection Cycle1-2009, Cycle1-2010 and Cycle1-2012, and marker information for the genotypes of 2009 and 2010. Although no marker information of year 2012 was available, it makes sense to use this dataset to observe the trend in one additional year and in this way, support the results of the phenotypic analysis of previous years.
A Cycle1 experiment consists of subsets of 320 genotypes from the S 2 populations tested in several locations within each of the two countries involving two testers (Tables 1 and 2). We define a trial as the physical unit within a location, where a subset of genotypes that were testcrossed to the same tester is evaluated. Trials at a location were laid out as α-designs with two replicates. Each trial was randomized independently from the others using the software CycDesign (VSN International; http://www. vsni.co.uk/). (However, we are aware that some breeders tend to use the same randomization layout in several locations. Ideally, each trial should have a different randomization). In our notation, trials of a Cycle1 experiment are labelled as S1, S2, . . . , S24. Row and column Table 1 General representation of the testers by locations (Loc) by years classification of Cycle1 year 2009 and 2010 in Germany (G-L1, · · ·, G-L8) and Poland (P-L1, · · ·, P-L4)
coordinates of the plots to account for spatial variation are available.
Normally throughout the program, only a single tester was used per location and year, but in some locations, some subsets of genotypes were testcrossed with the two available testers. This is the case, for example, for location G-L4 in Cycle1-2009, where the genotypes evaluated in the trials S1, S2 and S3 were testcrossed with both Tester1 and Tester2, and it is also the case of locations P-L1, P-L2, P-L3 and P-L4 evaluating genotypes of trials S7, S8 and S9 with both testers. In each year, four common checks were testcrossed with the testers and grown twice in each trial. Over the years 2009 and 2010 one check was in common and none was shared with 2012 ( Table 3).
The field layout of some trials was not perfectly rectangular. Some trials at a given location and year had fewer blocks but larger size, i.e., there were two different block sizes within a few trials. Blocks were nested within rows of the field layout.
In the genetic dataset, homozygous marker genotypes were coded as -1 and 1, and the heterozygous type, missing values and technical failures were coded as 0. 58.7% of the markers corresponded to homozygous alleles and 16.1% were heterozygous. Only a 0.03% of the markers were recorded as missing values or technical failures; therefore, an imputation method would not have a strong impact on the subsequent analyses. Monomorphic markers and markers with minor allele frequency (MAF) less than 1% or missing information of more than 10% per http://www.biomedcentral.com/1471-2164/15/646 Table 3 Year x Check classification in Germany (G) and Poland (P) 2009 2010 2012 Check13 x marker were dropped. A total of 11285 markers passed the quality test and were used for GP.

Models
In this section we present the models used in the first stage of the analysis and the models of the approaches followed to adjust the year effect either in the second or the third stage. Figures 1 and 2 depict a general scheme that helps visualizing the methodology.

First stage
In the first stage we computed adjusted genotype means by location and year. The factors used for the analysis were genotypes (G), testers (T), trials (S), replicates (R) nested within trials and blocks (B) nested within replicates. We defined a baseline model as where Y hijkv is the observed grain dry matter yield of the h-th genotype testcrossed with the v-th tester in the k-th block within the j-th replicate of the i-th trial, (GT) hv is the effect of the h-th genotype testcrossed with the v-th tester, S i is the effect of the i-th trial S i ∼ N 0, σ 2 S , Datasets generated from 9 spatial and non-spatial models plus two mixed datasets generated from best models given the Akaike information criterion (Mix1) and the predictive abilities (Mix2). Factors in second stage were genotype (G), location (L) and tester (T). M (1) represents the adjusted mean of genotypes across locations and years. M (1) = G × L × T is the shorthand notation for M (1) hsv = G h +L s +T v +(GL) hs +(GT) hv +(LT) sv + (GLT) hsv + e hsv . In the genomic prediction (GP) stage M (2) is the adjusted mean of genotypes across locations, X and β are respectively the design matrix and parameter vector of fixed effects, Z is the n × p marker matrix, u is the p-dimensional vector of SNP effects and e the error vector. Sampling methods in cross validation (CV) were across crosses (AC) and within crosses (WC). The final predictive abilities (ρ) are presented in the ellipses. In model equation (1) we assumed genotypes crossed with testers as a fixed effect to be able to compute genotype adjusted means per tester, whereas the other effects were considered as random effects due to the nested design structure [17]. Table 4 summarises the further models. Some SAS code to fit the first stage models is provided in the supplementary material (Additional file 1). The first model (M1) will be referred to as the baseline model because it was the simplest model and represented the randomisation structure. In the second model (M2) we considered additionally the effects of the o-th row (W ijo ) and the q-th column (V ijq ) both within the j-th replicate of the i-th trial. Subsequently, we added a spatially correlated residual plot effect different from the baseline model, which uses the independent model (ID) with homogeneous variances. We fitted one-and two-dimensional spatial models with and without the so-called nugget, a geostatistical term to designate an independent error effect. As onedimensional models we used the autoregressive AR(1) variance-covariance nested within blocks without nugget (M3) and with nugget (M7), and linear variance LV within blocks with nugget (M4). In the AR(1) we accounted for the correlation between plots in the same block assuming an exponential decay of correlation with distance, whereas by using LV, it is assumed that the covariance among plots in the same block decays linearly with spatial distance [18,19]. The most common extension of the spatial model in two dimensions is the direct product structure AR(1) × AR(1), which assumes that an AR(1) model holds both along rows and along columns [20]. The twodimensional models were fitted along rows and columns Table 4 Spatial and non-spatial models used for the first stage

Label Model
Variance-covariance structure for error Y hijkv is the observed dry matter yield of the h-th genotype testcrossed with the v-th tester in the k-th block within the j-th replicate of the i-th trial, (GT) hv is the effect of the h-th genotype testcrossed with the v-th tester, S i is the effect of the within replicates without nugget (M5), with nugget (M8), adding rows and columns as effects without nugget (M6) and with nugget (M9). The LV model can also be extended in two dimensions [21]; however, for METs, where the arrangement of the plots might not be perfectly rectangular, this LV × LV model was cumbersome to fit with the software we used, thus we did not consider this model.
Note that we use (GT) hv as fixed effect, which is necessary to obtain the genotype by tester means. The purpose is also to recover the information of the entries that are grown in the same locations but using different testers (e.g. in Cycle1 location G-L4 and the Polish locations P-L1 to P-L4), so that we captured the effect of the tester in the shared locations.

Second stage
In the second stage we computed genotype means across locations and testers. This was done either separately for each year (Approach 1) or also averaging across years (Approach 2). The years 2009 and 2010, where molecular marker data were available, were connected through only one check. The resulting fundamental question is then how to fit the year effect. Either the year effect is estimated by the mean of all tested entries (Approach 1) or we rely on the adjustment by the one single check (Approach 2). We assume that genotypes tested in each year can be regarded as a random sample from the same parent population. Based on the structure of the breeding program, this is a realistic assumption that motivates the approaches described in the following.
Both approaches were compared using the M (1) resulting from the analysis of the baseline model in the first stage.

Approach 1: Year-wise analysis
Each year was analysed in the second stage using a threeway interaction model of genotypes (G), locations (L) and testers (T) as factors to obtain adjusted genotypes means of each year. The model was where M (1) hsv represents the adjusted mean of grain dry matter yield of the h-th genotype, testcrossed with the vth tester in the s-th location, G h , L s and T v are the main effects of the h-th genotype, the s-th location and the vth tester, respectively, (GL) hs , (GT) hv and (LT) sv are the two-way interaction effects, (GLT) hsv is the effect of the three-way interaction and e hsv is the residual error associ- the variance of the hsv-th adjusted mean M (1) hsv obtained in the first stage.
Location was considered as random effect L s ∼ N 0, σ 2 L and hence, all the interactions containing this factor are random [17]. The crossed effect of genotypes and testers [(GT) hv ] could have been a fixed effect since genotypes and testers are taken as fixed factors in this stage. However, the crossed effects that include G were taken as random here because the factor genotype was used as random in the GP stage. But note that in the first and the second stage we needed to take genotype main effects as fixed in order to compute adjusted means [3]. Besides, since not every genotype was tested with every tester (e.g. in Cycle1 locations G-L1 to G-L3 and G-L5 to G-L8), we needed to take (GT) hv random to be able to estimate genotype means across levels of testers.
In this approach, the year effect was adjusted in two ways, hereafter referred as to Approach 1a and Approach 1b. Approach 1a used years as fixed factors in the GP stage and Approach 1b used a manual adjustment after the second stage by simply calculating the mean of the genotypes by year M (1) r and subtracting it to each genotype adjusted mean of the corresponding year ( Figure 1). The rationale behind the latter approach is the assumption that the correction for the year effect is better represented by the simple mean of the complete sample of genotypes per year than by just a few checks. The resulting year effect-corrected genotype means M (1 * ) hsv are forwarded to the GP stage, and through CV are evaluated as predictors.
As in the transition from the first to the second stage, there is a loss of information in passing on from the second to the third stage because the (GLT) hsv effect is confounded with the residual error term. This loss can be minimized by weighting the adjusted means [3]. We used the Smith et al. scheme [6], where adjusted means are weighted by the diagonal elements of the inverse of their variance-covariance matrix computed in the first stage.
At this stage, we computed the heritability for each year using the ad hoc method described in Piepho and Möhring [22] as where σ 2 G is the genetic variance andv is the mean variance of a difference of two adjusted genotype means, corresponding to the best linear unbiased estimators (BLUE). Even though this is not the best method to estimate heritability [23], the square root of this heritability estimate gives a rough idea of an upper limit for the predictive abilities. http://www.biomedcentral.com/1471-2164/15/646

Approach 2: Across years analysis
The model to account for the year effect in the second stage through the shared check was where M (1) hrsv represents the adjusted mean of grain dry matter yield of the h-th genotype, testcrossed with the v-th tester, in the s-th location and r-th year, G h is the main effect of the h-th genotype, L s is the main effect of the s-th location and D rv the main effect of the vth tester within the r-th year, which can be extended as D rv = A r + (AT) rv , with A r the effect of the year and T denoting the tester [17]. (GD) hrv , (GL) hs and (LD) rsv are the two-way interaction effects, (GLD) hrsv is the effect of the three-way interaction and e hrsv is the residual error associated to M the variance of the hrsv-th adjusted mean M (1) hrsv obtained in the first stage. The effects containing D rv can be extended as We considered genotypes and testers as fixed factors and location and year as random factors L s ∼ N 0, σ 2 L and A r ∼ N 0, σ 2 A . All effects involving A r are random except (AT) rv because we do not want to recover interyear information since there are only two years and the year by tester classification is very disconnected (years do not share testers). Moreover, the (AT) rv term is analogous to a block factor in an incomplete block design because it is free of G h ; therefore, due to the unbalancedness and the small number of years, we can use it as a fixed effect. Furthermore, the main year effect (A r ) can be dropped considering that the adjustment of the genotype means is the same for A r + (AT) rv as for only (AT) rv .
Including all the effects, the final model (4) is To minimise the loss of information in the transition to the GP stage, we weighted the adjusted means using the inverse of the squared standard errors, which is also appropriate since we are not fitting random block effects [9].

Third stage: Genomic prediction
At the third stage, the dataset of p markers was merged with the n grain dry matter yield adjusted means by years of evaluated models. GP was performed using ridge-regression best linear unbiased prediction (RR-BLUP), where the genotypic values are predicted using the marker information by regressing each SNP on the phenotype [24].
The model was where, M (2) is the n × 1 vector of phenotypic records, here, containing the adjusted means calculated from the second stage, X and β are, respectively, the design matrix and parameter vector of fixed effects, Z is the n × p marker matrix, whose elements z hm represent the SNP genotype of the m-th marker of the h-th genotype entry and take the values −1, 0, or +1 for the aa, Aa, and AA genotypes [24], u is the p-dimensional vector of SNP effects and e is the error vector. The term Zu is interpreted as the genetic effect and its estimate Zû as the GEBV. The GEBV of the h-th genotype corresponds to GEBV h = p m=1û m z hm , with m = 1, · · · , p the number of markers,û m is the estimated effect of the m-th marker and z hm the SNP genotype of the m-th marker for the h-th genotype entry. The assumptions of the model are that the error is normally distributed with zero mean and variance R [e ∼ N(0, R)] and that u has a normal distribution with zero mean and variance R is a diagonal matrix with diagonal elements equal to the inverses of the diagonal elements of the inverse of the original variance-covariance matrix of the adjusted means of the second stage [6]. I p is the pdimensional identity matrix and σ 2 u represents the proportion of the genetic variance contributed by each individual SNP.
Under the model equation (5) the variance of the observed data is var(M (2) ) = σ 2 u + R, in which = ZZ T and Z T denotes the transpose of Z [24]. To speed up the computation, was rescaled by replacing Z with Z/ √ p, with p the number of markers [25].
In the year-wise analysis (Approach 1a), the genotype adjusted means by year are merged in the M (2) vector, and vector β contains the intercept and the year effect. In the across-years analysis (Approach 2), where year effect was already accounted for, M (2) contains the genotype adjusted means and vector β contains only the intercept. In the year-wise analysis correcting genotype adjusted means for year effects (Approach 1b), the model used did not include a fixed year factor (since we had already adjusted for it) but a common intercept, thus the model was the same as for across-years analysis.
To measure the influence of the relationship among the genotypes on the predictions, we used the adjusted means obtained in the second stage and the pedigree information of the entries in a mixed model testing genotypes and crosses as random effects, so that the variances of both effects would give us an estimation of how much the http://www.biomedcentral.com/1471-2164/15/646 variation is attributed to the pedigree, e.g. the crosses. The model was M (2) ah = G h + C a + e ah (6) where M (2) ah is the adjusted mean of the h-th genotype obtained in the second stage, G h is the effect of the hth genotype, C a is the effect of the a-th grand parent (gp) cross, e.g. (gp1 × gp2) × (gp3 × gp4), and e ah the associated error. Additionally, we plotted the relationship heat-map of estimated coefficients of relatedness for individuals based on marker data computed according to Wimmer et al. [26].

Cross validation for model comparison
To evaluate model performance, k-fold CV was carried out. In CV, the data is split into k subsets t times. k − 1 subsets are used as the training set (TS) and the one other subset is the validation set (VS). The TS is used to estimate the parameters that then are used to predict the observations in the VS. The performance of the model was assessed by the Pearson correlation coefficient between the predicted GEBV and the corresponding observations of the VS. This correlation is referred to as predictive ability [23]. As in the first stage, the predictive ability was not adjusted by the square root of the heritability. Although breeding programs are most of the time operating with closely related genotypes, breeders are also interested in knowing the results in a scenario with more distantly related genotypes, for example, using genotypes that share the same grandparents either in the TS or in the VS but not in both. Hence, we wanted to check if accounting for the effect of population structure in the randomisation of CV would make the spatial error models improve the predictive abilities. We chose two scenarios given the relatedness level of the entries and followed the suggested sampling schemes from Albrecht et al. [27], which takes into consideration this fact in the CV procedure. In the first sampling scheme, hereafter called "within crosses" (WC), random sampling is done using all genotypes in the dataset; in the second scheme, hereafter referred to as "across crosses" (AC), genotypes were clustered by cross, so that complete cross-groups were used randomly either in the VS or the TS. There were 349 crosses of different sizes, sharing none, one or two grand parents. The general overview of the methodology is depicted in Figure 2.

Model selection
Two strategies for selecting the best phenotypic model were used in the first stage. In strategy one the best model for all locations is selected, that is, there is no model selection per location but across locations. In strategy two, model selection is location-specific ( Figure 3). For both strategies we computed the AIC and performed genomic prediction-cross validation (GP-CV), both per locationyear combination. To accomplish the GP-CV approach, we used the adjusted means per location and year of all spatial and non-spatial models. Then, means of genotypes by year-location combination were joined with the molecular marker data to perform GP-CV, in which genetic values were regressed on markers and validation of the model was done using k-fold CV. Predictions of unobserved records and predictive abilities of each model were obtained for each year-location combination. We assessed the predictive ability of the models using the Pearson correlation coefficient (ρ) between the predicted GEBV and the observed phenotypic value. Hereafter we denote this predictive ability as ρ-GP-CV. Predictive abilities were not adjusted with the square root of the heritability, as suggested by Dekkers [28], since this adds an extra error due to heritability computation [15,23].
For strategy one (across locations model selection), the number of locations with the best fits (either AIC or Figure 3 General representation of strategies to compare model selection methods. Factors were genotype (G), tester (T), trial (S), replicate (R) and block (B). Grain dry matter yield (Y) is the response variable in the first stage. Y = G · T : S/R/B is the shorthand notation for the model Y hijkv = (GT) hv + S i + R ij + B ijk + e hijkv . Datasets of 9 spatial and non spatial models plus one mixed dataset (Mix1) generated from best models given the Akaike information criterion (AIC) and another mixed dataset (Mix2) generated from best models given the predictive abilities (ρ-GP-CV). http://www.biomedcentral.com/1471-2164/15/646 ρ-GP-CV) was counted, so that the model with the best fits in the majority of locations was identified as the best model. For strategy two (location-specific model selection), two datasets were built: "Mix 1", containing the adjusted means of the locations with the best fit according to the AIC and "Mix 2", containing the adjusted means of the locations with the highest ρ-GP-CV. Thus, after the first stage we had in total eleven data sets of adjusted means, nine corresponding to each tested model from strategy one, plus two more datasets from strategy two: A mixed data set (Mix 1) with the best models per locationyear according to the AIC, and another mixed set (Mix 2) with best models per location-year according to the ρ-GP-CV.

Softwares
All analyses were performed using SAS. Stage 1 and 3 used the MIXED procedure and Stage 2 used PROC HPMIXED. Relationship matrix was calculated using the Synbreed Package [29] for R 2.15.

First stage -strategy 1: Model selection across locations
In the first stage -strategy 1, we did model selection across locations using AIC and predictive abilities (ρ-GP-CV) per location-year combination. According to the AIC, the results favoured the two-dimensional models (Table 5). To do a fair comparison between selection methods using AIC and ρ-GP-CV, we first describe AIC for years 2009 and 2010, for which ρ-GP-CV were also available and then, as additional information, for year 2012, for which ρ-GP-CV was not available since the marker information was missing.
For years 2009 and 2010, M9 and M8 had the majority of best fits across locations. M9 (Baseline + row + column and AR(1) × AR(1) + nugget) resulted in 12 out of 22 cases as the best model. M8 (Baseline and AR(1) × AR(1) + nugget) was best in 7 out of 22 cases. The baseline model + row + column (M2) fitted the best 9% of the times and M6 5% of the times.
A similar tendency was observed in 2012, where 43% of the times (6 out of 14) M9 had the best fit and M8 was best 29% of the times. For this year 2012, models M7, M8 and M9 could not be fitted in some locations. Another third of the times (29%), M2 had best fits. Interestingly, M2 had the best fits in the locations that had convergence problems for models M8 and M9. M1, M3, M4, M5 and M7 never had best fits in any of both groups of years.
The predictive abilities (ρ-GP-CV) per locationyear combination showed a rather different pattern for best models within locations; however, the twodimensional models were also more frequently selected than one-dimensional models ( Table 6). M8 (Baseline and AR(1) × AR(1) + nugget) showed in seven of 22 settings the highest ρ-GP-CV per location-year combination followed by M9 (Baseline + row + column and AR(1) × AR(1) + nugget) with six out of 22 times. The baseline model + row + column (M2) was selected twice and models M3, M4 and M6 had also one, three and three selections out of 22, respectively. M1, M5 and M7 had no best fits at all.
One location of 2009 (P-L3) produced a negative predictive ability for all models. We did not consider this location in the counting of best fits, since a higher negative number is actually a worse fit in regard to predictions, but low or high negative are both interpreted as zero prediction. Despite the negative correlations, this location was included in the mixed datasets produced from the site-specific model selection. We used the adjusted means produced from the baseline model. Another location (G-L1 2009) showed way lower predictive abilities than the rest of the locations. To understand these two situations, we calculated the repeatability of the trait in each location for the baseline model. The repeatability R is defined as the ratio of the between-individual component to the total phenotypic variance [30], which in our case, and following the methodology described by Nakagawa and Schielzeth [31], corresponds to where σ 2 GT is the between-groups variance and corresponds to the variance of the effect (GT) hv fitted as random effect, and in the denominator, the total phenotypic variance given by the sum of the between-groups variance σ 2 GT and the within-groups variances, i.e. replicates within trials σ 2 S + σ 2 R and blocks within replicates σ 2 B plus the residual variance σ 2 e . The interpretation of this repeatability strictly refers to the expected within-group correlations among measurements, i.e. the agreement among measurements; thus, the gist of the definition of repeatability is related to the reproducibility of the absolute values of measurements. A slightly higher repeatability in Cycle1-2009 was observed for location G-L4 (Table 6), which involved more trials, i.e. more genotypes, in comparison with other locations in Germany. The trend in Cycle1-2010 was in favour of the Polish locations, which overal l had more homogeneous and higher repeatabilities. We discuss the relation between repeatabilities and predictive abilities in the next section.

Second stage: fitting genotypes by year vs. across years
From a methodological point of view, fitting the year effect in the GP stage was easier and more direct than accounting for the year effect in the second stage, in the sense that the model for the latter approach became too complex and the variance covariance matrix of adjusted means was not http://www.biomedcentral.com/1471-2164/15/646  Boldfaced entries in the table indicate best model (fit) within location. Empty cells correspond to locations where the model did not converge. In italics, we report the models that converged but the Hessian matrix was not positive definite. § Better than second best model at forth decimal place possible to be produced using the procedure HPMIXED of SAS given the high computer power required. Instead, we computed the adjusted means with corresponding standard errors, which were then used to do the weighting to pass on from the second to the third stage. The adjusted means obtained from the across-years analysis (Approach 2) were plotted against the year effectcorrected genotype adjusted means (from Approach 1b) to compare the difference of adjustments, in the former case based on one single check against the adjustment given the simple mean of the genotypes in each year (Figure 4). Below the two principal lines, an observation corresponding to the shared check across years stood out from the others, reflecting the year adjustment. At first glance, it is clear that the check was the only observation pulled down implying that the year adjustment of this check was not strong enough to pull down the observations of the whole year. Both approaches were examined later using the predictive abilities obtained in the GP stage.

Third stage: genomic prediction
The predictive abilities of the GP stage were taken as the definitive decision criterion for identifying the best strategy for model selection, the best model, and the most reliable approach to account for year effects, and to identify the consequences of population stratification in GP. We start by presenting results of the comparison of the approaches used for fitting the year effect, since with these we only used the baseline model. Then we present the differences between sampling methods for CV together with the comparison of the models and the model selection strategies.

Comparison of approaches to account for year effect in GP
The GP-CV for the approach using the year as a fixed term in the third stage (Approach 1a) yielded a predictive ability of 0.70 (Table 7), whereas predictive ability for the approach accounting for a fixed year effect in the second stage (Approach 2) was 0.74. The predictive ability http://www.biomedcentral.com/1471-2164/15/646 reached 0.68, using the year-effect-corrected adjusted means in the GP-CV (Approach 1b). The scatter plots of GEBV (Zû) against the observed phenotypic values (adjusted means) in the three cases are depicted in Figure 5. In Approach 1a, we plotted the GEBV against the corrected observed phenotypic values, calculated as M (2) − Xβ, where M (2) is the vector of genotype adjusted means obtained in the second stage and Xβ the predicted year effect ( Figure 5A). For Approach 2, the observed phenotypic values M (2) against Zû are shown ( Figure 5B). For Approach 1b, M (2 * ) against Zû are plotted, with M (2 * ) the year-effect-corrected adjusted means of genotypes ( Figure 5C).

Comparison of model selection strategies using different sampling methods in cross validation
Fitting model (6) to measure the influence of the relationship among genotypes on predictions yielded variance components for genotypes, crosses and error for year 2009 of 4.03, 3.67 and 1.66, respectively, and for year 2010 of 4.72, 10.70 and 1.32, respectively. Thus, the cross effect in 2009 is contributing in about 40% and in the next year more than 60% to the total variation explained by the data.
The marker-based relationship heat-map ( Figure 6) shows some clusters among genotypes of the same cross indicating genetic relatedness. The predictive abilities using five times 5-fold CV of datasets resulting from first stage analysis of all spatial and non-spatial models plus the mixed datasets were in general very similar within sampling strategies (Table 7). For the across-crosses (AC) sampling scheme, the predictive abilities were lower than the ones obtained with the within-crosses (WC) sampling scheme. In the AC sampling, we fixed the initial seed of the random number generator used for randomization in the CV procedure at the same value for all models to be able to compare the models when the same crosses were used in the training set.
We compared the models and the sampling methods using a paired t-test (α = 5%) by resembling a randomized complete block design, where the predictive ability of each repetition of the CV was taken as a block, thus accounting for the dependence among observations from the same Table 7 Predictive abilities between observed and predicted values for 9 spatial and non-spatial models (M1, · · · , M9) and mixed datasets using the best locations given the AIC (Mix1) and the ρ-GP-CV per location-year combination (Mix2)  samples (Table 7). For the first sampling method (WC), three groups were identified with some overlaps, but showing not much of a difference among models. From the across-crosses sampling strategy (AC), five groups were distinguished with some overlaps: M2 had the highest predictive ability and models M4, M6 and M7 had the worst predictive abilities. Potential bias of GP is another important element that could be used to compare models. We computed the bias as suggested by [32,33]. The comparison of the biases of all models followed a rather similar trend as the predictive abilities showed in Table 7. We present the analysis of bias as supplementary material (Additional file 2).
The heritability (square root of heritability) for the base- unrealistic assumption of uncorrelated genotypes [23]. We computed the heritability to have a rough idea of how much could we expect from the predictive abilities. The predictive ability divided by square root of heritability is an estimate of the accuracy of GP [23], and the square root of the heritability provides the upper bound for the predictive ability [30], thus one expects that the predictive abilities are not very far from the square root of heritability. In this case, the square roots of the heritabilities are somewhat larger than the corresponding predictive abilities, indicating that the predictions are not sufficiently accurate due to limited data size, thus not exhausting completely the genetic variance. To explore in which extend could have our models explained the variance not captured by the markers, we fitted an additional component accounting for the polygenic effect in the GP stage [24]. The baseline model (M1) yielded a genotypic variance of 2.99; when we incorporated the polygenic effect, the genotypic variance was 2.72 and polygenic variance was 0.36, indicating that about 88% of the total genetic variance was captured by the RR-BLUP model.

Discussion
Selecting the models at the first stage produced different results than assessing them in the third stage. AIC had better scores for the models that used row and column effects, e.g. Models M9, M6 and M2 (Table 5) or M8 that had a 2-dimensional variance-covariance error structure. ρ-GP-CV also picked M8 and M9 (Table 6) but the choices were more spread over the models covering even the baseline model. In general, in the first stage, both AIC and ρ-GP-CV produced better scores for the two-dimensional models, whereas in the third stage the baseline and onedimensional models seemed to be better than the more complex models ( Table 7). The explanation of this pattern may be related to the second stage, where the interaction genotype × location played a role. The two-dimensional models performed very well in modelling heterogeneity within field, but when the means were integrated across the whole experiment, including all locations and years, the two-dimensional spatial error models seemed to overadjust the means, yielding a poorer predictive ability in the GP stage. The one-dimensional spatial error models and the two-dimensional model without spatial error structure were sufficient to estimate appropriately adjusted means. This corroborates Piepho and Williams [21] who concluded that for small portions of a field, a particular spatial model may hold well but if fitted all across the field it may fail. In a wheat experiment, Lado et al. [34] found that using moving averages as covariable significantly improved the predictive abilities of GP. They recognised strong heterogeneous patterns of irrigation in the field, that were not controlled with a single blocking system. Models M1, M3 and M7 were never selected as having the best fits either by AIC or ρ-GP-CV. These models had in common that none of them used rows and columns as additional factors, strengthening the conclusion that row-column designs may have the potential to correctly control field heterogeneity and thus enhance predictive ability of genomic prediction.
Fitting a location-specific error model did not have an advantage over fitting a common model across locations. Neither did the dataset composed of means computed using models have best AIC fits (Mix 1) nor the second dataset containing the means computed using models with highest ρ-GP-CV (Mix 2) produce better predictive abilities in the GP stage.
The models with nugget had better fits than the corresponding baseline model without the nugget. The drawback was that fitting those models was not straightforward, since almost every location required a separate coding specifying initial values and lower boundary constraints on the covariance parameters. Good http://www.biomedcentral.com/1471-2164/15/646 statistical and biological reasons have been presented of why including a nugget to analysis of field experiment is beneficial [35].
If we ignore the two-dimensional spatial models (M5, M6, M8 and M9), the AIC privileges M2 and ρ-GP-CV yields more diverse results with the majority of choices for M2 and M4. In fact, when the spatial component of a resolvable row-column design based on linear variance (LV) does not lead to an improved fit, returning to classical row-column design provides randomisation protection [36].
Williams and Luckett [37] performed studies aiming to find the optimal plot size, the optimal plot arrangements and the best spatial model (the so-called uniformity trials) and showed that in cotton and barley row and column designs are well suited for variety testing in plant breeding trials. Moreover, recent simulation studies from Möhring et al. [38] showed that designs including rows and columns outperformed one-dimensional blocking. In the same work, the authors mention that blocking in the direction of plots with common long sides is preferable, which is common in cereal breeding [39].
We cannot affirm that ρ-GP-CV was better than AIC for model selection or vice versa, nor that the results showed the same trend; but if we would have used either of these two strategies to select the best model, we would have selected the M9 with AIC or M8 with ρ-GP-CV. The GP predictive ability obtained by M2 (Table 7) was slightly better than M8 and M9 (specifically AC sampling method); however, this model (M2) was not highlighted by either of the two selection criteria (AIC or ρ-GP-CV).
In practice, the fact that there were no large statistical differences is good news for the breeders because the baseline model (M1), or even better, the simplest model with row-column adjustment (M2), are appropriate for phenotypic analysis towards GP.
As a model selection method, GP-CV is of interest because it may allow to compare models with different fixed effects, even when REML is used for estimating the variance parameters. No simple recommendation has been reported concerning the best model selection criterion in the case of spatial models [13,40]. Predictive abilities have been used between environments as similarity measure and then to join similar environments into clusters [15]. Thus, in a sense ρ-GP-CV allows giving an interpretation to the environment under scrutiny and the displayed trend do not depart far from the classical AIC. The repeatabilities (R) presented in parallel to the ρ-GP-CV (Table 6) show a low correlation (ρ = 0.36, p-value = 0.0965) with the predictive abilities from the baseline model. In fact, we expected that for location P-L3 of 2009, which had a negative predictability, the R was very low almost zero, but this was not the case; hence we could not conclude that the low predictive ability is mainly due to environmental effects. Riedelsheimer et al. [41] also reported negative predictive accuracies when testing unrelated crosses in the CV procedure and observed that using unrelated crosses could have provided a negative prediction signal due to opposite linkage phases with important QTL displayed in the TS, suggesting that the negative predictive accuracies are associated with the marker pattern.
In this study we explored three ways to adjust the year effect given the weak connectivity across years. Using the single check (Approach 2) to make the year adjustment was not a better choice than adjusting by the simple year mean (Approach 1b) or accounting for the year effect in the GP stage (Approach 1a), even though the estimated predictive ability was the highest. The "year clouds" produced using Approach 2 ( Figure 5B) did not overlap perfectly, from which we concluded that the correction was not appropriate and generated an over-fitting of the markers in the GP-CV procedure due to the fact that markers also predicted the year effect and not the SNPeffects alone. Using the year-mean correction for adjusted means in the second stage (Approach 1b) produced a lower ρ-GP-CV, that, given the overlay of the clouds of predicted vs. observed values, seems to be more realistic. However, fitting the year effect manually, i.e. using ordinary least squares estimation (OLSE) vs. fitting it as a fixed effect in the GP stage, i.e. using generalised least squares estimation (GLSE) can definitively yield a more precise estimate. Indeed, the residual variance in Approach 1b using year effect-corrected adjusted means was around 3.9 (in average for the five replicates) and in Approach 1a using the year fixed effect in the GP stage yielded residual variance of 3.0 (in average for the five replicates). In Approach 1a, where we fitted the year in the GP stage, we removed the year effect from the observed adjusted means derived from the second stage (M (2) − Xβ) to avoid bias of the predictive abilities; however, there would still be some bias because the subtracted year effect was not the true effect but an estimate of the year effect.
Models were eventually assessed and compared using the ρ-GP-CV in the third stage. The two sampling scenarios to perform the CV procedure aimed to recreate the cases where the material was genetically close, with some individuals coming from the same parental cross, and more distantly related to avoid individuals from the same parental cross in the randomisation procedure of CV. This more distantly related material shows some identical-bystate (IBS) similarity, therefore it was not unrelated in the theoretical sense of population genetics. This more distantly related scenario may be seen also as a case where one tries to predict a scenario whose linking information is weak or lacking, e.g. different genotypes and/or locations in the TS and VS [42][43][44]. http://www.biomedcentral.com/1471-2164/15/646 The predictive abilities obtained for GP using WC sampling were located in the middle-high range and using AC sampling, predictive abilities were placed in the middle range. The predictive ability of the AC sampling was significantly lower than WC, as expected for GP of a dataset showing population structure. Riedelsheimer et al. [41] drew similar conclusions using unrelated biparental maize families. They concluded that predictive accuracy could be increased by adding crosses (families) sharing both parents to the TS. In this respect, the use of pedigree and marker information to borrow information from both sources is suggested [44].