Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data

Background: The accurate prediction of genomic breeding values is central to genomic selection in both plant and animal breeding studies. Genomic prediction involves the use of thousands of molecular markers spanning the entire genome and therefore requires methods able to efficiently handle high-dimensional data. Not surprisingly, machine learning methods are becoming widely advocated for and used in genomic prediction studies. These methods encompass different groups of supervised and unsupervised learning methods. Although several studies have compared the predictive performances of individual methods, studies comparing the predictive performance of different groups of methods are rare. However, such studies are crucial for (i) identifying groups of methods with superior genomic predictive performance and (ii) assessing the merits and demerits of such groups of methods relative to each other and to the established classical methods. Here, we comparatively evaluate the genomic predictive performance and informally assess the computational cost of several groups of supervised machine learning methods, specifically regularized regression methods and deep, ensemble and instance-based learning algorithms, using one simulated animal breeding dataset and three empirical maize breeding datasets obtained from a commercial breeding program.

Results: Our results show that the relative predictive performance and computational expense of the groups of machine learning methods depend on both the data and the target traits, and that for classical regularized methods, increasing model complexity can incur substantial computational costs without necessarily improving predictive accuracy. Thus, despite their greater complexity and computational burden, neither the adaptive nor the group regularized methods clearly improved upon the results of their simple regularized counterparts. This rules out selecting a single machine learning method for routine use in genomic prediction. The results also show that, because of their competitive predictive performance, computational efficiency, simplicity and relatively few tuning parameters, the classical linear mixed model and regularized regression methods are likely to remain strong contenders for genomic prediction.

Conclusions: The dependence of predictive performance and computational burden on the target datasets and traits calls for increasing investment in enhancing the computational efficiency of machine learning algorithms and computing resources.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-023-09933-x.


1 Ridge Regression (RR)-BLUP model used to estimate variance components for the KWS real maize data

The linear mixed effects model used for the phenotypic analysis of the KWS real maize dataset, which is described in greater detail in [14, 15], is

$$y_{ijklqmn} = \phi + \gamma_q + t_{lq} + r_{klq} + b_{jklq} + \delta_m + \tau_n + Z_{1im} g_{1im} + Z_{2im} g_{2im} + e_{ijklqmn} \quad (1.1)$$

where
• $y_{ijklqmn}$ is the quantitative response (maize yield) of the $i$-th genotype in the $j$-th block nested within the $k$-th replicate in the $l$-th trial in the $q$-th location and in the $m$-th group tested against the $n$-th tester
• $\phi$ is the general mean effect
• $\gamma_q$ is the random effect of the $q$-th location, which is iid $N(0, \sigma^2_{\gamma})$, where $\sigma^2_{\gamma}$ is the location variance
• $t_{lq}$ is the random effect of the $l$-th trial nested within the $q$-th location, which is iid $N(0, \sigma^2_t)$, where $\sigma^2_t$ is the trial variance
• $r_{klq}$ is the random effect of the $k$-th replicate nested within the $l$-th trial in the $q$-th location, which is iid $N(0, \sigma^2_r)$, where $\sigma^2_r$ is the replicate variance
• $b_{jklq}$ is the random effect of the $j$-th block nested within the $k$-th replicate in the $l$-th trial in the $q$-th location, which is iid $N(0, \sigma^2_b)$, where $\sigma^2_b$ is the block variance
• $\delta_m$ is the fixed effect of the $m$-th group of checks, testers and genotypes
• $\tau_n$ is the fixed effect of the $n$-th tester
• $Z_{1im} g_{1im}$ is the random effect of the $i$-th genotyped line in the $m$-th group, coded as (Z1*TESTER*GRP*G1), where Z1 is the quantitative variable $Z_{1im}$ (SWITCH1) in model (1.1), and tester (TESTER), group (GRP) and genotype (G1) are categorical variables
• $Z_{2im} g_{2im}$ is the random effect of the $i$-th non-genotyped line in the $m$-th group, coded as (Z2*TESTER*GRP*G2), where Z2 is the quantitative variable $Z_{2im}$ (SWITCH2) in model (1.1), and tester (TESTER), group (GRP) and genotype (G2) are categorical variables
• $e_{ijklqmn}$ is the residual plot error, which is iid $N(0, \sigma^2_e)$, where $\sigma^2_e$ is the error variance

We further note that:
• Variable G1 has one unique level for each of the genotyped lines and assigns one of these levels to all the non-genotyped lines. Hence G1 = G2 for all the genotyped lines.
• Variable G2 has one unique level for each genotyped line and one unique level for each non-genotyped line.
• SWITCH1 is a dummy variable equal to 1 for genotyped lines and 0 for non-genotyped lines.
• SWITCH2 is a dummy variable equal to 0 for genotyped lines and 1 for non-genotyped lines.
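For illustration, below is a minimal R sketch of how the variance components of a model of this form could be estimated with the lme4 package. The data frame and all column names (dat, yield, loc, trial, rep, block, grp, tester, g1, g2, switch1, switch2) are hypothetical stand-ins for the variables defined above, and the random-effect coding is only an approximation of the (Z1*TESTER*GRP*G1) and (Z2*TESTER*GRP*G2) terms; the authors' actual software and syntax may differ.

```r
library(lme4)

# 'dat' is a hypothetical data frame with one row per plot; switch1 and
# switch2 are the 0/1 dummy variables SWITCH1 and SWITCH2 defined above.
fit <- lmer(
  yield ~ grp + tester +                 # fixed: group and tester effects
    (1 | loc) +                          # location
    (1 | loc:trial) +                    # trial nested within location
    (1 | loc:trial:rep) +                # replicate nested within trial
    (1 | loc:trial:rep:block) +          # block nested within replicate
    (0 + switch1 | tester:grp:g1) +      # genotyped-line effect, cf. Z1*TESTER*GRP*G1
    (0 + switch2 | tester:grp:g2),       # non-genotyped-line effect, cf. Z2*TESTER*GRP*G2
  data = dat
)

VarCorr(fit)  # estimated variance components
```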

2 Noteworthy details of model fitting
Because we used CV for model selection, we fixed the data split whenever possible by setting a specific positive seed before splitting the data, to enhance the reproducibility of the results. Besides the k-fold split, which is common to all the methods, some methods (RF, SVM and FFNN) involve additional kinds of randomization, making it impossible to reproduce their results exactly even if the data split or the seed for the random number generator is fixed prior to fitting the model. This happens because some of the routines internally generate random number seeds when they initialize their computations, introducing an additional source of variation between results obtained from different runs or even from different computing platforms. Ideally, for such cases, and at the risk of vastly increasing the computational burden, the model could be fitted a large number of times until the average and the range (or standard error) of the PA stabilize. Even so, we considered only single model runs for the RF and SVM methods. For the FFNN method we implemented 1000 runs for the simulated data and 10 for the KWS data.
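As a concrete illustration of fixing the data split, a minimal R sketch follows; the seed value and object names are hypothetical.

```r
set.seed(1234)  # hypothetical seed; fixing it makes the fold assignment reproducible

n <- length(y)                                     # number of observations
k <- 5                                             # number of CV folds
folds <- sample(rep(seq_len(k), length.out = n))   # one fixed k-fold assignment

# Training and validating every method on the same folds ensures that
# differences in PA reflect the methods rather than the data split.
```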
In addition, the following details of model fitting are noteworthy (illustrative code sketches for items (i)-(iv) are given after this list): (i) γ = 0.5 was used in the bridge and group bridge methods.
(ii) Calibrating random forests involves selecting the number of trees to grow (n_trees), the number of covariates randomly selected for growing each tree (n_covars) and the minimum size of the terminal nodes per tree (n_nodes), below which no split is attempted. The first two parameters affect the final accuracy the most. Increasing n_trees improves accuracy only up to a point but can substantially increase the computational time. Here, we fixed n_trees = 1000 (which ensures that every input row gets predicted at least a couple of times) and n_nodes = 1, and searched for the best value of n_covars in {0.5, 1, 2} × (p/3).
(iii) For the SGB method we assumed the Gaussian distribution, i.e., minimization of the squared-error loss. The basic boosting algorithm requires the specification of two parameters: the number of splits per tree (J; the number of nodes equals the number of splits plus one) and the number of trees (or iterations; n_trees) to be fitted. Hastie et al. [25] point out that a number of splits 4 ≤ J ≤ 8 generally works well, with results being fairly insensitive to particular choices within this range. We used J = 6. As for n_trees, it should be neither too small (underfitting) nor too large (overfitting). A suitable n_trees can usually be found by checking how well the model fits a test dataset, where a typical fraction of the data used for testing is 0.5 (this can be much smaller if the dataset is very large). We searched for n_trees in {500, 1500, 3000}. As with RF, we fixed the minimum size of the terminal nodes per tree at n_nodes = 1. In addition, we set the shrinkage factor applied to each tree in the expansion to 0.001 and the subsampling fraction to 0.5 [19].
(iv) We used the Adam optimizer, the ReLU (rectified linear unit) activation function and a linear output layer in the configuration of both the FFNN and the CNN.
Because we used the Python software and a GPU to fit the neural networks, we were able to produce 1000 different runs of the FFNNs and CNNs for the simulated dataset and 10 runs of the FFNNs for each of the three real datasets (amounting to 10 runs × 5 folds × 10 replicates = 500 FFNN fits for each real dataset), and we report the average and the range of the PA.
(v) For the 1D CNN, we fixed most of the configuration parameters across the three traits; the number of epochs and the learning rate were calibrated individually for each trait (Table S7).
(vi) For the simulated animal trait T3, specifically with the CNN method, we applied the transformation T3 → T3 × 10^3. This was necessary because the CNN did not generate reasonable results with the original trait values, which are exceptionally small and close to zero. The prediction errors were subsequently back-transformed to the original scale and reported accordingly.
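Regarding (i), the following is a minimal sketch of a group bridge fit with γ = 0.5 using the grpreg package; the grouping vector grp, which assigns each marker column of X to a group, is a hypothetical placeholder.

```r
library(grpreg)

# X: n x p marker matrix; y: phenotype vector; grp: hypothetical group labels
fit <- gBridge(X, y, group = grp, family = "gaussian", gamma = 0.5)
```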
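Regarding (ii), a sketch of the random forest calibration with the randomForest package; for brevity, the best n_covars value is chosen here by out-of-bag (OOB) error rather than by the k-fold CV used in the paper.

```r
library(randomForest)

p <- ncol(X)
mtry_grid <- round(c(0.5, 1, 2) * p / 3)   # candidate values of n_covars

fits <- lapply(mtry_grid, function(m)
  randomForest(x = X, y = y, ntree = 1000, mtry = m, nodesize = 1))

oob_mse <- sapply(fits, function(f) tail(f$mse, 1))  # OOB error after 1000 trees
best_fit <- fits[[which.min(oob_mse)]]
```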
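Regarding (iii), a sketch of the SGB fit with the gbm package; mapping the fixed terminal-node size n_nodes = 1 to gbm's n.minobsinnode argument is our assumption.

```r
library(gbm)

fit <- gbm(y ~ ., data = data.frame(y = y, X),
           distribution = "gaussian",   # squared-error loss
           n.trees = 3000,              # largest value of the {500, 1500, 3000} grid
           interaction.depth = 6,       # J = 6 splits per tree
           n.minobsinnode = 1,          # assumed mapping of n_nodes = 1
           shrinkage = 0.001,           # shrinkage factor per tree
           bag.fraction = 0.5,          # subsampling fraction
           cv.folds = 10)

best_ntrees <- gbm.perf(fit, method = "cv")  # CV-optimal number of trees
```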
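Regarding (iv), a minimal sketch of the stated network configuration (Adam optimizer, ReLU activations, linear output). The authors fitted the networks in Python; this sketch uses the keras R interface purely for illustration, and the hidden-layer sizes are hypothetical.

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = ncol(X)) %>%           # hypothetical hidden layer
  layer_dense(units = 32, activation = "relu") %>% # hypothetical hidden layer
  layer_dense(units = 1)                           # linear output for regression

model %>% compile(optimizer = optimizer_adam(),    # Adam optimizer
                  loss = "mse")                    # squared-error loss
```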
All the methods are implemented in the R software and are available in various R packages. For the FFNN and CNN methods, however, we used the Python software because of their fine-tuning requirements. Table S1 lists the R and Python packages we used to analyse the synthetic and real datasets.
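For reference, the accuracy measures reported in the supplementary tables are straightforward to compute; a minimal sketch with hypothetical vectors tbv (true or observed breeding values) and pbv (predicted breeding values):

```r
# PA as defined in the table captions: Pearson correlation between
# true/observed and predicted breeding values
pa <- cor(tbv, pbv, method = "pearson")

# mean squared and mean absolute prediction errors, as also reported
mspe <- mean((tbv - pbv)^2)
mape <- mean(abs(tbv - pbv))
```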

3 Supplementary Tables

Table S3: Prediction accuracy (PA) of the group regularized methods (mean and range values of PA across the different groupings), computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where T1-T3 refer to three quantitative milk traits. The choice of λ was based on 10-fold CV. The displayed values are the mean, maximum and minimum PA across all 10 grouping schemes. The mean squared and absolute prediction errors (MSPE, MAPE) are also provided. * The groupings here are achieved through interactions between markers, unlike in the previous methods; a single PA value is produced and reported, which is not a mean across the different groups.

⋆ MSPE is multiplied by 10^3 and MAPE by 10^1 to enhance comparison with the corresponding values for T1 and T2; five decimal places were needed for cMCP.

Table S1: List of R and Python packages used in this paper. ** These functions (e.g., msaenet()) allow for internal parallelization of computations.

Table S2: Prediction accuracy (PA) of the regularized, adaptive regularized and Bayesian regularized methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where T1-T3 refer to three quantitative milk traits. The choice of λ, where applicable, was based on 10-fold CV. The mean squared and absolute prediction errors are also provided. MSPE is multiplied by 10^3 and MAPE by 10^1 to enhance comparison with the corresponding values for T1 and T2.

Table S4: Prediction accuracy (PA) of the ensemble and instance-based methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where T1-T3 refer to three quantitative milk traits. * The best number of trees grown was 3000 for all traits; ‡ reported values refer to a single run of the SVM with the best cost λ = 100 for trait 1 and λ = 10 for traits 2 and 3; ⋆ MSPE is multiplied by 10^3 and MAPE by 10^1 to enhance comparison with the corresponding values for T1 and T2.

Table S5: Prediction accuracy (PA) of the deep learning methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where T1-T3 refer to three quantitative milk traits. ⋆ MSPE is multiplied by 10^3 and MAPE by 10^1 to enhance comparison between methods.

Table S7: Best CNN model calibration parameters (number of epochs / learning rate) selected for each of the three quantitative milk traits T1-T3.

Table S8: Predictive ability (PA; mean and range values computed across the 5-fold validation datasets and 10 replicates) of the regularized, adaptive regularized, group regularized, Bayesian regularized, ensemble, instance-based and deep learning methods, computed as the Pearson correlation coefficient between the observed breeding values (OBVs) and the predicted breeding values (PBVs), for the KWS datasets. The choice of λ, where applicable, was based on 4-fold CV.