### Phenotypic and genomic information

The dataset was provided by Cobb-Vantress Inc. (Siloam Springs, AR), and included 79,367 body weight observations (mean = 2141.8 g and SD = 238.17 g), measured at 6 weeks of age on broilers from a single purebred line. The total number of birds in the pedigree was 342,383, with 680 sires and 6216 dams. All broilers recorded for body weight were genotyped with a 60 k single nucleotide polymorphism (SNP) panel. Genotype quality control was performed using the *PLINK* software [40], with markers excluded based on the following criteria: 1) Located on non-autosomal chromosomes; 2) Unknown map position; 3) Minor allele frequency (MAF) < 0.01; 4) Call rate < 95%; and 5) Departure from Hardy-Weinberg equilibrium with *P* < 10^{− 10}. Subsequently, missing genotypes were imputed using the software *FImpute* [41]. Body weight observations more than 3.5 standard deviations away from the average of their contemporary group were treated as outliers and removed from the analysis. After these editing procedures, 77,476 broilers and 49,362 SNPs were retained for the genome-enabled prediction analysis.
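The outlier rule above can be sketched as follows; this is a toy illustration with simulated weights (not the Cobb data), where records deviating more than 3.5 SD from their contemporary-group mean are flagged and removed:

```python
import numpy as np

# Toy data: two contemporary groups of 50 broilers each, with one
# obvious outlier injected into the first group.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 50)
bw = rng.normal(2100.0, 240.0, 100)
bw[group == 1] += 100.0          # shift group 1 upward
bw[0] = 5000.0                   # injected outlier

# Flag records within 3.5 SD of their contemporary-group mean.
keep = np.ones(bw.size, dtype=bool)
for g in np.unique(group):
    idx = group == g
    mu, sd = bw[idx].mean(), bw[idx].std()
    keep[idx] = np.abs(bw[idx] - mu) <= 3.5 * sd

bw_clean = bw[keep]              # outlier-free body weights
```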

### Genome-enabled prediction analysis

Genome-enabled prediction of body weight in broilers was performed in two steps. In the first step, variance components were estimated using the AIRemlf90 software [42], and body weight was pre-adjusted by fitting the following linear mixed model:

$$ \mathbf{y}=\mathbf{X}\boldsymbol{\uptheta } +\mathbf{Zu}+\mathbf{Wc}+\mathbf{e} $$

where **y** is a vector of body weights; **θ** is a vector of fixed effects; **u** and **c** are vectors of random additive genetic and maternal permanent environmental effects, respectively; **X**, **Z**, and **W** are known incidence matrices of fixed, additive genetic, and maternal permanent environmental effects, respectively; and **e** is the vector of residuals. Fixed effects in the model were sex (2 levels) and contemporary group (500 levels). Random effects were assumed to be independent of each other, with distributions \( \mathbf{u}\sim \mathrm{N}\left(0,\mathbf{A}{\upsigma}_{\mathrm{u}}^2\right) \), \( \mathbf{c}\sim \mathrm{N}\left(0,\mathbf{I}{\upsigma}_{\mathrm{c}}^2\right) \), and \( \mathbf{e}\sim \mathrm{N}\left(0,\mathbf{I}{\upsigma}_{\mathrm{e}}^2\right) \), where **A** is the additive genetic relationship matrix (342,383 × 342,383), **I** is an identity matrix of appropriate order (i.e. 333 × 333 for maternal permanent environmental effects and 77,476 × 77,476 for residuals), and \( {\sigma}_u^2 \), \( {\sigma}_c^2 \), and \( {\sigma}_e^2 \) are the variance components for additive genetic, maternal permanent environmental, and residual effects, respectively. The pre-adjusted body weight was defined as \( {\mathbf{y}}^{\ast}=\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\uptheta}}-\mathbf{W}\hat{\mathbf{c}} \). In the second step, the pre-adjusted body weight was regressed on the genetic markers using three predictive approaches: BRR, Bayes Cπ, and DNN.
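The pre-adjustment step can be illustrated numerically; in this minimal sketch the fixed-effect solutions (theta_hat) and maternal permanent environmental solutions (c_hat) are assumed to have already been obtained from the mixed model, and all values are toy numbers:

```python
import numpy as np

y = np.array([100.0, 120.0, 90.0])                  # raw body weights
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # toy fixed-effect incidence
W = np.array([[1.0], [1.0], [0.0]])                 # toy dam incidence
theta_hat = np.array([10.0, 20.0])                  # estimated fixed effects
c_hat = np.array([5.0])                             # estimated maternal PE effect

# y* = y - X theta_hat - W c_hat
y_star = y - X @ theta_hat - W @ c_hat
```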

### Bayesian regression models

The Bayesian regression analyses were implemented with the R package *BGLR* [43]. The Markov chain Monte Carlo sampling for each Bayesian model used a chain of 30,000 iterations, of which the first 20,000 cycles were discarded as burn-in, with posterior samples retained every 5 iterations. The general statistical model for BRR and Bayes Cπ was:

$$ {\mathbf{y}}^{\ast}=\mathbf{1}\mu +\sum \limits_{j=1}^k{\mathbf{z}}_{\mathbf{j}}{a}_j+\mathbf{e} $$

where **y**^{∗} is the n × 1 vector of pre-adjusted body weights; *μ* is an overall intercept; *k* is the number of markers; **z**_{j} is an n × 1 vector containing the genotypes of the animals for marker *j*; *a*_{j} is the effect of marker *j*; and **e** is the vector of residuals. SNP genotypes were standardized to have mean 0 and variance 1. The vector of residuals **e** was assumed to follow a Gaussian distribution with mean zero and variance \( {\sigma}_e^2 \).
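The per-marker standardization of the genotypes can be sketched with a toy matrix of 0/1/2 allele counts:

```python
import numpy as np

# Toy genotype matrix: 4 animals x 3 SNPs, coded as copies of the
# reference allele.
Z = np.array([[0.0, 1.0, 2.0],
              [2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [1.0, 0.0, 2.0]])

# Standardize each marker column to mean 0 and variance 1.
Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)
```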

### Bayesian Ridge Regression

In BRR, an independent prior Gaussian distribution with mean zero and variance \( {\upsigma}_a^2 \) was assigned to each SNP, so that \( \mathrm{p}\left({a}_1,{a}_2,\dots, {a}_k|{\upsigma}_a^2\right)={\prod}_{j=1}^{\mathrm{k}}\mathrm{N}\left({a}_j|0,{\upsigma}_a^2\right) \). The unknown variance parameter \( {\sigma}_a^2 \) is the same for all genetic markers, and a scaled inverse chi-squared distribution \( {X}^{-2}\left({v}_a,{\mathrm{S}}_a^2\right) \) was specified as its prior distribution, where \( {\mathrm{S}}_a^2 \) and *v*_{a} are a scale parameter and degrees of freedom, respectively. A flat prior distribution was assigned to the overall constant *μ*, and, similarly to \( {\sigma}_a^2 \), a scaled inverse chi-squared distribution \( {X}^{-2}\left({v}_e,{\mathrm{S}}_e^2\right) \) was defined as the prior distribution for the residual variance \( {\sigma}_e^2 \).
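A schematic single-site Gibbs sampler for BRR on simulated data is sketched below; the hyperparameter values (v_a, S2_a, v_e, S2_e), the chain length, and the simulated data are illustrative placeholders, not the settings used with *BGLR* in this study:

```python
import numpy as np

# Simulate a small data set: 200 animals, 50 standardized markers.
rng = np.random.default_rng(1)
n, k = 200, 50
Z = rng.choice([0.0, 1.0, 2.0], size=(n, k))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
a_true = rng.normal(0.0, 0.3, k)
y = 2.0 + Z @ a_true + rng.normal(0.0, 1.0, n)

v_a, S2_a, v_e, S2_e = 4.2, 0.1, 4.2, 1.0      # assumed hyperparameters
mu, a, s2a, s2e = y.mean(), np.zeros(k), 0.1, 1.0
zz = (Z ** 2).sum(axis=0)
e = y - mu                                      # current residuals
n_iter, burn = 500, 250
a_post = np.zeros(k)                            # posterior mean of effects

for it in range(n_iter):
    # Overall constant (flat prior).
    e += mu
    mu = rng.normal(e.mean(), np.sqrt(s2e / n))
    e -= mu
    # Marker effects: a common variance s2a shared by all SNPs.
    for j in range(k):
        e += Z[:, j] * a[j]
        c = zz[j] + s2e / s2a
        a[j] = rng.normal(Z[:, j] @ e / c, np.sqrt(s2e / c))
        e -= Z[:, j] * a[j]
    # Variances: scaled inverse chi-squared full conditionals.
    s2a = (a @ a + v_a * S2_a) / rng.chisquare(k + v_a)
    s2e = (e @ e + v_e * S2_e) / rng.chisquare(n + v_e)
    if it >= burn:
        a_post += a / (n_iter - burn)
```

With enough post-burn-in samples, `a_post` approximates the posterior means of the marker effects and should track the simulated `a_true`.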

### Bayes Cπ

The Bayes Cπ method was proposed by Habier et al. (2011) [10], and assumes that the vast majority of genetic markers have a null effect, with only a small proportion of SNPs having non-null effects. The prior distribution of each marker effect (*a*_{j}) depends on the variance \( {\upsigma}_a^2 \) and on the probability π that a marker has zero effect. This mixture of distributions is described as follows:

$$ {a}_j\mid \pi, {\sigma}_a^2=\left\{\begin{array}{cc}0& \mathrm{with}\ \mathrm{probability}\ \pi \\ {}\sim \mathrm{N}\left(0,{\sigma}_a^2\right)& \mathrm{with}\ \mathrm{probability}\ \left(1-\pi \right)\end{array}\right. $$

A scaled inverse chi-squared distribution with parameters *v*_{a} and \( {S}_a^2 \) was assumed for \( {\upsigma}_a^2 \), in which *v*_{a} = 4.2 and \( {S}_a^2=\frac{{\overset{\sim }{\sigma}}_{\mathrm{a}}^2\left({v}_a-2\right)}{v_a} \). The parameter \( {\overset{\sim }{\sigma}}_{\mathrm{a}}^2 \) is equal to \( \frac{{\overset{\sim }{\sigma}}_{\mathrm{s}}^2}{\left(1-\uppi \right){\sum}_{\mathrm{j}=1}^{\mathrm{k}}{p}_j\left(1-{p}_j\right)} \), where \( {\overset{\sim }{\sigma}}_s^2 \) is the variance explained by all markers and *p*_{j} is the allele frequency of the *j*^{th} SNP. In the Bayes Cπ method, π is treated as unknown, and a uniform *prior* distribution on (0, 1) was assigned to this parameter. The inclusion/exclusion of each marker is governed by an indicator variable *δ*_{j}, which is equal to 1 if marker *j* is included in the model, and zero otherwise. The prior distributions for the overall constant *μ* and the residual variance \( {\sigma}_e^2 \) were the same as those specified in the BRR model.
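The update of the indicator *δ*_{j} and the effect *a*_{j} for a single marker can be sketched as below; the data are simulated and the current parameter values (s2a, s2e, pi) are toy placeholders, with the inclusion probability derived from the marginal likelihood of the right-hand side r = z'e under each value of the indicator:

```python
import numpy as np

# One marker, 100 animals; residuals e_adj have marker j's contribution
# added back, and the marker carries a clear simulated signal (0.8).
rng = np.random.default_rng(2)
n = 100
z_j = rng.choice([0.0, 1.0, 2.0], n)
z_j = (z_j - z_j.mean()) / z_j.std()
e_adj = 0.8 * z_j + rng.normal(0.0, 1.0, n)
s2a, s2e, pi = 0.2, 1.0, 0.9

zz = z_j @ z_j
r = z_j @ e_adj
v0 = zz * s2e                    # Var(r) if delta_j = 0
v1 = zz * s2e + zz ** 2 * s2a    # Var(r) if delta_j = 1
log_odds = (np.log((1.0 - pi) / pi)
            + 0.5 * np.log(v0 / v1)
            + 0.5 * r ** 2 * (1.0 / v0 - 1.0 / v1))
p_in = 1.0 / (1.0 + np.exp(-log_odds))   # Pr(delta_j = 1 | rest)

# Sample the indicator; draw an effect only if the marker is included.
delta_j = rng.random() < p_in
a_j = 0.0
if delta_j:
    c = zz + s2e / s2a
    a_j = rng.normal(z_j @ e_adj / c, np.sqrt(s2e / c))
```

Because the simulated marker has a strong effect, the inclusion probability should be near 1 despite the prior exclusion probability π = 0.9.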

### Deep neural networks

Deep neural networks were implemented using a multilayer perceptron (MLP) architecture, in which units from adjacent layers are fully connected, and contained a minimum of 3 layers (Fig. 4). The first layer was the input layer, composed of 49,361 units, with each unit representing one SNP. Initially, the inputs for each unit were coded as 0, 1, and 2, representing the number of copies of the reference allele. The inputs were standardized (mean = 0, variance = 1) to avoid numerical issues during the training of the DNN. In this setting, each unit (i.e., SNP) has an interpretation similar to that of a covariate in a linear regression model. The last layer of the MLP, called the output layer, provides the prediction (output) of the pre-adjusted body weight. The layers between the input and output layers are defined as hidden layers. Connections between units are established through parameters referred to as weights. The training process of a DNN occurs in two steps: forward and backward propagation. In the forward propagation, information flows from the input to the output layer; each unit transforms a linear combination of weights and inputs from the previous layer with an activation function, and passes the result to the units in the next layer. The output of each unit is obtained as \( {o}_r={g}_t\left({\sum}_{s= 1}^z{w}_{rs}{x}_s\right) \), where *g*_{t}(.) is the activation function, *w*_{rs} is the weight connecting the *r*^{th} unit to the input *x*_{s} of unit *s* in the previous layer, and *z* is the number of units in the previous layer. The information propagates in this way through the DNN layers until it reaches the output layer, which provides the prediction for pre-adjusted body weight. A linear activation function, i.e. \( {\sum}_{s=1}^z{w}_{rs}{x}_s \), was applied to the output layer, while rectified linear activation functions, i.e. \( \max \left({\sum}_{s=1}^z{w}_{rs}{x}_s,0\right) \), were used for units in the hidden layers.
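The forward propagation step can be sketched for a small MLP (SNP inputs, one ReLU hidden layer, one linear output unit), matching \( {o}_r={g}_t\left({\sum}_{s=1}^z{w}_{rs}{x}_s\right) \); the layer sizes and random weights are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 10))              # 4 animals, 10 standardized SNPs
W1 = rng.normal(0.0, 1e-2, size=(10, 8))  # input -> hidden weights
W2 = rng.normal(0.0, 1e-2, size=(8, 1))   # hidden -> output weights

h = np.maximum(x @ W1, 0.0)   # rectified linear activation in the hidden layer
o = h @ W2                    # linear activation in the output layer
```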
In the backward propagation step, the weights of the DNN are adjusted via a stochastic gradient descent algorithm, which uses the partial derivatives of the loss function with respect to the weights to minimize it. Hence, the main goal during the training of a DNN is to find the weights that minimize the loss function. In the present study, weights were initialized from a normal distribution with mean zero and variance 10^{− 4}, and the loss function included a ridge (L2) penalty, represented in matrix notation as follows:

$$ \mathrm{L}\left(\mathbf{y},\hat{\mathbf{o}}\right)={\left(\mathbf{y}-\hat{\mathbf{o}}\right)}^{\mathrm{t}}\left(\mathbf{y}-\hat{\mathbf{o}}\right)+\uplambda {\mathbf{w}}^{\mathrm{t}}\mathbf{w} $$

where \( \mathrm{L}\left(\mathbf{y},\hat{\mathbf{o}}\right) \) is the penalized loss function, **y** is the vector of pre-adjusted body weights, \( \hat{\mathbf{o}} \) is the vector of predicted body weights, **w** is the vector of DNN weights, and λ is the regularization parameter. The DNN optimization was performed with the Adam algorithm, an extension of stochastic gradient descent that accounts for the first and second moments of the gradients [45]. Deep neural networks were trained with a learning rate of 10^{− 4} and a mini-batch size of 256 instances.
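The penalized loss can be checked numerically with made-up values:

```python
import numpy as np

y = np.array([2100.0, 2250.0, 1980.0])      # pre-adjusted weights (toy)
o_hat = np.array([2110.0, 2240.0, 2000.0])  # predicted weights (toy)
w = np.array([0.5, -0.3, 0.2, 0.1])         # a tiny weight vector (toy)
lam = 0.01                                  # illustrative regularization

# L(y, o_hat) = (y - o_hat)'(y - o_hat) + lambda * w'w
loss = (y - o_hat) @ (y - o_hat) + lam * (w @ w)
```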

A common challenge when modeling DNN is overfitting. In addition to the ridge penalization, we implemented the DNN with two additional approaches to tackle potential overfitting: early stopping and dropout [44]. In early stopping, a part of the training set, referred to as the tuning set, was used exclusively to monitor the performance of the DNN. The mean squared error (MSE) in the tuning set was computed every 5 epochs, and the learning process stopped if no improvement was observed after 5 consecutive assessments of the MSE in the tuning set. The maximum number of epochs allowed to train each DNN was 1000. In the dropout approach, units of the DNN are randomly removed during training, so that the dropped units are inactive in the forward propagation and their corresponding weights are not updated in the backward propagation. Both early stopping and dropout are important strategies to improve the generalization of DNN models and thus mitigate overfitting. Another critical step in fitting DNN is the search for the best architecture, since it can drastically influence predictive performance. To address this problem, we searched for neural network hyperparameters by taking advantage of a parallel computing system available at the University of Wisconsin-Madison through the Center for High Throughput Computing (CHTC), in which 200 different DNN architectures were tested simultaneously. Random combinations of all hyperparameters included in the search space [i.e. number of hidden layers, number of units per layer, ridge regularization (L2), and a common dropout rate for each layer] were tested to select the best DNN model. More detailed information about the hyperparameters used in the search is presented in Table 1. All DNN analyses were implemented in Python using the TensorFlow library [46].
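The early-stopping rule can be expressed as a schematic loop; `train_one_epoch` and `tuning_mse` below are stand-ins for the real training and tuning-set evaluation steps, and the toy MSE curve improves and then plateaus so the rule triggers:

```python
def train_one_epoch(state):
    # Placeholder for one epoch of DNN weight updates.
    return state + 1

def tuning_mse(state):
    # Toy tuning-set MSE curve: improves, then plateaus at 1.0.
    return max(1.0, 50.0 - state)

state, best, bad_checks = 0, float("inf"), 0
for epoch in range(1, 1001):                 # at most 1000 epochs
    state = train_one_epoch(state)
    if epoch % 5 == 0:                       # check the tuning set every 5 epochs
        mse = tuning_mse(state)
        if mse < best:
            best, bad_checks = mse, 0
        else:
            bad_checks += 1
        if bad_checks == 5:                  # 5 consecutive checks without improvement
            break
stopped_epoch = epoch
```

With this toy curve, the MSE stops improving at epoch 50 and training halts 25 epochs later, once five consecutive checks fail to improve.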

### Assessment of predictive performance

The predictive performance of each method was assessed by splitting the data into training and testing sets, using the chronological information of 18 maternal grandsire generations as the splitting criterion. The training set included 67,741 broilers related to the oldest 16 maternal grandsire generations, while the testing set contained 9735 broilers originating from the newest two maternal grandsire generations. This validation scheme was adopted because, in practice, genomic breeding values of selection candidates (younger individuals) are predicted using marker effects estimated from the information of a reference population (older individuals). To investigate the impact of training sample size on predictive performance, 15 subsets of the training set were created by sub-sampling 1, 3, 5, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, and 100% of the original training set size. Broilers were randomly assigned to each subset with the constraint that larger subsets contained all animals from smaller subsets (i.e., 1% ⊂ 3% ⊂ … ⊂ 100%). Furthermore, a tuning set was created to implement the early stopping approach for the DNN. The tuning set included 4225 broilers descended from the most recent maternal grandsire generation in the training set. Therefore, all predictive approaches were trained using broilers related to the oldest 15 maternal grandsire generations, and the tuning set was used exclusively to monitor the predictive performance of the DNN. Keeping an exclusive data set to fine-tune hyperparameters favors the DNN over the Bayesian regression approaches, as the latter do not require hyperparameter optimization. To make the comparisons fairer, Bayesian Ridge Regression and Bayes Cπ were also refitted with the tuning set of the DNN added to each sub-sample of the training set.
These new models (denoted BRR-WT and Bayes Cπ-WT) correspond to the BRR and Bayes Cπ models previously described, but fitted with the tuning set added to each sub-sample of the training set.
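The nested sub-sampling scheme can be sketched by reusing a single random permutation of the training IDs, so that every larger subset contains all animals of the smaller ones; the 1000 IDs here are toy values:

```python
import numpy as np

rng = np.random.default_rng(5)
ids = np.arange(1000)                       # stand-in training-set IDs
perm = rng.permutation(ids)                 # one random order reused for all subsets
fractions = [0.01, 0.03, 0.05, 0.07, 0.10, 0.15, 0.20, 0.30,
             0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]

# Taking prefixes of the same permutation guarantees 1% ⊂ 3% ⊂ ... ⊂ 100%.
subsets = {f: set(perm[: int(round(f * ids.size))]) for f in fractions}
```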

The criteria used to assess predictive performance were the prediction correlation, i.e. the correlation between pre-adjusted and predicted body weights, and the mean squared error of prediction (MSEP), given by:

$$ MSEP=\frac{\sum \limits_{i=1}^{n_{test}}{\left({y}_i^{\ast }-{\hat{y}}_i\right)}^2}{n_{test}}, $$

where \( {y}_i^{\ast } \) is the pre-adjusted body weight of the *i*^{th} broiler in the testing set, \( {\hat{y}}_i \) is the corresponding predicted genomic breeding value for body weight, and *n*_{test} is the number of records in the testing set. Furthermore, improvements in prediction correlation and MSEP of the DNN relative to the Bayesian models were assessed using the relative gain (RG), measured as follows:

$$ \mathrm{RG}=\frac{\left({\mathrm{r}}_1-{\mathrm{r}}_2\right)}{{\mathrm{r}}_2}\times 100, $$

where r_{1} and r_{2} are the values of the predictive criterion (i.e. prediction correlation or MSEP) for the DNN and the Bayesian approaches, respectively.
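The three criteria can be computed as follows; all values here are made up for illustration:

```python
import numpy as np

y_star = np.array([2100.0, 2250.0, 1980.0, 2200.0])   # pre-adjusted weights (toy)
y_hat = np.array([2120.0, 2230.0, 2000.0, 2180.0])    # predicted weights (toy)

r = np.corrcoef(y_star, y_hat)[0, 1]                  # prediction correlation
msep = np.mean((y_star - y_hat) ** 2)                 # MSEP

# Relative gain of one method over another (hypothetical correlations).
r_dnn, r_bayes = 0.36, 0.33
rg = (r_dnn - r_bayes) / r_bayes * 100.0              # RG, in %
```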

The predictive bias was also investigated for each genome-enabled prediction approach as the deviation from unity of the regression coefficient of pre-adjusted on predicted body weight. In addition, the Spearman rank correlation between predicted body weights and the agreement on the top 10-ranked broilers were used to assess the similarity between the different genome-enabled prediction approaches.
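These similarity checks can be sketched with simulated predictions; the regression slope near 1 indicates little bias by construction, and the second method's predictions are a noisy copy of the first's:

```python
import numpy as np

rng = np.random.default_rng(6)
y_hat = rng.normal(0.0, 1.0, 500)            # one method's predictions (toy)
y_star = y_hat + rng.normal(0.0, 0.5, 500)   # pre-adjusted weights, unbiased

# Bias check: slope of the regression of y_star on y_hat.
slope = np.polyfit(y_hat, y_star, 1)[0]

# A second method's predictions, highly similar to the first's.
pred_b = y_hat + rng.normal(0.0, 0.1, 500)

def rank(v):
    return np.argsort(np.argsort(v))

# Spearman rank correlation = Pearson correlation of the ranks.
rho = np.corrcoef(rank(y_hat), rank(pred_b))[0, 1]

# Agreement on the top 10-ranked animals.
top_a = set(np.argsort(y_hat)[-10:])
top_b = set(np.argsort(pred_b)[-10:])
overlap = len(top_a & top_b)
```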