Fish material and challenge test
The breeding program for Atlantic salmon was started by the company AquaChile (Puerto Montt, Chile) in the year 1997 with the aim of improving economically important traits. The base population of the breeding nucleus originated from the Irish strain Fanad-Mowi (originally from Norway) [32, 33] and was introduced during the 1990s to Chile through commercial agreements. At present, the breeding program is managed by the breeding company Aquainnovo SA at Salmones Chaicas (Puerto Montt, Chile). All fish material used in this study was from the same breeding program and corresponds to the year-class 2010, which has undergone four generations of selection mainly for harvest weight in Chilean farming conditions. The fish were hatched during May 2010 with an approximate mating ratio of one sire to two dams in most cases. A total of 118 families were produced as progeny of 40 sires and 118 dams and reared in separate tanks until tagging. The required number of fish from all families were tagged individually at an average weight of 13.1 g (SD = 3.4 g) using Passive Integrated Transponder-tag (PIT-tag), in order to keep pedigree information. Tagged fish were reared in a single communal tank for about 14 months before transfer to the Aquainnovo’s Research Station located in Lenca River, Xth Region, Chile. After a 29 day acclimation period in salt water (31 ppt) a total of 2601 fish, an average of 22 (ranging between 9 and 24) fish per family, weighing on average 274.8 g (SD = 90.6 g) were experimentally challenged with P.salmonis as described previously [10, 18]. In brief, prior to the challenge test, the fish tested negative for the presence of Infectious Salmon Anaemia virus, Infectious Pancreatic Necrosis virus, Renibacterium salmoninarum by RT-PCR and negative for Flavobacterium spp. culture. To induce infection, fish were injected with 0.2 ml of a LD50 inoculum of P.salmonis through intra-peritoneal (IP) injection. Post IP injection, infected fish were distributed equally by family into three different tanks with salt water (31 ppt) such that each of the 118 full-sib families were represented in all three tanks. The challenge test continued for 40 days and mortalities were recorded daily. The Kaplan-Meier curves of the survival function was plotted for the test period to show the cumulative mortality across the challenge (Additional file 1: Figure S2). All surviving fish at day 40 were anesthetized and euthanized. Tissue samples (fin clips) for genomic DNA isolation were taken from all fish and preserved in 95% ethanol at −80 °C. The procedures for challenge and sampling were approved by The Comite de Bioetica Animal, Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile (Certificate N0 08–2015).
Genotype data
Genomic DNA was isolated from the stored fin clip samples of all 2601 challenge tested fish using a commercial kit (DNeasy Blood & Tissue Kit, Qiagen), following manufacturer’s protocol. Genotyping was performed using a 50 K Affymetrix® Axiom® myDesign™ SNP Genotyping array designed by the joint collaboration of AquaInnovo SA and the University of Chile. The 50 K SNP array used in this study was derived from a previously developed and validated custom made 200 K SNP array [34] based on several quality control criteria as described previously [18]. Importantly, the 50 K SNP array had markers distributed equally across the genome with a distance of more than 10Kb from its neighboring SNP [18]. Genotypes of all fish samples were obtained following Best Practices Analysis Workflow from Affymetrix [35] and selection of Poly-high-resolution and the No-minor-homozygote SNPs using SNPolisher [35]. To filter SNPs quality control of the SNP genotype data was performed based on the Hardy Weinberg equilibrium (p < 1 × 10−10), Minor Allele Frequency (>0.001) and the call rate for SNPs and samples (>0.95). The quality control step resulted in a total of 2392 individuals and 49,684 SNPs distributed across the genome for further analysis.
Phenotypic records and trait definitions
Resistance to P. salmonis was considered to be challenge survival, defined as the time to death measured in days (DAYS) with values ranging from 1 to 40 depending on the day the fish died; and as binary survival status (STATUS), scored as 1 if the fish died during the 40-day challenge and 2 if the fish survived until the end of the challenge. Thus, the fish with higher DAYS and a STATUS of 2, were assumed to be more resistant animals. Test tank designation and final body weight on the day of death or at the end of the challenge for survivors were recorded.
Breeding value estimation
The two resistance traits DAYS and STATUS were analyzed separately as a linear trait and threshold trait, respectively in univariate models. The EBV were estimated using polygenic pedigree based BLUP (PBLUP) [36]. The SNP effects and GEBV were estimated using polygenic pedigree and/or SNP genotype information on the basis of genomic BLUP (GBLUP) [36], SNPBLUP [37], Bayes C [38, 39] and Bayesian LASSO [30, 40].
Pedigree based BLUP
The conventional pedigree-based variance components and EBV were estimated using PBLUP:
$$ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{T}\upmu +\boldsymbol{e} $$
(M1)
where β is a vector of fixed overall mean and age of fish at challenge (AGE) as co-variate, \( \boldsymbol{\upmu} \) is a vector of random additive genetic polygenic effects with a distribution \( \sim \mathrm{N}\left(0,\boldsymbol{A}{\sigma}_u^2\right) \), \( \boldsymbol{e} \) is the vector of random error effects with a distribution \( \sim \mathrm{N}\left(0,\boldsymbol{I}{\sigma}_e^2\right) \), X and T are the incidence matrices, \( \mathbf{A} \) is the pedigree-based additive genetic relationship matrix [41] and \( \boldsymbol{I} \) is the identity matrix. The trait DAYS was analyzed as a linear trait using AIREMLF90 and GIBBS1F90, whereas, the trait STATUS was analyzed as a threshold-linear trait in THRGIBBS1F90 in BLUPF90 family programs [36]. Briefly, AIREMLF90 uses Average-Information REML for estimating variance components for linear traits, GIBBS1F90 is for the Bayesian analysis of linear traits and THRGIBBS1F90 is for the Bayesian analysis of threshold categorical traits [36]. For Bayesian analysis, the Gibbs sampler was run for 120 000 iterations with a burn in of 20 000 iterations, and samples from every 100th sample were saved.
Genomic BLUP
The SNP based variance components and GEBV were estimated using GBLUP, similar to the PBLUP model (M1) described above. However, in GBLUP, \( \boldsymbol{\mu} \) is a vector of random additive genetic polygenic effects with a distribution \( \sim \mathrm{N}\left(0,\mathbf{G}{\sigma}_u^2\right) \). Here, \( \mathbf{G} \) is the genomic relationship matrix, created as described by VanRaden [42]. All other parameters and details of the analysis for trait DAYS and STATUS are the same as PBLUP (M1).
SNP based BLUP method
The SNP based BLUP method used to estimate marker effects and to predict GEBV was similar to GBLUP, where a priori distribution of additive marker locus effects was considered to be normal [20, 42]. The model used, PSNPBLUP, combined both marker effects as well as polygenic effect (infinitesimal effect with pedigree) for all genotyped fish:
$$ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{Z}\boldsymbol{a}+\boldsymbol{T}\upmu +\boldsymbol{e} $$
(M2)
where \( \boldsymbol{a} \) is the additive marker locus effect, \( \boldsymbol{Z} \) is the incidence matrix relating to marker genotype and all other parameters are the same as PBLUP (M1). It was assumed that \( \boldsymbol{a} \) follows a priori a normal distribution \( \sim \mathrm{N}\left(0,\boldsymbol{I}{\sigma}_a^2\right) \), where \( \boldsymbol{I} \) is an identity matrix. This model is often called ridge-regression best linear unbiased prediction with a normal distribution of marker effects [20, 42]. The trait DAYS was analyzed as a linear-mixed model in the context of Henderson’s BLUP [43] with known variances for all random effects using the keyword BLUP in the GS3 software [37]. The trait STATUS was analyzed as a threshold (probit) model assuming known variances. Random effects were estimated via Gibbs sampler using the keyword MCMCBLUP in GS3 software [37]. For both traits, the initial genetic variance \( {\sigma}_{u\;}^2 \) and residual variance \( {\sigma}_e^2 \) estimated from the model PBLUP were used to estimate additive marker variance \( {\sigma}_{a\;}^2 \) = \( {\sigma}_{u\;}^2/2\sum {p}_i{q}_i \). Breeding values for both traits were estimated using marker effects only (without pedigree-based polygenic effect) using the model SNPBLUP:
$$ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{Z}\boldsymbol{a}+\boldsymbol{e} $$
(M3)
All model parameters are as described above. The BLUP was run for 10 000 iterations with convergence criteria of 10−12 (1d-12) and correction every 100 iterations. For MCMCBLUP, a single chain with a length of 150 000 iterations was run. The burn-in period and the thinning interval used was 50 000 and 100 iterations, respectively.
Bayesian estimation method: Bayes C
The Bayes C method is a mixture model for SNP effects with an assumption that there is a large group of SNPs with zero or near zero effects and a second smaller group of SNPs with larger effect [39, 44]. The Bayes C method was fitted using the same model equations as in PSNPBLUP (M2) and SNPBLUP (M3), hereafter referred to as PBAYESC and BAYESC, respectively. All model parameters are defined as above, except the elements of vector \( \boldsymbol{a} \) which was calculated for each fish as:
$$ {\displaystyle \sum_{i=1}^N\left({z}_i{a}_i{\delta}_i\right)} $$
where \( {z}_i \) is the genotype of i
th marker, \( {a}_i \) is the effect marker i, and \( {\delta}_i \) is an indicator variable that explains if the i
th marker has an effect or not. In turn, variables \( \delta \) have a binomial (Bernoulli) distribution with a probability of π being 0 (marker has zero effect on the trait) and with a probability of \( 1-\pi \) being 1 (marker has non-zero effect on the trait). An informative beta distribution (with α and β parameters) was assumed for \( \pi \) (α = 1 and β = 1, implying uniform distribution of this parameter) and inverted chi-squared distributions for the different variances \( {\sigma}_a^2 \), \( {\sigma}_u^2 \) and \( {\sigma}_e^2 \).
Bayesian estimation method: Bayesian LASSO
The Bayesian LASSO method was implemented in the context of a linear mixed model assuming an exponential prior distribution for variances of SNP marker effects [37, 45]. An alternative Bayesian implementation of the LASSO procedure [40], similar to the model equations PSNPBLUP (M2) and SNPBLUP (M3) as proposed by Legarra et al. [30], PBLASSO and BLASSO, respectively, were used. All model parameters are defined as above, except the a priori distributions of individual SNP effects (\( {a}_i \)) which was calculated as:
$$ \Pr \left({a}_i\Big|{\tau}^2\right)=N\left(1,{\tau}_i^2\right)\ \mathrm{and}\; \Pr \left({\tau}_i^2\right)=\frac{\lambda^2}{2} \exp \left(-{\lambda}^2\left|{\tau}_i^2\right|\right). $$
Individual variances for each SNP (i.e., \( {\tau}_i^2 \)) are estimated conditionally on a regularization parameter λ, which was estimated by using an a priori gamma distribution bounded between 0 and 107. Initial value for parameter λ as \( {\lambda}^2=2/{\sigma}_a^2 \) was used and flat priors were assumed for \( {\sigma}_a^2 \), \( {\sigma}_u^2 \) and \( {\sigma}_e^2 \).
All model parameters and SNP effects estimated in a Bayesian framework (PBAYESC, BAYESC, PBLASSO and BLASSO) were estimated using the Gibbs sampling algorithm implemented in GS3 software [37]. A single chain with a length of 150 000 iterations was run. The burn-in period and the thinning interval used was 50 000 and 100 iterations, respectively.
Genetic parameters and GEBV
The total additive genetic variance \( \Big({\sigma}_{u\;}^2 \)) estimated in PBLUP and GBLUP models was calculated using relationship matrix \( \mathbf{A} \) and \( \mathbf{G} \), respectively. For both trait (DAYS and STATUS), the heritabilities were computed as:
$$ {h}^2=\frac{\sigma_u^2}{\sigma_u^2+{\sigma}_e^2}. $$
In contrast, for Bayesian models, the total additive genetic variance (\( {V}_{A\;}^{\prime}\Big) \) was estimated as the sum of additive marker (\( 2{\sigma}_{a\;}^2\pi \sum {p}_i{q}_i \)) and polygenic-pedigree (\( {\sigma}_{u\;}^2 \)) based additive genetic variance; i.e., \( {V}_{A\;}^{\prime }=2{\sigma}_{a\;}^2\pi \sum {p}_i{q}_i+{\sigma}_{u\;}^2 \) (\( {\sigma}_{u\;}^2=0, \) if pedigree was not used). Here, the heritabilities were computed as:
$$ {h}^2=\frac{V_A^{\prime }}{V_A^{\prime }+{\sigma}_e^2}. $$
Additionally, in SNP based BLUP and Bayesian GS methods, the predicted GEBV were either generalized genomic breeding values (i.e., the sum of polygenic pedigree based EBV and SNP effects) or just SNP effects, depending on whether pedigree and SNP information or only SNP information was used.
Cross validation scheme for model comparison
Predictive abilities of the different models described above (PBLUP, GBLUP, PSNPBLUP, SNPBLUP, PBAYESC, BAYESC, PBLASSO and BLASSO) were assessed through a five-fold cross validation (CV) scheme. All the fish with both phenotypes and genotypes were randomly sampled into five validation sets. The GEBV of the validation data sets were predicted one at a time where the phenotype of the validation fish (20% of the population) was masked (set to missing) and all remaining fish with phenotype and genotype (80% of the population) were used as training data. To reduce the stochastic effects, the CV analysis was replicated 10 times. Predictive ability was presented as reliability, which was estimated as:
$$ {R}_{EBV,BV}^2=\frac{R_{EBV,y}^2}{h^2}, $$
where \( {R}_{EBV,\;BV}^2 \) is the squared correlation between predicted (G)EBV for fish in the validation data in a given model (predicted from the training data), the recorded phenotype (y), and a “common” heritability (\( {h}^2\Big) \) of the trait which was calculated using PBLUP with full-data set and without marker information.
The Pearson’s correlation coefficients between the (G)EBV obtained by the different models was used to measure the degree of similarity between the rankings of fish. Also, for all models, the slope of regression of actual phenotype (either DAYS or STATUS) on (G)EBV were calculated and used as a measure to indicate the bias of the (G)EBV [46]. A slope of regression coefficient close to 1 indicates no bias in the model and breeding values are equal in magnitude [47]. Whereas, a slope of less than 1 or greater than 1 indicates a biased underestimation or overestimation in the (G)EBV prediction, respectively [48]. The reliability, Spearman’s rank correlation and slope of regression for each model, were reported as the average of the CV schemes used.
In addition, the effect of marker densities on the estimated reliabilities in different GS models was tested. For this, a random sample of 500, 1 K, 3 K and 20 K SNPs was used separately to predict GEBV for all GS models under the CV scheme described above. Using the lowest possible SNP density with higher or similar reliability of the 50 K SNP could help reduce genotyping costs.