Fish samples and phenotypic data
In this study, a total of 752 DNA samples of yellowtail kingfish (YTK) were sequenced using DArT-seq technology (see section “Library preparation and DArT sequencing”). The animals originated from a selective breeding program for YTK at Clean Sea Tuna Ltd. in South Australia. The fish samples analysed in this study were offspring of 35 families produced from the first generation (G1) and wild brood stock parents. The average number of offspring per family was 17. Breeding was conducted in tank, comprising three males and three females. In total, 16 sires and 31 dams successfully produced offspring in this study. After a nursing/rearing period of about 120 days in tanks, fingerlings were transferred to culture in sea cages. When the fish reached an average body weight of 3 kg, they were harvested and then anesthetised using clove oil (40 mg/L) and cold water before morphometric measurements and fin tissue collection. The morphometric and phenotypic measurements included body weight (W), fork length (L, measured from the tip of the snout to the end of the caudal fin rays). The condition index (factor) was calculated as k = W/L3. Deformities [17] included a range of measures, namely deformed snout, water belly – a condition where the belly is distended, deformed tail, deformed operculum and lower jaw). Skin fluke is due to the monogenean fluke parasite Benedenia seriolae; this fluke inhabits the skin and fins of Seriola spp. and feeds on mucus and epithelia cells. Both deformity and fluke were recorded as binary traits depending on their presence or absence on the body of the fish at harvest (~ 3 kg) and coded as 1 and zero, respectively. The incidence of skin fluke and deformity recorded under field condition in this population was low (4.3 and 17.6%, respectively) and hence, results from genomic prediction for these traits were not tabulated in this paper.
The fin caudal tissue sample was collected from caudal fin of each fish and stored in 80% ethanol for DNA extraction at a laboratory of the University of the Sunshine Coast (USC). Fish biometrics, fluke abundancy and deformities were recorded before they were released back to the cage.
A detailed description of the population is given in our earlier studies [2, 3, 17].
DNA extraction, genotyping and parentage analysis
Fin tissue samples of a total of 1568 fish were used to extract total genomic DNA (gDNA) using DNeasy Blood and Tissue kits (Qiagen, Germany) and NucleoMag® 96 Tissue kit (MACHEREY-NAGEL GmbH & Co. KG, Germany). A panel of eight microsatellite markers that consisted of five newly developed candidates from S. lalandi transcriptome sequences (YTK002, YTK008, YTK011, YTK017 and YTK019) [3] and three loci (Sdu21, Sdu32 and Sdu46) validated from the literature [18, 19] were used to genotype the experiment fish for parentage analysis in this study. Genotyping the broodstock fish was completed in our previous study [2]. All of these markers were proved and validated to use with yellowtail kingfish previously [2, 3].
DNA amplification was achieved using Qiagen Multiplex PCR PLUS Kits (Qiagen, Germany) in 13.5 μL reactions, each containing 1.25 μL of 10× primer mix, 6.25 μL of Multiplex PCR Master Mix, 2.75 μL of RNase free water, 1.25 μL of Q-Solution and 2.0 μL of approximately 20 ng template gDNA. Amplification was performed using an Eppendorf Mastercycler nexus (Hamburg, Germany) with cycling conditions as follows: initial denaturation at 95 °C for 15 min, followed by 35 cycles of 95 °C for 30 s, 57 °C for 90 s, and 72 °C for 30 s; with a final extension at 68 °C for 30 min.
PCR products were separated by capillary electrophoresis on an AB 3500 Genetic Analyser (Applied Biosystems) at the University of the Sunshine Coast. Fragment sizes were determined relative to an internal lane standard (GS-600 LIZ; Applied Biosystems) using GENEMARKER v1.95 software (SoftGenetics; State College, USA) and double-checked manually. Individuals with low or missing peaks were amplified/genotyped a second time and checked for evidence of large allele dropout, extreme stuttering and null alleles, based on 1000 bootstraps and a 95% confidence interval. Tests for HWE at each locus and genotypic linkage equilibrium among pairs of loci, numbers of alleles and the observed and expected heterozygosities of each locus were determined using GENALEX v6.5, while polymorphic information content (PIC) was computed in CERVUS v3.0. Parentage assignment was completed using COLONY software [20] with confidence scores of above 95%. The pedigree included 65 full-sib groups (16 dams and 31 sires), with the family size of 3 to 108 offspring. A total of 1568 offsprings out of 1998 were assigned to full sib families and the family size. Based on this pedigree, large size families (35 full-sib families and averaging 17 fish per family) were chosen to send to DArT Ptd Ltd. in Canberra, Australia for sequencing.
Library preparation and DArT sequencing
Four methods of DArTseq™ complexity reduction were tested in Seriola lalandi (data not presented) and the PstI-SphI method was selected. DNA samples were processed in digestion/ligation reactions principally as per Kilian et al. [10] but replacing a single PstI-compatible adaptor with two different adaptors corresponding to two different Restriction Enzyme (RE) overhangs. The PstI-compatible adapter was designed to include Illumina flowcell attachment sequence, sequencing primer sequence and “staggered”, varying length barcode region, similar to the sequence reported by Elshire et al. [21]. Reverse adapter contained flowcell attachment region and SphI-compatible overhang sequence.
Only “mixed fragments” (PstI-SphI) were effectively amplified in 30 rounds of PCR using the following reaction conditions: 94 °C for 1 min, then 30 cycles of 94 °C for 20 s, 58 °C for 30 s, 72 °C for 45 s and 72 °C for 7 min.
After PCR equimolar amounts of amplification products from each sample of the 96-well microtiter plate were bulked and applied to c-Bot (Illumina) bridge PCR followed by sequencing on Illumina Hiseq2500. The sequencing (single read) was run for 77 cycles.
Sequences generated from each lane were processed using proprietary DArT analytical pipelines. In the primary pipeline the fastq files were first processed to filter away poor quality sequences, applying more stringent selection criteria to the barcode region compared to the rest of the sequence (minimum Phred pass score = 30 and minimum pass percentage = 75). In that way the assignments of the sequences to specific samples carried in the “barcode split” step were very reliable. Approximately 2,500,000 sequences per barcode/sample were identified and used in marker calling. Finally, identical sequences were collapsed into “fastqcoll files”. The fastqcoll files were “groomed” using DArT PL’s proprietary algorithm which corrects low quality base from singleton tag into a correct base using collapsed tags with multiple members as a template. The “groomed” fastqcoll files were used in the secondary pipeline for DArT PL’s proprietary SNP and SilicoDArT (presence/absence of restriction fragments in representation) calling algorithms (DArTsoft14).
For SNP calling all tags from all libraries included in theDArTsoft14 analysis are clustered using DArT PL’s C++ algorithm at the threshold distance of 3, followed by parsing of the clusters into separate SNP loci using a range of technical parameters, especially the balance of read counts for the allelic pairs. Additional selection criteria were added to the algorithm based on analysis of approximately 1000 controlled cross populations. Testing for Mendelian distribution of alleles in these populations facilitated selection of technical parameters discriminating well true allelic variants from paralogous sequences. In addition, multiple samples were processed from DNA to allelic calls as technical replicates and scoring consistency was used as the main selection criteria for high quality/low error rate markers. Calling quality was assured by high average read depth per locus (Average across all markers was over 30 reads/locus). Regarding whole read quality, the minimum Phred pass score was set at 10 and the minimum pass percentage was 50. The average SNP calling rate was 92% and the fish sample calling rate was 95%. The raw sequence data (accession number SRP130211) is available at https://www.ncbi.nlm.nih.gov/sra/SRP130211 (Release date: 2019-02-28) Release date: 2019-02-28.
GBS data analysis
Sample and markers statistics is given in Additional file 1: Tables S1 and S2, respectively. In total, there were 14,448 SNP markers. The PIC value for the SNPs was 0.16 under additive genetic model, whereas it was substantially higher, 0.46 under codominant model. The average proportion of missing genotype (SNPs) was only 14.8%. The frequency of minor allele was 0.29. The missing genotype was also imputed using theDArTsoft14 analysis pipeline.
Pedigree based analysis of heritability
Restricted maximum likelihood (REML) method applied to a linear (animal) mixed model was used to estimate heritability for traits studied. The model included the fixed effect of stock origin (wild and F1 fish) and the additive genetic random effect of individual fish in the pedigree. Our preliminary analysis using logarithmic likelihood ratio test (LRT) indicated that the common full-sib effect was not significant (P > 0.05). This has been a result of early communal rearing of all families soon after birth.
Heritability (h2) for the traits studied was estimated as h2 = σ2A/σ2P where σ2A is the additive genetic variance, σ2P the phenotypic variance (σ2P = σ2A + σ2E) and σ2E is the residual variance. Pedigree based analysis of heritability were carried out using the ASREML software package [22].
Genomic prediction
Genomic best linear unbiased prediction (gBLUP) method was used to assess the predictive ability of the model for traits studied, using SVS Suite (Golden Helix, 2016). This method (gBLUP) generally shows similar predictive capacity to other non-linear approaches, especially for growth traits as used in this study. Theoretical statistics of the gBLUP method was discussed in earlier reports [23]. In brief, the statistical model and assumptions regarding SNP distribution and its variance are given below.
The (gBLUP) method uses genomic relationship matrix (G) derived from DNA marker information to calculate genomic breeding values for each individual in the pedigree Clark et al., [24]. The model is written in a matrix notation as follows:
where y is the vector of phenotypic observations (body weight, length and condition index), m = overall mean, X is the incidence matrix containing fixed effects (i.e. stock origin in this study) in b. The matrix Z relates records to genomic values, a is the SNP effect and e is the residual error. In this model, gBLUP assumes equal variance σ2z for all loci/SNPs:
$$ Var(g)={ZZ}^{\prime }{\sigma^2}_z= G\phi \kern0.5em {\sigma^2}_z=G\kern0.5em {\sigma^2}_g $$
where σ2g is the genetic variance and ϕ is the normalisation factor.
The pseudo heritability was calculated from the gBLUP model as \( {h}_p^2={\widehat{\sigma}}_g^2 \) /(\( {\widehat{\sigma}}_g^2 \)+\( {\widehat{\sigma}}_e^2 \)) where \( {\widehat{\sigma}}_g^2 \) is the genomic variance and \( {\widehat{\sigma}}_e^2 \) is the component of error variance.
The accuracy or predictive ability of the gBLUP model was defined as the correlations between the predicted breeding values and actual phenotypes (\( {r}_{y,\widehat{y}} \)), divided by square root of the trait heritability.
Within population prediction
We conducted within population prediction using five-fold cross validation approach to assess how well our model can predict the phenotype. With this approach, the training data consisting of 752 samples with both phenotype and genotype were randomly partitioned into five equal sub-samples (paternal half-sibs were present in all sub-sets), and in each round (e.g. k-1), one sub-set was selected as a test (validation) and the model fitted with four folds to predict the validation set.
The gBLUP five-fold cross validation model used was the same as described above. The allele substitution effects (ASE) and fixed effect coefficients obtained from iterations and k-folds that gave the largest R2-value were used to predict phenotypes with the following model:
$$ \widehat{y}=\mathbf{X}\widehat{\beta }+\mathbf{M}\widehat{\alpha } $$
where \( \widehat{y} \) are the predicted phenotypes, X is the fixed effects matrix, \( \widehat{\beta} \) are the fixed effect coefficients, M is the genotype matrix, and \( \widehat{\alpha} \) are the ASE values.
The five-fold cross validation analyses for all the traits were conducted in SVS Suite (Golden Helix, 2016). We compared the actual and predicted phenotypes by association test and linear regression analysis. The coefficient of determination (R2) from the regression analysis (e.g. VanRaden et al. [25]) was also used to evaluate the predictive ability of the gBLUP model.