Creation of haploid Atlantic salmon
Atlantic salmon milt (Landcatch, UK) was diluted to a concentration of 5 × 108 ml-1 in modified Cortland’s solution, then a 2 ml aliquot was placed in a 5 cm diameter petri dish and irradiated with 254 nm UV light for 8 min at a dose rate of 170 μWcm-2 (optimization of irradiation protocol not shown). Irradiated milt was used to fertilize Atlantic salmon eggs (Landcatch, UK), which were then incubated under standard conditions. Putative haploids were sampled at 300 degree-days post-fertilization. Haploidy was verified by: (i) genotyping a sub-sample of these embryos (along with parents and diploid controls) using the 10 microsatellite marker multiplex system described in ; (ii) another sample from the same group was incubated to hatch to verify that they showed the typical “haploid syndrome” (small size and curved trunk compared to diploid controls ). Production of haploid embryos complied with the Animals (Scientific Procedures) Act 1986.
Animals and preparation of sequencing libraries
Six libraries were created for RR-Seq using the restriction enzyme HaeIII. Libraries 1 - 4 each corresponded to a pool of genomic DNA of ten fish (five male, five female) from each of the four year-group subpopulations of the LNS broodstock population. Library 5 comprised a pool of genomic DNA from 16 wild fish (sex unknown) with four from each of four populations sources in Scotland, Norway, Ireland and Spain, respectively. Library 6 comprised a single haploid fish and was sequenced for the purpose of identification and exclusion of PSV. A heterozygous base called in this single haploid individual most likely represent variation between paralogous loci (i.e. a PSV) rather than genuine SNP variation at a single unique genomic location. For libraries 1 - 5, equal amounts of individual genomic DNA was multiplexed to form pools of total 15 μg and, for library 6.5 μg genomic DNA from the haploid sample was used. These pools were subsequently digested with 15U HaeIII (Promega, USA) for 3 hours. Genomic DNA fragments of between 450-550 bp were size-selected by agarose gel electrophoresis and the gel slices were purified using a MinElute Gel Extraction kit (Qiagen, UK). The Illumina Truseq DNA Sample Preparation Kit v2 (Illumina Inc., USA) protocol was then followed. Libraries were quantified using the Bioanalyzer 2100 (Agilent, USA), library-specific nucleotide barcodes were added, and they were sequenced in multiplexed pools on the Illumina Hiseq 2000 instrument using a 100 base paired-end sequencing strategy (v3 chemistry). All RR sequence data were deposited in the European Nucleotide Archive (ENA) under accession number PRJEB4796.
RAD sequencing was undertaken for five randomly selected family groups (both parents and six offspring; n = 40 individuals) from each of the four year-group subpopulations that comprise the LNS broodstock population. For each year-group, four RAD libraries were constructed; two parental libraries (five individuals each) and two offspring libraries (15 individuals each). Equimolar amounts of all four libraries were combined and run on a single Illumina Hiseq 2000 lane, giving three-fold deeper coverage of parental samples cf. offspring. The RAD library preparation protocol employed in this study has been fully documented elsewhere . Essentially it is the methodology originally described by Baird et al. and comprehensively detailed by Etter et al., with minor procedural modifications. In brief, DNA was extracted using Biosprint96 DNA extraction kits (Qiagen, UK) following the manufacturers protocol and treated with RNase to remove residual RNA. DNAs were quantified by spectrophotometry (Nanodrop), quality assessed by agarose gel electrophoresis, and was finally diluted to a concentration of 50 ng/μL in 5 mmol/L Tris, pH 8.5. Each sample (1.5 μg parental DNA or 0.5 μg offspring DNA) was digested at 37°C for 45 minutes with SbfI high fidelity restriction enzyme (New England Biolabs, USA; NEB) using 6U SbfI per μg genomic DNA in 1× Reaction Buffer 4 (NEB) at a final concentration of c. 1 μg DNA per 50 μL reaction volume. Following heat inactivation at 65°C for 20 minutes, individual specific P1 adapters, each with a unique 5 base barcode were ligated to the SbfI digested DNA. Following heat inactivation individual ligation reactions were then combined in appropriate multiplex pools / libraries (5 parental samples or 15 offspring samples each). Shearing (Covaris S2 sonication) and initial size selection (250 – 500 bp) by agarose gel separation was followed by gel purification, end repair, dA overhang addition, P2 paired-end adapter ligation, library amplification, as in the original RAD protocol . A total of 150 μL of each amplified library (14 - 16 PCR cycles) was size selected (c. 300 - 550 bp) by gel electrophoresis and eluted into 20 μL EB buffer (MinElute Gel Purification Kit, Qiagen, UK.) Libraries were accurately quantified by qPCR (Kapa Library), combined as appropriate and run on an Illumina Hiseq 2000. Two of the four year-class sample sets were pair-end sequenced, the other two were single end sequenced (v3 chemistry; 100 base reads). Raw reads were processed using RTA 126.96.36.199 and Casava 1.6 (Illumina, USA) and all sequence data were deposited in the ENA under accession numbers PRJEB4783 (paired-end data) and PRJEB4785 (single-end data).
The sequence data used to generate the RNA-Seq SNP dataset were part of a larger ongoing study with the aim of investigating the transcriptome of Atlantic salmon fry with disparate genetic resistance to the Infectious Pancreatic Necrosis Virus (IPNV). Briefly, three families derived from Landcatch (UK) broodstock were challenged with IPNV at the Centre for Environment, Fisheries and Aquaculture Science (Cefas) in Weymouth, UK. Details on the challenge protocol have been described previously . From each family, one group of fry were sampled prior to challenge and one group were sampled one day post-challenge and stored at -80°C until processing. Fish were euthanised using a non-schedule 1 method under a procedure specifically listed on the appropriate Home Office (UK) license and all experiments were performed under approval of Cefas ethical review committee and complied with the Animals Scientific Procedures Act .
RNA-Seq libraries each comprised of six individual homogenised whole fry (each ~0.5 g) per family per timepoint (total n = 72). Each fry was homogenised in 5 ml TRI Reagent (Sigma, USA) using a Polytron mechanical homogeniser (Kinemetica, Switzerland). The RNA was isolated from 1 ml of the homogenate, using 0.5 vol. RNA precipitation solution (1.2 mol/L sodium chloride; 0.8 mol/L sodium citrate sesquihydrate) and 0.5 vol. isopropanol. Following re-suspension in nuclease-free water, the RNA was purified using the RNeasy Mini kit (Qiagen, UK). The RNA integrity numbers from the Bioanalyzer 2100 (Agilent, USA) were all over 9.9. Thereafter, the Illumina Truseq RNA Sample Preparation kit v1 protocol was followed directly, using 4 μg of RNA per sample as starting material. Libraries were checked for quality and quantified using the Bioanalyzer 2100 (Agilent, USA), before being sequenced in barcoded pools of 12 individual fish on the Illumina Hiseq 2000 instrument (100 base paired-end sequencing, v3 chemistry) and all sequence data were deposited in the ENA under accession number ERP003968.
SNP discovery and filtering
All reads were aligned to the Atlantic salmon reference genome assembly (NCBI Assembly GCA_000233375.1) using BWA 0.5.9 , allowing up to 4 mismatches per 100 bases. Predicted allele frequencies were derived from SAMtools  mpileup v0.1.19 using default settings. To exclude putative PSVs, genotypes were called in the library derived from the haploid Atlantic salmon embryo using GATK UnifiedGenotyper v2.1.9  and any SNP showing a heterozygous genotype (genotype quality >20) was removed (n = 133,029). Of the putative SNPs remaining, those with an allele frequency of ≤ 0.1 or a read depth of ≤ 10 (n = 172,501) were removed. Finally, SNPs occurring within known genomic repeat elements defined according to the salmonid-specific repeat-masker (http://grasp.mbb.sfu.ca/GRASPRepetitive.html) (n = 67,445) were removed leaving 99,097 candidate RR-Seq-derived SNPs.
All reads were aligned to the Atlantic salmon reference genome assembly (NCBI Assembly GCA_000233375.1) using BWA 0.5.9 , allowing up to 4 mismatches per 100 bases. Duplicated reads originating from PCR were marked using Picard and subsequently ignored. GATK UnifiedGenotyper v 2.1-9  was used to detect and genotype putative SNPs, enabling the base-alignment quality (BAQ) calculation and otherwise using the default parameters. Genotypes with a quality score of >20 were retained and SNPs that demonstrated two or more mendelian errors or significant mendelian distortion (chi2 P <0.05) in any of the families (n = 344,278) were removed. The remainder were repeat-masked as above (39,839 removed) leaving 83,151 candidate RAD-Seq-derived SNPs.
Bowtie2 v2.1.0 alignment software  was used for alignment of the generated RNA-seq reads with requirements of a perfect end-to-end and gapless alignment of seed substrings of 32-mers. Each sample was aligned to the Atlantic salmon genome assembly (NCBI Assembly GCA_000233375.1). SAMtools v0.1.19  was then used to identify any SNPs within the aligned sequences or between the Atlantic salmon genome assembly and the aligned sequences. SNP calls were generated with default SAMtools  pileup settings and standard SNP filters. Only the 426,135 transversions (which are best-suited for inclusion on the Axiom array) with a predicted MAF ≥ 0.1 were retained. These were repeat-masked as above (196,381 removed) which left 229,754 candidate RNA-Seq-derived SNPs.
All newly discovered filtered SNPs from the RR-Seq, RAD-Seq and RNA-Seq experiments were submitted to dbSNP (NCBI ss# 947429275 - 947844429) .
Publicly-available and other SNPs
A non-exhaustive list of publicly-available Atlantic salmon SNPs (n = 9,084) was created as an additional set of candidate SNPs for inclusion on the ssalar01 array. This list included all SNPs in dbSNP  which included the SNPs described in Lien et al., the SNPs described in Moen et al. and the QTL-linked SNPs described in Houston et al.. An additional eight unpublished SNPs discovered in our laboratories were added. The flanking sequence for these SNPs was aligned to the reference genome and the SNPs were included as candidates for submission to Affymetrix if they mapped to a single unique genomic location and contained sufficient flanking sequence for probe design (30 bases of flanking sequence).
Y-specific probes for test of genetic sex
A homology search of the rainbow trout Y-specific master sex-determining gene sdY ( Accession AB626896.1) identified an Atlantic salmon EST (Accession CK897399.1) comprising part of Exon 4 and the 3′UTR sequence of SRY in Atlantic salmon. Using PCR primer sets designed from both these sequences, partial sequences from the SRY gene in Atlantic salmon were obtained by direct sequencing of amplicons. Two contigs were produced (Additional file 4); a 2,149 nt fragment comprising most of exon 2 and exon 3 with intervening intron and a 1,147 nt fragment comprising exon 4 with partial upstream intron and downstream 3′ UTR sequence. Following repeat masking using the salmonid-specific repeat-masker (http://grasp.mbb.sfu.ca/GRASPRepetitive.html), a series of 87 partly overlapping, potentially Y-specific probes (Additional file 5) were designed to both DNA strands, according to Axiom non-polymorphic gender probe guidelines .
Affymetrix Axiom array creation and genotyping
The candidate SNPs were provided to Affymetrix as 71-mer nucleotide sequences from the forward strand with the alleles at the target SNP highlighted at position 36. Using proprietary software, ‘p-convert’ values (representing the probability of a given SNP converting to a reliable SNP assay on the Axiom array system; see ) were computed for each submitted SNP sequence. Potential probes were designed for each SNP in both the forward and reverse direction, each of which is designated as ‘recommended’, ‘neutral’, or ‘not recommended’ based on p-convert values. All ‘recommended’ probes were included and ‘neutral’ probes were included if paired with a ‘recommended’ or a ‘neutral’ probe, resulting in the tiling of probes for 266,105 putative SNPs, this being 93% of the capacity of the array. To fill the remainder of the array, the following categories of putative SNPs were included: (i) putative SNPs discovered in more than one sequencing experiment with low p-convert score; (ii) SNPs mapping to two locations in the reference genome with a p-convert score over 0.6; and (iii) previously verified SNPs (from the ‘Publicly available and other’ category) with a non-zero p-convert score. In the final array, most of the SNPs are interrogated by two independent probesets, designed at the 5′ and at the 3′ of the SNP. The R package ‘SNPolisher’ is used to choose the best performing probeset for every SNP. A probeset will have one or two different probes on the array, depending on the base change (A/T and C/G SNPs require two different probes). Each probe is tiled twice on the array, which means that there are two identical independent copies of each probe spatially separated on the array to provide robustness against potential local image artifacts. During the analysis, the signal from the two probes is summarized to provide a single signal estimate for each SNP.
A test plate of 96 genomic DNA samples from Atlantic salmon of various sources was genotyped using the ssalar01 array (Additional file 2). These samples comprised 47 representative samples of Atlantic salmon distributed across all four yeargroups of the Landcatch Natural Selection (Ormsary, UK) breeding program (termed ‘Farmed Scottish’), eight Atlantic salmon originating from Aquagen (Trondheim, Norway) and eight Atlantic salmon originating from Salmobreed (Bergen, Norway) (together termed ‘Farmed Norwegian’), 24 Atlantic salmon from Br5 and Br6 SalMap families , three Atlantic salmon sourced from the River Dee (Scotland), two from the River Corrib (Ireland), two from the River Hopselv (Norway) and two from the River Lerez (Spain) (together termed ‘Wild’). Details of the Axiom SNP genotyping and quality-control procedures are given elsewhere [37, 56]. Briefly, each SNP allele generates a hybridisation signal and the size and contrast of these signals is computed for each SNP for each individual to generate genotype clusters using the Axiom GT1 algorithm. The analysis consists of a pre-processing stage which includes image artefact reduction and an algorithm that filters out contiguous probes with unexpected intensity level, if they occur. This is followed by a quantile normalization on the two Axiom channels separately and median polish summarization to generate intensity signals for the A and B alleles. For the genotype calling, the allele-signal estimates derived in the pre-processing stage are the input values to the clustering algorithm. These signal values are transformed into the contrast-size (also called MvA) space used for clustering  defined in the following way: Contrast = logA – logB, and Size = (logA + logB)/2. The first stage of clustering evaluates all possible placements of two vertical boundaries (to define three genotype clusters) between data on the X axis, computing for each a posterior likelihood given the data and a Bayesian prior on cluster locations. After identifying the labeling of maximum likelihood, the prior two-dimensional Gaussian mixture model is updated in a Bayesian fashion to produce a posterior model that is used to make genotype calls; the same posterior can also be used as a prior for future clusterings. In the final stage, genotype calls are assigned by associating each sample to the closest posterior model.
The SNPs were split into categories according to their clustering performance with respect to various Axiom-generated quality-control criteria; (i) ‘polymorphic high resolution’ where the SNP passes all QC, (ii) ‘monomorphic high resolution’ where the SNP passes all QC except the presence of a minor allele in two or more samples, (iii) ‘call rate below threshold’ where genotype call rate is under 97%, (iv) ‘no minor homozygote’ where the SNP passes all QC but only two clusters are observed, (v) ‘off-target variant’ (OTV) where atypical cluster properties arise from variants in the SNP flanking region, and (vi) ‘other’ where the SNP does not fall into any of the previous categories. OTVs are reproducible and previously uncharacterized variants that interfere with genotyping a SNP and usually display substantially low hybridization intensities and are centred at zero in the contrast dimension (A - B). This could be due to a SNP in the flanking sequence of one or both Atlantic salmon paralogues. This can result in miscalling of individuals as heterozygous (AB). However, they usually sit below the heterozgous cluster on the y-axis [(A + B)/2]. Such miscalled heterozygotes can be identified using ‘OTV_Caller’ which is part of the SNPolisher (an R package available from Affymetrix). The Expectation-Maximization (EM) algorithm is used with the posterior information to identify which samples should be in the OTV cluster and which samples should remain in the AA, AB, or BB clusters. In this study, only SNPs from categories (i) and (iv) were included in further analyses. These filtered SNP data were analysed for allele frequency distribution and Mendelian inheritance using the software Plink .
To map a subset of the QC-filtered SNPs to chromosomes, a sire-based linkage analysis was performed for a subset of offspring in the two ‘SalMap’ families  (all parents and 10 offspring per family; total n = 24) using the CriMap software  as modified by Xuelu Liu (Monsanto, USA). This analysis relied on the lack of male recombination in centromeric regions of the male salmonid genome, and this feature facilitated mapping of markers to linkage groups according to identical or near-identical sire-based inheritance patterns. The number of offspring per family was too small to determine marker positions within those linkage groups. Firstly, the QC-filtered SNPs which had the segregation pattern AB (sire) × AA or BB (dam) in at least one of the families were identified. Secondly, a ‘two-point’ linkage analysis was performed to determine the LOD scores between all pairs of markers in randomly selected pools of ~5,000 SNPs including anchor markers from each of the 29 pairs of Atlantic salmon chromosomes (Additional file 1: Table S2). Thirdly, the ‘autogroup’ option was used to cluster markers into linkage groups, starting with more stringent parameters and proceeding to less stringent parameters. The parameter settings for ‘autogroup’ were: Layer 1 (5, 2.0, 4, 0.9); Layer 2 (4, 1.5, 4, 0.7); Layer 3 (3, 1.0, 4, 0.6); Layer 4 (2.5, 0.5, 4, 0.3). The final layer corresponded to a LOD score of 2.5 which was necessarily lower than the typical threshold of 3.0 to include SNPs that were segregating in only one sire and were inherited without recombination (LOD ~ 2.7). For chromosomes 2 and 6, and chromosomes 22 and 23, the sire-based inheritance pattern was very similar in one of the families which resulted in conflicting linkage assignments. Therefore, those linkage groups were defined using sire-segregation of markers in the other family only.
The ability of the ‘ssalar01’ Axiom array to identify distinct genetic populations and population structure was evaluated on all the unrelated samples (as Table 3) based on pairwise IBS distance calculated using the software Plink . A multidimensional scaling analysis on the N × N matrix of genome-wide IBS pairwise distances was performed and a scatterplot of the individuals based on their position on the first two dimensions was created.
Availability of supporting data
The sequencing data from this study have been deposited in the European Nucleotide Archive (ENA) http://www.ebi.ac.uk/ena/ under accession numbers PRJEB4796, PRJEB4783, PRJEB4785 and ERP003968, and the SNP details have been submitted to dbSNP https://www.ncbi.nlm.nih.gov/SNP/ under NCBI ss# 947429275 - 947844429. Other supporting data are available as additional files.