Plant materials
SNP genotyping was conducted on two populations of coastal Douglas-fir (C1 and C2) and one population of interior Douglas-fir (I1). C1 consisted of 1825 trees (1907 samples) from two breeding populations managed by the Northwest Tree Improvement Cooperative: Coos Bay Low and South Central Coast. Trees were selected using three cycles of breeding and testing. The first-cycle selections consisted of 61 trees from native stands in coastal southern Oregon, plus 7 of their progeny growing in first-cycle field tests (i.e., 5 open-pollinated trees and 2 controlled-cross progeny). Next, we selected 609 trees from the second-cycle field tests; i.e., controlled-cross progeny of first-cycle selections. Finally, we genotyped 1033 progeny of the second-cycle selections. These trees came from 24 full-sib families that were growing in the greenhouse. Because some inter-generational crosses were used, the resulting pedigree was complex. In addition to these pedigreed trees, we genotyped 59 trees from a single woodsrun seedlot and 56 trees of uncertain parentage. Because the duplicated tree samples were handled separately (i.e., independent DNA isolations and SNP genotyping), these samples were included in analyses designed to test array performance. However, we used 112 unrelated trees (one sample per tree) to calculate population genetic statistics. Some of the unrelated C1 trees were derived from crosses between trees collected from different geographic locations. C2 consisted of 384 coastal Douglas-fir trees from western Oregon and Washington (one sample per tree). We used these trees to help judge the performance of the Axiom array, and then selected 283 unrelated trees for calculating population genetic statistics. The I1 samples consisted of foliage collected from 13 trees in native stands. All of the genotyped trees described above came from the same broad areas that were sampled for SNP discovery [17, 32]. 
For each C1, C2, and I1 sample, 10–15 young needles were placed in vials with granular silica gel desiccant (Activa Flower Drying Art), and then stored at room temperature. Once dry, 3 needles were cut into ~2 mm lengths, placed in 96-well plates, and then stored at −20 °C.
DNA isolation
Dry needles were pre-treated with liquid nitrogen, and then pulverized in a shaker with tungsten beads for two cycles of 60 s at 20 Hz. DNA was isolated using the DNeasy 96 Plant Kit (Qiagen), with the addition of a proteinase K treatment. DNA concentrations were measured using the PicoGreen fluorescent dye (Invitrogen) and a Gemini XPS microplate reader (Molecular Devices). Samples with concentrations ≥ 20 ng·μl−1 were used for SNP genotyping.
Selection of SNPs for the Axiom array
SNPs tested on the Axiom array were derived from the Oregon State University (OSU) dataset described by Howe et al. [17] and the University of Hohenheim (UH) dataset described by Müller et al. [32]. The OSU discovery panel consisted of coastal Douglas-fir trees sampled across Oregon and Washington, plus interior Douglas-fir trees collected across much of its range [17]. The UH discovery panel consisted of Douglas-fir trees from British Columbia, Washington, Colorado, and New Mexico [32]. For the OSU SNPs, we used SNP probabilities and past genotyping success to select SNPs for the Axiom array. PS and PF are the p-values associated with a SNP being a true target SNP or a true variant in the SNP flanking region (i.e., SNP or indel). These were calculated using the methods described by Wei et al. [42], using a MAF value of 0.01 and a sequence error rate of 0.01 [17]. From the OSU dataset of 676,030 SNPs, we selected a total of 338,663 SNPs to be evaluated for inclusion on the Axiom array (Fig. 1). Of these, 337,938 were selected because they were detected using a target SNP probability (PS) of 0.001 (high-confidence SNPs; Fig. 1). Although they had higher p-values, we added another 725 SNPs because they had been successfully genotyped using the Infinium platform [17]. Overall, the evaluated dataset included 5847 SNPs that had been previously genotyped using the Infinium array. Of the 338,663 evaluated SNPs, we identified 208,258 that were ‘buildable’; i.e., had at least one 35-nt flanking sequence with no other SNPs or indels using a flanking SNP probability (PF) of 0.001. To this dataset, we added 13,410 buildable SNPs from the UH dataset that were chosen to target transcripts not already represented in the OSU dataset. To identify novel transcripts, we compared 141,626 UH assembled isotigs (excluding singletons [32]) to the OSU reference transcriptome using a BLAST E-value cutoff of 10−10. Isotigs are transcript variants assembled using the Newbler de novo assembler [17, 32].
We identified 63,286 novel isotigs, 8617 of which contained biallelic SNPs (40,206 SNPs). From these, we selected 16,859 high-confidence SNPs that were detected by two or three SNP detection programs [32]. From these, we identified 13,410 SNPs that were ‘buildable’; i.e., had at least one 35-nt flanking sequence that did not have flanking SNPs detected by two or three SNP detection programs. In total, we sent 221,668 candidate SNPs from the OSU and UH datasets to Affymetrix (now Thermo Fisher Scientific) to be evaluated by their proprietary software (Fig. 1). After the filtering steps described below, we included 55,766 SNPs on the Axiom array.
Each candidate SNP sent to Affymetrix consisted of the target SNP plus two 35-nt flanking sequences (total = 71 nt). For each of the two flanking sequences per SNP (forward and reverse), Affymetrix calculated a Repetitive indicator variable, a pConvert score, and a Recommendation. For the Repetitive variable (T, F), Affymetrix counted the number of 16-nt hits between the SNP sequence and the supplied reference genome. Any flanking sequence with more than 300 hits was classified as repetitive (T). Affymetrix used the v0.5 Douglas-fir reference sequence for this analysis (asm-1.scafSeq.fasta, 5/11/2015 [16]). The pConvert score (0–1) reflects the relative probability of probe success based on the thermodynamics of the probe and the number of 16-nt matches to the reference genome. Probesets with a Repetitive score of T were assigned a pConvert score of 0, and higher pConvert scores indicate a greater probability of SNP success. Affymetrix classified probesets as either ‘not possible’ or ‘buildable,’ and then for the buildable probesets, used the pConvert score to classify each probeset as either ‘recommended’ (0.6 ≤ pConvert ≤ 1.0), ‘neutral’ (0.4 ≤ pConvert < 0.6), or ‘not recommended’ (0 ≤ pConvert < 0.4).
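The Recommendation thresholds above amount to a simple binning rule, sketched here in Python (the function and argument names are ours, not Affymetrix's):

```python
def classify_probeset(p_convert, repetitive=False):
    """Bin a buildable probeset by its pConvert score (0-1), following the
    thresholds in the text; Repetitive (T) probesets are assigned pConvert = 0."""
    if repetitive:
        p_convert = 0.0
    if 0.6 <= p_convert <= 1.0:
        return "recommended"
    if 0.4 <= p_convert < 0.6:
        return "neutral"
    return "not recommended"
```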
To design the array, we first removed candidate SNPs that had no acceptable probesets. Unacceptable probesets were those with (1) no corresponding 71-nt matches in the reference genome, (2) SNPs or indels in their target sequences (i.e., in the forward or reverse flanking sequences, PF = 0.001), or (3) Affymetrix classifications of ‘not recommended’ or ‘not possible.’ However, we did not remove SNPs if they had already been successfully assayed using the Infinium platform [17], as long as they had at least one buildable probeset. Second, we removed most A/T and C/G SNPs because they occupy twice as much room on the array (i.e., require two probesets to assay).
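The three exclusion criteria, together with the Infinium exception, can be read as a single predicate per probeset; a sketch under the assumption that the inputs have already been computed (all argument names are hypothetical):

```python
def probeset_acceptable(n_71nt_matches, has_flank_variant, recommendation,
                        infinium_success=False, buildable=True):
    """Apply the three exclusion criteria from the text. SNPs already
    genotyped successfully on the Infinium platform are retained as long
    as they are buildable."""
    if infinium_success:
        return buildable
    if n_71nt_matches == 0:      # criterion 1: no 71-nt match in the reference genome
        return False
    if has_flank_variant:        # criterion 2: SNP/indel in the flanking sequence (PF = 0.001)
        return False
    # criterion 3: Affymetrix classification
    return recommendation not in ("not recommended", "not possible")
```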
To select probesets for the array, we first ranked transcripts and probesets-within-transcripts based on the various criteria described below. Then, to maximize genome coverage, we selected the best probeset from each transcript. Because the number of SNPs on the array exceeded the number of transcripts, we cycled through the ranked list of transcripts more than once. In general, we selected one probeset per SNP. However, when this was impossible (i.e., when the transcript had too few SNPs), we selected two probesets per SNP to increase the number of transcripts with successful SNP assays. Thus, on the final array, most SNPs were interrogated by the single best forward or reverse probeset using the criteria described below. Ultimately, we included 58,350 probesets representing 55,766 SNPs on the array. The ranking criteria used in this process are described next.
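The cycling procedure above is essentially a round-robin draw over the ranked transcript list; a sketch of that selection logic (the data layout is ours, not the actual pipeline):

```python
def select_probesets(ranked_transcripts, n_target):
    """Round-robin over transcripts ranked best-first: take each transcript's
    best remaining probeset, cycling through the list until n_target probesets
    are chosen. `ranked_transcripts` is a list of per-transcript probeset
    lists, each sorted best-first."""
    queues = [list(probesets) for probesets in ranked_transcripts]
    selected = []
    while len(selected) < n_target and any(queues):
        for queue in queues:
            if queue and len(selected) < n_target:
                selected.append(queue.pop(0))  # best remaining probeset
    return selected
```

With three transcripts and a target of four probesets, each transcript contributes its best probeset on the first pass, and the top-ranked transcript contributes a second on the next cycle.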
We ranked the OSU transcripts using three variables. The first variable was the number of BLAST hits between the SNP sequence and the reference genome averaged across all SNPs in the transcript. One BLAST hit was counted for each 65-nt match between the 71-nt SNP sequence and the reference genome. The rank order for this variable (best to worst) was 1, > 1, and 0 hits. The second variable was the transcript confidence score described by Howe et al. [17] (lower is better). These confidence scores were previously derived by comparing our Douglas-fir transcripts to a set of white spruce (Picea glauca) unigenes [43]. White spruce was chosen because it represents a closely related genus in the Pinaceae, and had a particularly well curated set of transcript sequences available for comparison. Lower confidence scores represent simpler relationships and (hypothetically) greater confidence that the Douglas-fir assembly was correct. The third variable was the number of SNPs per transcript (higher is better). Transcripts with more SNPs were ranked higher because they have more probesets from which to select the best one. The UH transcripts did not have OSU confidence scores because they were identified as part of a separate SNP discovery project [32]. Thus, they were ranked only by the number of BLAST hits and number of SNPs per transcript.
After ranking the transcripts, we ranked probesets-within-transcripts based on five criteria. First, SNPs successfully genotyped with the Infinium platform were ranked at the top (other SNPs had not been tested using Infinium). The second variable reflected the likelihood of having SNPs or indels (variants) in the flanking sequences. This was accomplished by accounting for variants in the flanking sequences at multiple probability levels. For the OSU target SNPs, we used flanking probabilities of 0.1, 0.01, and 0.001 (PF) to identify low-, medium- and high-confidence SNPs. For the UH dataset, we identified low-, medium- and high-confidence SNPs based on the number of programs used to call a SNP (1, 2, or 3 SNP detection programs [32]). Next, we identified probesets that had no flanking SNPs, even when all possible SNPs were considered (i.e., low-, medium-, and high-confidence SNPs). These probesets were assigned a rank of 1 (highest priority) because they are least likely to have undetected SNPs in the flanking sequence. Then, we repeated this process after excluding the low confidence SNPs (rank = 2), and then after excluding all but the high confidence SNPs (rank = 3). The third variable was the number of perfect SNP alleles (71-nt sequences) found in the reference genome. We used BLAST to compare each of our two SNP alleles to the reference genome, and then counted the number of these alleles that had at least one match (i.e., possible counts are 1, 2, or 0 alleles, in rank order). Because the reference genome is haploid, a count of 1 suggests the SNP occupies a single genome location. A count of 2 indicates the SNP occupies more than one genome location (i.e., one locus for each SNP allele). A count of 0 suggests there are no matching sequences in the genome. This would occur if the reference had a third alternative allele (e.g., for triallelic SNPs), the target sequence spans an intron, or the target sequence is missing from the genome assembly. 
Counts of 2 or 0 may also occur via misassembly of our transcriptome sequence or reference genome. The fourth and fifth variables were the Affymetrix pConvert score (higher is better) and the probability of the target SNP. These probabilities were based on the OSU target SNP probabilities (PS for the OSU SNPs, smaller is better) or the number of programs that were used to call the target SNP (UH SNPs, higher is better).
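Taken together, the five criteria define a lexicographic ordering of probesets within a transcript; a simplified sketch (the input encodings are ours):

```python
# Perfect-allele counts in rank order (1 best, then 2, then 0), per the text.
ALLELE_COUNT_RANK = {1: 0, 2: 1, 0: 2}

def probeset_sort_key(infinium_success, flank_rank, n_perfect_alleles,
                      p_convert, target_snp_rank):
    """Lexicographic key over the five within-transcript criteria (lower
    sorts first). `flank_rank` is the 1-3 flanking-variant rank described in
    the text; `target_snp_rank` encodes PS (OSU) or the program count (UH),
    recoded so that lower = better."""
    return (0 if infinium_success else 1,   # 1: prior Infinium success
            flank_rank,                     # 2: flanking-variant confidence
            ALLELE_COUNT_RANK[n_perfect_alleles],  # 3: perfect allele count
            -p_convert,                     # 4: higher pConvert is better
            target_snp_rank)                # 5: target SNP probability
```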
Array processing and SNP calling
DNA samples were processed by GeneSeek (Neogen Genomics, Lincoln, NE) using the standard protocol for the Affymetrix Axiom array. Samples from populations C1 and I1 were processed jointly in two batches and then analyzed together. Samples from population C2 were processed and analyzed separately from C1 and I1. Raw SNP data were analyzed using Axiom Analysis Suite v.1.1.0.616 and the Best Practices Workflow (Affymetrix, Santa Clara, CA). We conducted three types of analyses (Default, Rescue, and Modified) using two phases of quality control (QC) filtering. The Default protocol used the Affymetrix diploid (default) QC thresholds [12]. SNPs were filtered using a SNP call rate cutoff (cr-cutoff) ≥ 97%. Samples (trees) were filtered using a Dish-QC threshold (axiom_dishqc_DQC) ≥ 0.82 and a sample call rate (qc_call_rate) ≥ 97%. The sample call rate is the average SNP call rate across all SNPs for a sample. Plates were filtered using a percent of passing samples (plate_qc_percentsamplespassed) ≥ 95% and a plate call rate (plate_qc_averagecallrate) ≥ 98.5%. The plate call rate is the average sample call rate for passing samples on a plate. Using the Default protocol, the Axiom Analysis Suite classifies SNPs into six categories: OTV, Other, CallRateBelowThreshold, NoMinorHom, MonoHighResolution, and PolyHighResolution [11]. SNPs in the PolyHighResolution class (polymorphic high-resolution) were considered successful Axiom SNPs. We also used four SNP rescue protocols that employed two phases of SNP filtering. We used the default thresholds in Phase 1, and then re-classified the SNPs in the Other and CallRateBelowThreshold categories into a Rescued category if they passed a (lowered) SNP call rate cut-off of 90, 80, 70, or 60%. In Phase 2, we used the Ps_CallAdjust SNPolisher function to lower the SNP Confidence Score threshold from the default of 0.15 to 0.10 [11]. 
The more stringent threshold (0.10) increases the number of missing values (no calls), but improves genotyping accuracy. In the Modified protocol, we changed the default thresholds in Phase 1 as follows: SNP call rate cutoff (cr-cutoff) ≥ 95%; sample Dish-QC threshold ≥ 0.50, sample call rate threshold (qc_call_rate) ≥ 80%, plate percent of passing samples (plate_qc_percentsamplespassed) ≥ 80%, and plate call rate threshold (qc_averagecallrate) ≥ 90% [12]. We then used a SNP call rate threshold ≥80% and SNP Confidence Score threshold ≥ 0.10 in Phase 2 [11]. For the Rescue and Modified protocols, SNPs in the PolyHighResolution and Rescued classes were considered successful SNPs.
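The sample-level gates of the Default and Modified protocols can be summarized as a small predicate (a sketch of the thresholds quoted above, not Axiom Analysis Suite code):

```python
def sample_passes_qc(dish_qc, call_rate, protocol="default"):
    """Sample-level QC under the Default vs. Modified Phase 1 thresholds
    described in the text; call_rate is a proportion (0-1)."""
    if protocol == "default":
        return dish_qc >= 0.82 and call_rate >= 0.97
    if protocol == "modified":
        return dish_qc >= 0.50 and call_rate >= 0.80
    raise ValueError("unknown protocol: " + protocol)
```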
SNP population genetic statistics
We selected 395 unrelated trees from the C1 and C2 populations, and then used the SAS ALLELE procedure (SAS v.9.4; Statistical Analysis System, Cary, NC) to calculate SNP call rate (CR), MAF, observed and expected heterozygosities (HETobs, HETexp), polymorphic information content (PIC), and probabilities of deviation from HWE using a chi-square goodness-of-fit test. We excluded SNPs that were not in HWE (P < 0.01), calculated statistics separately for each population (C1 and C2), and then averaged the values across the two populations. These analyses were conducted using the Default and Rescue protocols (CR = 97, 90, 80, 70, and 60%).
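For a biallelic SNP, the statistics reported by the ALLELE procedure reduce to a few allele-frequency formulas; a minimal re-implementation sketch (the genotype coding is ours, and this mirrors the statistics, not SAS itself):

```python
from collections import Counter

def snp_stats(genotypes):
    """CR, MAF, observed/expected heterozygosity, and PIC for one biallelic
    SNP. `genotypes` holds two-letter calls such as 'AG', or None if missing."""
    calls = [g for g in genotypes if g is not None]
    cr = len(calls) / len(genotypes)                 # SNP call rate
    counts = Counter(a for g in calls for a in g)    # allele counts
    n = sum(counts.values())
    p, q = sorted(c / n for c in counts.values())    # p = minor allele freq
    het_obs = sum(1 for g in calls if g[0] != g[1]) / len(calls)
    het_exp = 1 - p * p - q * q
    pic = het_exp - 2 * p * p * q * q                # PIC, biallelic case
    return {"CR": cr, "MAF": p, "HETobs": het_obs, "HETexp": het_exp, "PIC": pic}

def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit statistic (1 df) for HWE from observed
    genotype counts at a biallelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
```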
Genomic locations and annotations of Douglas-fir SNPs
BLASTN [44] was used to determine the genomic locations and annotations of the Douglas-fir SNPs. We ran BLASTN using default parameters, Axiom probeset sequences (71-mers) as the queries, and four reference genome datasets as the targets: psme.transcript.fna (08/25/2015), psme.allgenes.transcripts.fasta (08/22/2016), Psme_v1.0.scaffolds.fasta (11/12/2015), and Psme_v1.0.singletons.fasta (11/12/2015). These files are part of the Douglas-fir reference sequence database v1.0 [16]. Results from the four BLASTN runs were included in Additional file 2.
Predictors of SNP success
We evaluated two sets of variables as predictors of SNP success: (1) array design variables and (2) variables calculated using v1.0 of the Douglas-fir reference genome. As described above, some of the array design variables were calculated using v0.5 of the reference genome. After a new reference genome became available (v1.0), we calculated new BLAST variables using the same SNP sequences (71-mers) as queries. The draft reference genome (v0.5) had 18.5 M scaffolds, whereas the newer genome assembly (v1.0) had 2.8 M scaffolds [15].
First, we tested for associations between probeset success (i.e., success or failure; N = 58,350) and ten array design variables, including four transcript variables, five probeset-within-transcript variables, and one final rank variable. Although they were not used to select SNPs, we also analyzed the ‘recommended’ and ‘neutral’ classes of the Affymetrix Recommendation variable, which bin the pConvert score. Two variables that were approximately normally distributed (No. of SNPs per transcript and pConvert) were analyzed using a t-test and a Satterthwaite adjustment for heterogeneous variances. Two rank variables (Combined rank for transcripts and Final rank) were analyzed using a Wilcoxon rank test with Monte-Carlo estimation of p-values in SAS PROC NPAR1WAY. The remaining variables (see Results) were analyzed as categorical variables using a likelihood ratio chi-square test for independence. Probesets were considered a success if they resulted in polymorphic genotypes in either the C1/I1 or C2 population using the Rescue QC protocol and a CR of 60%.
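The likelihood-ratio chi-square test of independence used for the categorical variables can be sketched as a G-statistic over a success/failure contingency table (illustrative only; the analyses in the text were run in SAS):

```python
import math

def g_statistic(table):
    """Likelihood-ratio (G) chi-square statistic for an R x C contingency
    table, e.g. probeset success/failure (rows) by variable class (columns).
    Compare to a chi-square distribution with (R-1)*(C-1) df."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                g += 2 * observed * math.log(observed / expected)
    return g
```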
Second, we tested for associations between SNP success (N = 55,766) and seven variables calculated using v1.0 of the reference genome. We used BLASTN to compare SNP sequences (71-mers) to v1.0 scaffolds, singletons, gene models and transcripts using a percent identity (PID) cutoff of 80% for missing values (i.e., no hits). The first two variables consisted of the PID of the best scaffold hit and the PID of the second-best scaffold hit. Although these variables were continuous, many SNPs fell into two classes (PID = 100% or PID = 80% for missing values). Therefore, to test for differences in SNP success, we binned the SNPs into two groups (PID > 80% and PID ≤ 80%). We also evaluated a third variable that captured differences in PID between the top two scaffold hits. For these analyses, we compared SNP success among three SNP categories: (1) PID > 80% for the top two scaffold hits, (2) PID > 80% for the top scaffold hit and ≤ 80% for the second-best hit, and (3) PID ≤ 80% for the top two scaffold hits. A likelihood ratio chi-square test was used to compare the successful and failed SNPs for each variable. For the remaining variables, we counted the number of hits at a PID > 90%, binned these numbers into three categories (1, > 1, and 0, in rank order), and then used a likelihood ratio chi-square test to compare the successful and failed SNPs. This analysis was conducted separately for scaffolds, singletons, gene models, and transcripts, resulting in four variables. SNPs were considered a success if they had at least one probeset that was successful using the criteria described above.
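The three-way binning of the top two scaffold hits is a small piece of logic worth stating explicitly (a sketch; function name is ours):

```python
def pid_category(pid_top, pid_second):
    """Bin a SNP by the percent identity (PID) of its top two scaffold hits,
    per the three categories in the text (missing hits were coded PID = 80)."""
    if pid_top > 80 and pid_second > 80:
        return 1   # both hits above 80% identity
    if pid_top > 80:
        return 2   # only the top hit above 80%
    return 3       # neither hit above 80%
```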
After examining the individual variables, we used logistic regression to develop multivariate prediction equations for SNP success. In the first model, we used the seven array design variables as independent variables (i.e., predictors). In the second model, we dropped the number of SNPs per transcript and replaced the two v0.5 reference genome variables with six v1.0 variables. We conducted these analyses using the SAS LOGISTIC procedure, stepwise model selection option, and cross-validation.
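SAS PROC LOGISTIC with stepwise selection has no single-line Python equivalent, but the underlying model fit can be sketched with a toy single-predictor logistic regression, fit by gradient ascent on the Bernoulli log-likelihood (illustrative only, not the published analysis):

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit P(success) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent on
    the log-likelihood; xs are predictor values, ys are 0/1 outcomes."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1
```

With success rates that rise with the predictor, the fitted slope b1 is positive, indicating that higher values of the predictor are associated with a higher probability of SNP success.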