Exome capture
Exome capture is a target enrichment method for sequencing the protein-coding regions of a genome, which makes downstream analysis considerably more efficient. By targeting the exome, the focus is immediately narrowed to those regions most likely to harbor variation underlying the phenotype. The reduced cost and time relative to whole genome sequencing, together with the ability to capture a substantial proportion of variants, make exome capture attractive, particularly since functional variants are harder to identify in noncoding regions [20, 27, 28]. Moreover, forest trees tend to have large and complex genomes [20, 21, 22], and exome capture has been recognized as an effective method for expanding knowledge of large, unmapped genomes [19, 28].
Whilst exome capture is expected to produce some missing marker information, the method surpasses whole genome shotgun sequencing and genotyping methods based on genome complexity reduction in depth of coverage. Given a restricted budget (with no option for re-sequencing), the depth achieved using exome sequencing is expected to be much greater than that of genotyping-by-sequencing [30], with the added benefit of more reliable genotype calls. After sequencing, markers with a large proportion of missing information can be filtered out, although Rutkoski et al. [31] note that this step is not strictly necessary since it has little effect on GS prediction accuracy. To maintain power in subsequent analyses, marker imputation should be used to infer missing data in the records retained after filtering, conserving the majority of the original SNPs. This was done in the current study, although the proportion of missing data was very low and the impact of this procedure is therefore expected to be minimal. Given that no genetic map is currently available for Douglas-fir, imputation methods in this case are restricted to those that support unordered data. Various imputation methods have been proposed for unordered markers, but their efficacy and suitability for genomic selection are yet to be fully determined [31]. Some methods nonetheless show promising results when compared to mean imputation; notably, random forest regression produced greater GS prediction accuracy than its counterparts in the studies of both Rutkoski et al. [31] and Poland et al. [32].
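For illustration only, and not the imputation procedure used in the present study, the sketch below contrasts per-SNP mean imputation with a random-forest-based imputation of unordered 0/1/2 genotype calls using scikit-learn; the genotype coding, missingness rate, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Genotypes: n samples x p SNPs coded 0/1/2, with np.nan marking missing calls
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(100, 50)).astype(float)
genotypes[rng.random(genotypes.shape) < 0.05] = np.nan  # ~5% missing (assumed)

# Baseline: replace each missing call with that SNP's mean
mean_imputed = SimpleImputer(strategy="mean").fit_transform(genotypes)

# Random-forest-based imputation: each SNP with missing values is modelled
# from the other SNPs (no marker order required), iterating a few rounds
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0),
    max_iter=3,
    random_state=0,
)
rf_imputed = rf_imputer.fit_transform(genotypes)

# Round back to the nearest genotype class if hard calls are required
rf_imputed = np.clip(np.rint(rf_imputed), 0, 2)
```

Because this approach predicts each marker from all others rather than from neighbouring loci, it requires no genetic map, which is the property relevant to unordered Douglas-fir markers.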
Although we captured significant family effects using this SNP data set, we found little evidence that the GS models were able to capture marker-QTL LD (see below for discussion). It is highly likely that substantially more SNPs will be required to capture significant LD-based effects for these traits. In addition, our genotyping effort was focused on a restricted portion of the genome, the exome, and was limited to the 40 K probes available for the present study. In humans, the exome constitutes a mere 1% of the total genome [33]; given the unique genome size and complexity of conifers [34], we expect that the SNPs used in this study represent a very small fraction of the Douglas-fir genome. Furthermore, the captured exome represents functional genes that are, by default, under selection and thus conserved. By focusing the present study on this region, we were unable to capture variation that is not represented in the exome. To capture this variation in intergenic regions, an alternative, whole-genome approach to genotyping is needed, for example genotyping-by-sequencing, which uncovers unordered SNPs across the entire genome. These intergenic regions are not under the same selection pressure as the exome and may contain important regulatory sequences controlling the traits we investigated. Another approach, used by Fuentes-Utrilla et al. [16], is restriction-site associated DNA sequencing (RADseq). They showed that, in a species with no whole-genome assembly (Sitka spruce, Picea sitchensis (Bong.) Carr.), a SNP panel could be constructed from randomly located markers generated by RADseq. However, it must be noted that their GS analysis was performed strictly within a single family.
Heritability
The use of ABLUP instead of GBLUP likely inflates the heritability estimates; however, the impact of heritability on predictive accuracy appears to be low in this study. Our results show that even with modest heritability, predictive accuracies can be high. As shown by the combined-site analyses, the heritabilities for HT12, HT35, and wood density were modest (0.17, 0.17, and 0.43, respectively), yet the prediction accuracies were 0.91 (± 0.004), 0.88 (± 0.006), and 0.83 (± 0.009), respectively (RR-BLUP) (Table 2). The large sample size and low effective population size of the present study (Ne = 21) likely helped to offset the effect of low trait heritability [5]. Märtens et al. [35] provided evidence, in yeast, that increased relatedness between training and validation populations leads to higher prediction accuracy. Furthermore, the correlation between pedigree-based and marker-based breeding values approximates the correlation between the unknown true breeding value and the genomic breeding value. The accuracy can exceed the heritability in this case because both values represent only genetic effects, in line with the results obtained by Gamal El-Dien et al. [12].
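Stated explicitly, as a general property of this validation design rather than a result specific to the present data, the reported accuracy is the correlation between marker-based and pedigree-based breeding values, used as a proxy for the correlation with the unknown true breeding value:

\[
r = \operatorname{cor}\!\left(\widehat{g}_{\mathrm{GEBV}},\, \widehat{g}_{\mathrm{EBV}}\right) \approx \operatorname{cor}\!\left(\widehat{g}_{\mathrm{GEBV}},\, g\right).
\]

Because both quantities are estimates of genetic effects only, this correlation is not bounded by the heritability, unlike the correlation between predictions and raw phenotypes.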
Genomic selection
The desired outcome of genomic selection is to produce unbiased marker effect estimates [5] and to avoid the Beavis effect [36], which hinders MAS by causing marker effects to be overestimated [37]. Instead of selecting markers based on a significance threshold, GS estimates all marker effects simultaneously, which creates a different problem: there are more predictor effects (p, markers in this case) to be estimated than there are observations (n, samples) [3]. Least squares cannot be used to estimate all the effects at once since there are not enough degrees of freedom. In addition, multi-collinearity among markers would cause any such model to be overfitted [3].
To address this issue (large p, small n), various statistical models have been proposed. They generally fall into the following categories: shrinkage models, variable selection models, and kernel methods. Shrinkage models (e.g., ridge regression BLUP [RR-BLUP], Whittaker et al. [38]) fit all marker effects simultaneously, shrinking them all to the same degree. RR-BLUP assumes that the trait in question approximates Fisher's infinitesimal model (many loci of small effect) and that marker effects are drawn from a normal distribution with equal variance. Variable selection methods (e.g., generalized ridge regression [GRR], Shen et al. [39]), in contrast, reduce the number of markers used and in doing so assume that the trait is controlled by fewer loci of large effect [3]. Kernel methods convert the predictor variables into distances, in effect creating a matrix similar to an additive genetic relationship matrix; the kernel matrix quantifies the distance between individuals and incorporates smoothing parameters [3]. This approach is flexible and can accommodate complex relationships between markers, making it useful where non-additive effects are suspected [40]. Indeed, Douglas-fir height and wood density have been shown to have a non-additive genetic variance component [41]; however, its unpredictability has driven the species' advanced-generation breeding and selection methods to rely on additive genetic gain [41, 42, 43]. Since different traits have differing genetic architectures, no single model is necessarily the best for all traits or populations [3].
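For reference, the standard RR-BLUP mixed model (a textbook formulation, not a restatement of the exact model fitted here) can be written as

\[
\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{m} + \mathbf{e},
\qquad
\mathbf{m} \sim N(\mathbf{0}, \mathbf{I}\sigma^{2}_{m}),
\qquad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^{2}_{e}),
\]

where \(\mathbf{y}\) contains the response values (here EBVs or DEBVs), \(\mathbf{Z}\) is the \(n \times p\) marker incidence matrix, and \(\mathbf{m}\) holds the \(p\) marker effects. All effects receive the same shrinkage through the ridge parameter \(\lambda = \sigma^{2}_{e}/\sigma^{2}_{m}\), giving \(\widehat{\mathbf{m}} = (\mathbf{Z}^{\top}\mathbf{Z} + \lambda\mathbf{I})^{-1}\mathbf{Z}^{\top}(\mathbf{y} - \mathbf{1}\widehat{\mu})\). GRR differs in that each marker is assigned its own shrinkage parameter, which effectively performs variable selection.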
In the present study, we assessed the prediction accuracies of two GS models (RR-BLUP and GRR) in two phases: first when trained on EBVs, and second when trained on DEBVs. In the first instance, using EBVs and by virtue of retaining family means, our data contained both contemporary and historical pedigree information as well as marker-QTL LD information. Without further adjustment, all this information was passed into the GS models and resulted in high prediction accuracies (for both model types). Subsequently, in phase two, we deregressed the EBVs, removing the parent average effect in order to disassociate the pedigree information from the marker-QTL LD. This resulted in DEBVs that contained the marker-QTL LD information only, without the between-family genetic variance. Using these DEBVs to train the GS models, we obtained prediction accuracies that for all practical purposes were 0.0. The success of the first phase can be attributed to the large within- and among-family genetic variances. This pedigree-driven approach dominated the analysis and produced the observed high prediction accuracies, which were comparable to the ABLUP accuracies. When this influence of pedigree was removed, the resulting models were no longer able to provide useful predictions. A similar effect was seen in interior spruce (Picea glauca x Picea engelmannii), where cross-validation using family folding resulted in decreased prediction accuracy because between-family variance was dominating the analysis (Gamal El-Dien et al. 2017, unpublished). In addition, Fuentes-Utrilla et al. [16], studying GS in Sitka spruce, found that within-family predictions cannot be extrapolated to between families. Furthermore, when examining marker transferability between families in white spruce, Beaulieu et al. [44] also found within-family predictions to be more precise than between-family predictions. Similarly to Gamal El-Dien et al. (2017, unpublished), they found that when a family had no representation in the training group, the accuracies obtained for that family were very small, occasionally negative, and often not statistically different from zero. This is much like using the DEBVs here, where the family means have been stripped away, leaving the models without between-family variance to exploit. Fuentes-Utrilla et al. [16] concluded that species with large effective population sizes (notably conifers) have a reduced ability to make predictions across families. With this in mind, GS may best be employed to produce GEBVs within large full-sib families in conifers, as it captures the Mendelian sampling term.
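As an illustration of the deregression step, one common formulation in the GS literature (a sketch only; the exact weighting applied in the present analysis may differ) removes the parent average (PA) from each EBV and rescales by the reliability of the resulting Mendelian sampling deviation:

\[
\mathrm{DEBV}_{i} \approx \frac{\mathrm{EBV}_{i} - \mathrm{PA}_{i}}{r^{2}_{\mathrm{MS},i}},
\qquad
\mathrm{PA}_{i} = \tfrac{1}{2}\left(\mathrm{EBV}_{\mathrm{sire}(i)} + \mathrm{EBV}_{\mathrm{dam}(i)}\right),
\]

where \(r^{2}_{\mathrm{MS},i}\) is the reliability of the Mendelian sampling term for individual \(i\). The resulting DEBVs retain the within-family deviations and any marker-QTL LD signal, but not the between-family means.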
The discrepancy between the results of the two GS phases indicates that we were simply unable to capture the marker-QTL LD with the available SNPs. Whilst other studies have had some success in this area, it is important to note that those investigations focused on tree species with much smaller genomes, for example Resende et al. [7]. Resende et al. [17] make this point in their 2017 study of Eucalyptus: where other studies have failed to produce significant predictions between unrelated populations, this may be due to low marker density, which inhibits the ability to capture short-range LD, with the result that prediction models rely almost entirely on relatedness.
In their study, Resende et al. [7] used 7680 DArT markers in Eucalyptus, which has an estimated genome size of 609 Mbp [7], equivalent to 12.61 markers/Mbp. In contrast, we used 69,951 SNP markers on the Douglas-fir genome, estimated at 18,700 Mbp, giving 3.74 markers/Mbp. Obtaining coverage similar to that of Resende et al. [7] would require approximately 235,800 markers, many more than we currently have; with such a large genome, it is likely that at least this many SNPs, with greater genomic coverage, will be required to capture this so far elusive LD. In a more recent study, also in Eucalyptus, Müller et al. [18] managed to capture some short-range historical LD using 5000–10,000 SNPs, yet they too concluded that genomic prediction in their case was largely driven by relatedness.
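The marker densities quoted above follow directly from the respective genome sizes:

\[
\frac{7680}{609\ \mathrm{Mbp}} \approx 12.61\ \mathrm{markers/Mbp},
\qquad
\frac{69{,}951}{18{,}700\ \mathrm{Mbp}} \approx 3.74\ \mathrm{markers/Mbp},
\]
\[
12.61\ \mathrm{markers/Mbp} \times 18{,}700\ \mathrm{Mbp} \approx 235{,}800\ \mathrm{markers}.
\]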
The similar predictions given by the RR-BLUP and GRR methods, and the similarly high ABLUP accuracies, were again the result of heavy reliance on between-family variance; thus we gained no new information on the genetic control or architecture of the traits in question. Similar findings have, however, been noted in more successful GS studies. For example, Resende et al. [8], when comparing GS methods, found little difference between the predictive abilities of shrinkage and variable selection methods (four in total), even though they have different a priori assumptions. They used a Pinus taeda training population of 951 individuals and 4853 SNPs. Prediction accuracies for 17 traits (including growth, disease resistance, and development) ranged from 0.17 to 0.51. Only one trait (fusiform rust resistance) showed a significant difference between the models, with higher prediction accuracies obtained using variable selection methods; this reflects the genetic architecture of the trait, which is controlled by few loci of large effect [8].
Single and multi-site cross-validation
The combined-site GS analysis produced, on average, higher prediction accuracies than the single-site analyses (Fig. 2a and b), the combined-site training population having more individuals than the single-site populations. This is in agreement with what the literature states should happen; Grattapaglia [2] noted that increasing the training population size increases accuracy up to a point (around 1000 individuals). In addition to the larger sample size, the multi-site approach incorporated the GxE present into the model, resulting in further improvement in predictive accuracy.
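To make the cross-validation procedure concrete, the sketch below (a minimal illustration with simulated data, not the analysis pipeline used in this study; the closed-form ridge solution, the random fold structure, and all parameter values are assumptions) estimates marker effects on the training folds and reports the mean Pearson correlation between predicted GEBVs and the response in the held-out folds.

```python
import numpy as np

def ridge_marker_effects(Z, y, lam):
    """Closed-form ridge solution for marker effects (RR-BLUP-like).
    Z: n x p matrix of centred marker codes; y: response vector (e.g., EBVs)."""
    p = Z.shape[1]
    mu = y.mean()
    m_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - mu))
    return mu, m_hat

def cv_accuracy(Z, y, lam=1.0, k=10, seed=42):
    """k-fold cross-validation: Pearson correlation between predicted GEBVs
    and the response in each held-out fold, averaged over folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        mu, m_hat = ridge_marker_effects(Z[train], y[train], lam)
        gebv = mu + Z[fold] @ m_hat
        accs.append(np.corrcoef(gebv, y[fold])[0, 1])
    return float(np.mean(accs)), float(np.std(accs) / np.sqrt(k))

# Illustrative run on simulated genotypes (dimensions are arbitrary)
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 500, 2000
    Z = rng.integers(0, 3, size=(n, p)).astype(float)
    Z -= Z.mean(axis=0)                      # centre the 0/1/2 marker codes
    y = Z @ rng.normal(0.0, 0.05, p) + rng.normal(0.0, 1.0, n)
    print(cv_accuracy(Z, y))
```

Randomly assigned folds, as above, leave relatives of each validation tree in the training set; folding by family instead removes that relatedness signal, which is the contrast discussed below.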
The prediction accuracies are high for the HT12, HT35, and WDres GEBVs (0.87–0.92, 0.79–0.92, and 0.78–0.88, respectively) (Fig. 2a and b). They are moderately higher than those reported in previous studies of other forest tree species [7, 8, 10, 11, 15], largely due to the inclusion of the pedigree structure. In this case the GS methods offer little advantage over ABLUP (ABLUP multi-site cross-validation accuracies: 0.88 ± 0.002, 0.86 ± 0.003, and 0.84 ± 0.003 for HT12, HT35, and WDres, respectively); both are effectively predicting only family means. The prediction accuracies for both the GS models and the ABLUP model are much higher than the theoretical accuracies of the EBVs (0.61, 0.68, and 0.76 for HT12, HT35, and WDres, respectively), which indicates that both the EBVs and GEBVs are converging on the family means and are far from the true breeding values.
The high prediction accuracies for the GEBVs may also be partially the result of using a relatively large training population, which is known to correlate with accuracy. In addition, there is an interacting effect with the relatively low effective population size (Ne = 21); both of these data characteristics increase the accuracy of predictions [3]. However, in this case the low effective population/family size also meant that the Mendelian sampling term could not be estimated accurately. Indeed, Gamal El-Dien et al. [12], using an interior spruce population with Ne = 93.8 (estimated assuming the OP families are unrelated and contributed equally to the experiment), obtained lower prediction accuracies than those reported here, which were in turn substantially higher than the results from a study of 214 open-pollinated white spruce (Picea glauca) families in Quebec with Ne = 622.5 [9]. Another factor contributing to the increased accuracy of predictions was the training of the models on EBVs rather than raw phenotypes [12].
Prediction accuracies fell dramatically for models trained on DEBVs, as marker-QTL LD could not be recovered using the available SNP data set, indicating that the SNP markers effectively tracked the pedigree.
Cross-site validation
Similar prediction accuracies were observed in the cross-site analysis compared to the single-site and combined-site GS analyses (with models trained on EBVs). Genotype x environment interaction is an important consideration in forest tree selective breeding, more so than in animal breeding, where individuals are considered to share a common environment [45, 46, 48]. Prediction accuracy for HT12 across sites was relatively high, an unexpected result given that the heritability was only 0.17 (combined sites) and that a competition effect is expected during early stand development (i.e., a strong environmental component). This possibly reflects the partially controlled experimental environment (not natural stands) in which the trees were growing. Given these experimental conditions, it is also conceivable that GxE effects are minimal, which is reflected in the cross-site predictions (Fig. 2a and b) being of a similar magnitude to the within-site prediction accuracies for all attributes.
Time-time and trait-trait correlations
It is thought that forward selection in Douglas-fir should be carried out at a minimum age of 17 years [49], when accurate predictions of the phenotype at harvest can be made. We tested the correlation between height at 12 years and height at 35 years (HT12-HT35), and between height at 12 years and wood density at 38 years (HT12-WDres38); positive (0.71 for height) and negative (−0.46 for wood density) correlations were detected by the GS models trained on EBV data. These models did not offer accuracy above that of ABLUP, indicative of a strong reliance on pedigree information rather than marker-QTL LD. The results nonetheless give an indication of how useful early selection could be: correlations were moderately strong for height and low to moderate for wood density. Marker-trait associations are known to vary with tree age, limiting any correlation, and this would certainly hamper efforts to create a highly correlated age-age model in trees [9, 50]. At present, providing such predictions at an age younger than 12 years would not be recommended (note that height at age 12 was the earliest measurement available in the present study), since larger age differences have been shown to produce less accurate models [51]. The discrepancy in prediction accuracy between the time-time correlations and the cross-validations suggests some inconsistency between EBVs and marker effects at these two ages (12 and 35 years). The discrepancy is relatively small, however, so meaningful results may still be obtained through age-age and trait-trait correlations. This is in addition to the varied environmental conditions the trees endure over their long lifespans, which weaken time-time correlations. To this effect, and as White et al. [52] suggested, training models will need to be updated with mid-rotation phenotypes in order to accurately predict mature trait values.
Genomic selection and forest tree breeding
Several genomic selection “proof-of-concept” studies have been conducted on a few coniferous tree species (e.g., loblolly pine, maritime pine, Sitka spruce, white spruce, and white-Engelmann spruce hybrid), all concluding that GS has the potential to increase genetic gain by speeding up breeding generation turnover and increasing the selection differential. It should be stated that, with the exception of the maritime pine study [14], none of the derived GS predictive models have been validated on independent “validation” populations. The success of GS depends on the linkage disequilibrium between the markers used (i.e., SNP panels) and the causal genes underpinning the traits of interest, and on the degree of relatedness between the training and validation populations. Caution is therefore required during GS implementation, as LD changes after every round of breeding (i.e., recombination); its rapid decay calls for dense marker coverage to overcome this caveat. Even so, we found that using sequence capture data alone we were unable to resolve this marker-QTL LD, and only succeeded in capturing among-family effects (i.e., pedigree). Even with a SNP panel designed appropriately from an informative SNP library, and large enough to handle a conifer genome, there are additional hurdles. LD survives over relatively short distances in conifers compared to livestock species because of their relatively large Ne [53]. This led Fuentes-Utrilla et al. [16] to conclude that GS may only be useful in tree populations with reduced Ne, for example seed orchards or lines/sub-groups produced through selective breeding. As they demonstrate with their analysis, however, it is possible to generate very large full-sib families in trees by controlled crossing. In this type of population, LD extends over longer distances than in open-pollinated populations, and as a result they suggest that GS could be employed to make selections within families.
Based on comparable studies, 1) a greater number of markers and 2) wider coverage throughout the whole genome, or a dense unordered SNP genotyping platform (e.g., GBS), would be needed to capture this LD and the additional intergenic variation [17]; alternatively, the level at which GS is applied could be shifted, i.e., to within families rather than across families [16]. Whilst GS still has the potential to deliver unprecedented gains, this was not achieved in the present study, as the driving force of prediction was the pedigree rather than LD. Despite the low Ne, the small family size prevented the accurate assessment of the Mendelian sampling term in this population; the EBVs were therefore heavily shrunk toward the family mean and within-family deviations were not estimated precisely.
Finally, it should be emphasized that any gains captured through GS will require unique tree improvement delivery methods, as the traditional seed orchard production mode requires time to reach sexual maturity, even under intensive management such as top grafting or hormonal applications, and its sexual production effectively breaks the LD between selected traits and markers.