Genotyping by sequencing for genomic prediction in a soybean breeding population

Background Advances in genotyping technology, such as genotyping by sequencing (GBS), are making genomic prediction more attractive to reduce breeding cycle times and costs associated with phenotyping. Genomic prediction and selection has been studied in several crop species, but no reports exist in soybean. The objectives of this study were (i) evaluate prospects for genomic selection using GBS in a typical soybean breeding program and (ii) evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean lines sampled from the University of Nebraska Soybean Breeding Program were genotyped using GBS and evaluated for yield and other agronomic traits at multiple Nebraska locations. Results Genotyping by sequencing scored 16,502 single nucleotide polymorphisms (SNPs) with minor-allele frequency (MAF) > 0.05 and percentage of missing values ≤ 5% on 301 elite soybean breeding lines. When SNPs with up to 80% missing values were included, 52,349 SNPs were scored. Prediction accuracy for grain yield, assessed using cross validation, was estimated to be 0.64, indicating good potential for using genomic selection for grain yield in soybean. Filtering SNPs based on missing data percentage had little to no effect on prediction accuracy, especially when random forest imputation was used to impute missing values. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy. The effect of training population size on accuracy began to plateau around 100, but accuracy steadily climbed until the largest possible size was used in this analysis. Including only SNPs with MAF > 0.30 provided higher accuracies when training populations were smaller. Conclusions Using GBS for genomic prediction in soybean holds good potential to expedite genetic gain. Our results suggest that standard additive G-BLUP models can be used on unfiltered, imputed GBS data without loss in accuracy.


Background
Marker-assisted selection (MAS) has played an important role in soybean breeding, particularly for traits that are difficult to evaluate phenotypically such as soybean cyst nematode (SCN) resistance. An early demonstration of successful MAS for SCN resistance allowed accurate identification of resistant lines using microsatellite markers [1]. Use of MAS for improving grain yield, however, has been met with limited success in soybean. A host of QTL mapping studies have reported QTL for grain yield in exotic soybean populations [2][3][4][5], but introgression of yield QTL has not been consistent across different genetic backgrounds [6]. Moreover, the QTL mappingintrogression approach is difficult to justify unless large effect QTL alleles are identified, which is rarely the case for grain yield.
Sebastian et al. [7] reported on a MAS approach that involved the sub-lining of existing soybean elite cultivars derived from single F3 or F4 plants. The authors called this approach context-specific MAS. Essentially, MAS is performed within narrow populations (i.e., elite cultivars with residual heterogeneity) with the goal of obtaining more precise estimates of genetic value in early field trials consisting of only a single replication. Lines were selected and advanced for further testing on the basis of marker scores calculated using significant marker effects estimated within populations. Significant superiority in grain yield for some of the selected sublines, relative to their "mother lines" (i.e., the elite cultivars with residual heterogeneity), was demonstrated using this approach [7].
This common-sense approach is ideal for increasing accuracy of preliminary yield tests: marker effects are estimated more accurately because marker alleles are highly replicated across individuals comprising a large population, whereas phenotypic values are estimated using only a single observation. Marker effects, however, can only be applied for calculating marker scores within a single biparental population and therefore, the current generation. It would be desirable to predict breeding values after one or more generations of recombination and selection in order to facilitate rapid cycling of parents. Furthermore, pooling genotypic and phenotypic information across populations could allow for more populations to be evaluated for grain yield within the same field space. Fewer individuals could be phenotypically evaluated in each biparental population and the breeding value of remaining (non-evaluated) individuals could be predicted using markers. Current trends strongly indicate plant breeding programs will be limited by phenotyping capacity, not genotyping capacity, thus increasing the attractiveness of this strategy through time.
Genomic selection (GS) has become the predominant method of applying molecular markers for selection of complex traits in plant breeding programs [8]. Briefly, genomic selection entails building a prediction model through associating marker information with phenotypic information in a "model training" step. Individuals that have been genotyped and phenotyped comprise the "training population" or "calibration set". The prediction model is applied to a set of selection candidates that have been genotyped but not evaluated phenotypically. The primary difference between GS and traditional forms of MAS is that QTL mapping is not performed and markers are not chosen for inclusion in the model based on a statistical analysis, but rather all marker information is used simultaneously. The types of models used to deal with the "large p, small n" problem created by the genomic approach to prediction have been reviewed and compared elsewhere [8][9][10].
Dramatic advances in sequencing technologies are providing highly dimensional molecular marker information at low cost. Genotyping by sequencing [11] is a method well described by its name: polymorphisms are scored using next-generation sequencing technologies followed by a bioinformatics pipeline. The advantage of GBS is that it reduces cost through an enzyme-based genomic complexity reduction step and the use of barcoded adapters for multiplexing [12]. Genotyping by sequencing has been applied to investigations of genetic diversity in maize [13] as well as to studies on GS [14][15][16]. Working in soybean, Sonah et al. [17] developed a novel GBS protocol and reliably called 10,120 highquality SNPs among eight diverse lines. These authors called high-quality SNPs displaying only a small percentage of missing data, whereas many applications of GBS [16] have tolerated SNPs with very high frequencies of missing data, sometimes up to 80%.
Given this high rate of missing values in GBS data, imputation of marker scores is typically performed. The best imputation method, and whether imputated GBS data provides better predictions than simply selecting SNPs with low rates of missing data, however, is not known. Rutkoski et al. [18] showed a slight advantage to using imputation when markers were used with high missing data rates. In maize, however, Crossa et al. [16] failed to find any improvement in prediction accuracy using a haplotype-based imputation method on GBS data. Poland et al. [14] showed that a random forest imputation method provided the most accurate imputations, but the effect on genomic prediction accuracy was not significant.
A large number of studies on GS in multiple crops has been reported [2,[19][20][21][22]. A study on GS in soybean, however, has not. Moreover, there are only a few reports on the use of GBS for GS [14][15][16]. In light of the current research on GS and the dearth of reported research on GS and GBS in soybean breeding, the objectives of this study were (i) evaluate prospects for GS using GBS in a typical soybean breeding program and (ii) evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean experimental lines sampled from the University of Nebraska-Lincoln Soybean Breeding Program were genotyped using GBS and evaluated for grain yield and other agronomic traits at multiple Nebraska locations. Reported findings are important to the application of GBS to selection for grain yield in future soybean breeding efforts.

Germplasm and phenotyping
Three hundred and one soybean experimental lines currently in advanced stages of the University of Nebraska-Lincoln Soybean Breeding Program were sampled. Two hundred and seventy-five lines were in the F 5:8 generation and twenty-six lines were in the F 4:7 generation. Soybean lines belonged to maturity groups I (N = 64), II (N = 213), and III (N = 24) ( Table 1) and represent 34 biparental families ranging in size from one to 28 lines per family. Median family size was 8.
During the summer of 2011, soybean lines were grown in two-row plots (0.76 m apart, 2.9 m long) seeded to a density of 26 seeds per square meter. Plots were arranged in an augmented incomplete block design with two replications. Blocks consisted of 27 -39 experimental entries and three check cultivars. Lines belonging to maturity groups I and II were evaluated at the Nebraska locations Beemer, Phillips, Cotesfield, and Mead. Lines belong to maturity group III were evaluated at the Nebraska locations Phillips, Mead, Lincoln, and Clay Center (Table 1). Grain yield (GY; Mg ha −1 ) was measured at all locations, plant height (PH; cm) was measured only at Mead, and days to maturity (MD) was measured at Beemer, Phillips and Mead. Grain yield was recorded as machine harvestable grain yield adjusted to 13% moisture. Plant height was measured as the distance (in centimeters) between the surface of the soil and the main-stem apical meristem. Days to maturity was defined as the number of days from planting until the R8 stage when 95% of the pods were mature and brown in color.
Phenotypes were adjusted to remove location and block effects according to the model: where g i represents the effect of the i th genotype (i.e., soybean line), e j represents the effect of the j th location, r k(j) represents the effect of the k th replicate nested in location j, b l(k) represents the effect of the l th incomplete block nested within replicate k, and. Best linear unbiased estimates were calculated for soybean lines and input into the genomic prediction models described below. The genotype effect was also treated as random in order to estimate variance components for the purpose of estimating heritability. Broad-sense heritability (H 2 ) on an entry-mean basis was calculated as where σ 2 G is the variance among soybean lines, σ 2 GE is the genotype-by-environment interaction variance, σ 2 ε is the residual variation, and e and r in this context are the number of environements and replications within environments, respectively.

Genotyping
Leaf discs were collected from 12 random plants of each soybean line at approximately the V6 growth stage. DNA isolation was performed using the Qiagen DNeasy Plant 96 kit. Samples were sent to the Institute of Genomic Diversity at Cornell University for genotyping by sequencing as described in [11] and at www.maizegenetics. net/gbs-overview. Briefly, DNA samples were digested with the ApeKI restriction enzyme followed by ligation of adapters to fragment ends. Adapters consisted of Illumina sequencing primers and a barcode adapter. After adapter ligation, samples are combined into pools consisting of 96 samples. A PCR amplification is carried out to create the GBS libraries, which are submitted to a single Illumina HiSeq2000 flow cell for sequencing. Four sequenced libraries produced on average 247,255,883 reads, of which on average 219,580,690 were good, barcoded reads.

Genomic prediction models
Base pair calls contained in the hapmap file obtained from the GBS analysis pipeline were converted to numerical genotype scores, x ∈ {0, 1, 2}, where x is the number of copies of the major allele.
Two genomic prediction models were studied: a standard G-BLUP model including only additive effects, and an extended version of the G-BLUP model also including additive-by-additive effects. Two different formulations of additive-by-additive effects have been presented in the literature [23,24] and both of them were considered.
The standard G-BLUP model including additive effects only is where y i represents the phenotype of the i th line, μ represents the intercept, g i represents the additive genetic value, and e i represents the residual. The additive genetic value can be estimated as g i ¼ X p j¼1 x ij b j , where x ij is the genotype score at the j th (j = 1,…,p) locus in the i th (i = 1,…,n) line, and b j is the allelic substitution effect (marker effect) at the j th marker locus. Marker effects were considered as IID random variables from a normal distributions such that b j e . From properties of the multivariate normal distribution the vector g = Xb, (g = [g 1 , …, g n ]′), follows a multivariate normal distribution with null mean and covariance matrix Cov Hereafter, we centered and standardized genotype scores by dividing by , where θ j is the estimated allele frequency at the j th marker locus. The G matrix is a genomic realized relationships matrix whose entries are given Additive-by-additive effects were modeled using two different covariance structures among lines. Several authors [25,26] modeled additive-by-additive epistasis through a G ∘ G matrix following Cockerham (1954) and Kempthorne (1954), where ∘ represents the Hadamard, or element-wise, multiplication operation. The first model including additive-by-additive epistasis was with gg e N 0;G∘Gσ 2 gg and ε i e More recently, Xu [24] proposed an alternative way to include these interaction effects using the covariance structure given by Using this assumption a different version of the common epistasis model is given by: with k e N 0;Kσ 2 aa À Á and ε i e IID N 0; σ 2 ε À Á . Modeling additive and additive-by-additive components was conducted to assess the proportion of the phenotypic variance accounted for by these effects and improvements in accuracy of genomic prediction. By combining models (1) and (2), a model including additive and epistatic effects was formulated: with g e N 0;Gσ 2 g , gg e N 0;G∘Gσ 2 gg and ε i e IID N 0; σ 2 ε À Á . The alternative additive and additive-by-additive model following Xu [24] was built combining models (1) and (3): with g e N 0;Gσ 2 g , k e N 0;Kσ 2 aa À Á and ε i e IID N 0;σ 2 ε À Á . Models (1)-(5) were fitted to the full data set using computational methods described in [27]. All the statistical analyses were implemented in the R statistical software [28].

Marker imputation
Genotyping-by-sequencing data sets typically have high rates of missing data [14,16]. Three imputation methods were considered to impute missing values of the soybean GBS data set. (i) Naïve imputation substitutes missing values at each locus with 2θ j , where θ j is the estimated frequency of the major allele at the j th locus. This method is not expected to add information, but rather serves the purpose of ensuring unchanged allele frequencies after imputation and provides a marker matrix containing no missing data so that analytical operations can be performed. (ii) Random forest imputation is based on random forest regression introduced by Breiman [29]. Marker imputation for this study was performed using the MissForest R package according to [18]. The algorithm was performed chromosome-wise and for each PMV and MAF combination in parallel. (iii) FILLIN (HI) is a novel imputation method based on haplotypes, which is implemented in TASSEL 5.0. Default settings were used with the exception of maximum heterozygosity, which was set to 0.30 using the option -mxHet. Detailed information can be found in the TAS-SEL 5.0 User Guide at www.maizegenetics.net.

Varying factors affecting genomic prediction accuracy
To evaluate the effects of GBS marker selection and imputation methods on genomic prediction accuracy, two criteria for filtering SNPs were considered. Filtering of GBS SNPs was done sequentially, first filtering based on percent missing values (PMV), and then, for minor-allele frequency (MAF). Levels for PMV (27) and MAF (12) were ( combinations of PMV and MAF. After filtering, remaining missing values were imputed using each of the three imputation methods described above. This produced 972 marker datasets (e.g., 27 PMV levels × 12 MAF levels × 3 imputation methods).
All comparisons were made on the basis of the correlation between the observed phenotype and the predicted breeding value, which is referred to as predictive ability, following [30]. We reserve the term prediction accuracy (r gp ) for the correlation between the prediction and the true breeding value. Prediction accuracy can be approximated by dividing predictive ability by ffiffiffiffiffiffi H 2 p [8,31]. Predictive ability ( rĝ p ) of each marker filtering criteria was evaluated using 10-fold cross validation replicated 200 times. The mean predictive ability across the 200 replicates was calculated and bootstrap confidence intervals. The impact of training population size on prediction accuracy was evaluated using a validation set comprised of 50 randomly selected lines and training sets of variable sizes. The validation set was formed by randomly sampling 50 lines without replacement. From the remaining 251 lines, the training population of size n was formed sequentially such that its size ranged between 2 and 251. First, two lines were sampled (i.e., n = 2) without replacement, then, from the remaining 251-n lines, additional lines were incorporated to the training set, by increments of one. Once a line was sampled, it remained in the training set. The validation set was held constant with the initial 50 lines. Two GBS marker subsets were used to evaluate training population size effect: 1) PMV ≤ 5% and MAF > 0.05; and 2) PMV ≤ 80% and MAF > 0.3. This procedure was repeated 1000 times and accuracies at each training population size were averaged across replicates.

Results
A total of 5,770,366 unique 64-bp sequence tags were identified across all four soybean libraries, of which 68.75% aligned uniquely to the reference genome, 11.32% aligned to multiple positions and 19.92% could not be aligned. The mean (median) sequencing depth per SNP locus was 11 (6), with mean (median) proportion "missingness" per SNP locus of 0.18 (0.08).
Unique tag counts and SNP density were higher towards the chromosome ends compared to pericentromeric regions ( Figure 1). The GBS protocol targets gene-rich regions, such as the chromosome ends, through the use of methylation sensitive restriction enzymes. Related to the distribution of uniqe tag counts, percent missing values were lower towards chromosome ends and higher towards the pericentromeric regions. There was no apparent pattern regarding MAF with the exception of some chromosomes harboring more diversity than others (e.g., chromosomes 11 and 20 versus chromosomes 15 and 18) (Figure 1). The number of SNPs remaining after filtering by MAF and PMV is shown in Figure 2. The number of SNPs available with cuttoff values of PMV ≤ 80% and MAF > 0.05 was 52,349. There were 16,502 SNPs with PMV ≤ 5% and MAF > 0.05.
The high quality of the phenotypic data was reflected with relatively high heritabilities of 0.69 for GY and 0.94 for MD ( Table 2). As expected, the genotype-byenvironment interaction source of variation was greater for GY compared to MD. The overall rĝ p of G-BLUP using a SNP dataset with cuttoff values of PMV ≤ 80% and MAF ≥ 0.05, and the Naïve imputation method, was 0.565 for GY, 0.374 for PH, and 0.644 for MD. Prediction accuracy estimates for GY, PH, and MD were, 0.64, 0.42 and 0.65, respectively.
The effect of SNP filtering on rĝ p was assessed by building a series of G-BLUP models using SNP datasets created by applying combinations of MAF and PMV filtering criteria. Number of SNPs is quickly reduced as SNPs are filtered based on MAF and PMV ( Figure 2). Overall, marker filtering criteria did not have a large effect on rĝ p for GY, but some important effects were observed for PH and MD (Figure 3). For PH, rĝ p was greater when markers with MAF between 0.08 and 0.10 were used compared to all other MAF cutoffs. When considering jointly both filtering criteria, the rĝ p of a trait were maximized at different combinations between PMV and MAF. For GY the maximum rĝ p (0.59) was obtained with a marker dataset including SNPs with up to 80 PMV and MAF greater than 0.30. For PH and MD, rĝ p was maximized when only SNPs with lower PMV were included (Figure 3).

Imputation
No significant differences in rĝ p were found among the imputation methods (Figure 4). When Naïve imputation was used, rĝ p was slightly reduced by including SNPs with high levels of PMV. When random forest imputation was used, however, rĝ p was maximized when all SNPs were included in the model. A random forest imputation with 80 PMV provided the highest rĝ p overall ( Figure 4). The random forest method provided numerically higher rĝ p than the HI method at high PMV levels, but differences were not statistically significant.

Model comparison
Contribution of polygenic additive-by-additive epistatic interactions to total phenotypic variation was assessed by constructing epistatic relationship matrices using the Hadamard product of the additive relationship matrix [23], as well as the marker-generated additive-byadditive relationship matrix as described by Xu [24]. For these comparisons, a marker set including SNPs with PMV ≤ 80% and MAF > 0.05 was used. Missing values were imputed using the Naïve method.
The percentages of phenotypic variation accounted for by each model term varied across traits. For GY, the realized additive relationship matrix captured 91.2% of total phenotypic variance. Since the G°G and K aa matrices are highly collinear with G (data not shown), the epistatic relationship matrices accounted for similar amounts of variation (Table 3). When combining both additive and epistatic effects into the same model, the epistatic effects account for variable amounts of phenotypic variation. Nevertheless, the percentage of residual variation is fairly constant ( Table 3), indicating that including an additive-by-additive epistasis relationship matrix provides no improvement over standard additive G-BLUP models. This was also observed using the cross-validation approach to evaluate rĝ p . No difference was observed among the models regarding rĝ p (Table 4).

Training population size
For a set of SNPs with PMV ≤5% and MAF > 0.05, rĝ p plateaued around a training population size just greater than 100 ( Figure 5). Predictive ability, however, did steadily increase up until the maximum training population size possible in the cross validation strategy. Predictive ability was improved at lower TP sizes only when a MAF > 0.30 was used to construct G. For example, using a MAF > 0.05 and TP size of 50, rĝ p was only 0.28, Table 2 Summary of phenotypic data analysis for grain yield (GY), plant height (PH) and days to maturity (MD)  but when a MAF > 0.30 was used, rĝ p was increased to 0.47. A TP consisting of at least 100 individuals was required to reach this rĝ p using SNPs with MAF > 0.05 ( Figure 5).

Discussion
The results of this study suggest that the use of GBS for genomic selection holds good potential for improving soybean grain yield. Using a cross-validation approach, genomic predictions explained~32% of the variation in yield phenotypes. After using the phenotype heritability to correct for random environmental deviations included in the validation phenotypes, approximately 41% of the variation in genotypic values was explained by genomic predictions. Because validation phenotypes (i.e., soybean line means) include both additive and non-additive effects, and genomic predictions using the G-BLUP model include only additive effects, this estimated prediction accuracy is likely biased downward. In order to quantify the relative benefit of genomic selection over phenotypic selection, Technow et al. [32] rearranged the formula for relative response to indirect selection to obtain the inequality L Y < r A H X L X , where L Y is the cycle length of genomic selection, r A is the genomic prediction accuracy, H X is the phenotypic selection accuracy, and L X is the cycle length of phenotypic selection [32]. Substituting the values estimated herein for grain yield into this formula indicates that genomic selection is expected to be superior to phenotypic selection in terms of genetic gain per unit time if the cycle length of genomic selection is less than 77% the cycle length of phenotypic selection. It is entirely possible for a genomic selection cycle to be 66% of a phenotypic selection cycle based on the structure of the University of Nebraska Soybean Breeding Program. On top of this, the above formula assumes equal selection intensities for genomic and phenotypic selection. As genotyping costs continue to decline, selection intensity for genomic prediction could be increased compared to phenotypic selection at equal cost. Before soybean breeding programs incorporate genomic selection on a large scale, these results need to be validated through comparisons of phenotypes and genomic predictions across years, as well as by comparison of progenies from phenotypic and genomic selection programs.
The high genomic prediction accuracy observed was fairly consistent across SNP datasets with differing levels of PMV. More than 16,000 SNPs were scored with less than 5 PMV using GBS, which is surely more than is needed to ensure good SNP-QTL linkage disequilibrium among elite soybean breeding progenies [33]. Little to no sacrifice in accuracy was observed when SNPs with up to 80 PMV were included. It might be desirable to reduce the SNP numbers to ease computational requirements when predicting individual SNP effects and summing effects to calculate genomic predictions. However, more saturated SNP datasets may be more desirable for computing genomic predictions of multi-family selection schemes of more diverse germplasm. The G-BLUP approach is more computationally efficient with computational demands scaling with individual number rather than marker number. Knowing that data filtering steps are not likely needed for using GBS data for genomic prediction reduces the number of optimization steps and simplifies the process.
We failed to find significant differences among imputation methods, including differences between Naïve imputation and the other two which use covariance information between nearby SNPs. While not significantly better, the machine learning, non-parametric method called random forest performed best when SNPs with up to 80 PMV were included. This was also observed by Rutkoski et al. [18], but these authors did not compare random forest imputation with a method using marker order information. We observed that a haplotype-based method, which used marker order information from the soybean physical map, was not superior to random forest imputation. Random Figure 4 Average predictive ability and corresponding 95% bootstrap confidence intervals for multiple levels of percent missing values (PMV) and three imputation methods: Naïve, random forest imputation (RFI), and haplotype-based imputation (HI). Table 3 Percentage of phenotypic variation in grain yield (GY), plant height (PH), and days to maturity (MD) explained by additive and non-additive effects included in models 1 -5 forest has often been used for imputing markers for genomic selection in plant breeding, especially when a reference genome is not available [10,15,18]. A haplotype-based method was also used for GBS data by Crossa et al. [16], but these authors also observed very little to no advantage over Naïve imputation. This general lack of benefit to imputation is likely due to the fact the genomic prediction is robust to missing marker data [18] and the number of markers with relatively low PMV provided by GBS is more than enough to cover the linkage disequilibrium space in crop breeding germplasm. Rather than compare shrinkage models and Bayesian variable selection models for prediction accuracy as has been frequently performed previously (e.g., [10,21]), we compared G-BLUP models including additive effects only against those also including additive-by-additive effects. Additive-by-additive interaction effects were incorporated into the model in the Cockerham-Kempthorn fashion by including a random polygenic interaction effect with a covariance structure specified as the Hadamard product of the additive genomic relationship matrix [34]. This model makes many assumptions, and the soybean population clearly violates the assumptions of linkage equilibrium between loci and randomly mating individuals. Because of this violation, another formulation of the additive-byadditive relationship matrix according to Xu [24] was used. It turned out the K aa matrix calculated according to Xu [24] was highly collinear with the simple Cockerham-Kempthorn Hadamard product and explained similar amounts of phenotypic variation. Neither G ∘ G nor K aa was orthogonal to the G matrix as can be seen by the variance component estimation. Similar amounts of variation were explained when any of these effects were included in the model alone or together. The amount of residual variation was actually slightly smaller when only G was modeled and genomic prediction accuracies were not enhanced by including additive-by-additive effects using either the Cockerham-Kempthorn or Xu [24] formulation.
Low to moderately sized training populations could be used in a soybean breeding program to achieve adequate prediction accuracies ( Figure 5). Although it's probably not necessary to reduce TP sizes down to such a low level, training population sizes could be reduced further if only SNPs with higher MAF are included. The underlying reason for this is not clear. It is possible that SNPs with low MAF are not sampled by chance when small training populations are sampled and phenotyped. If they are not polymorphic in the TP, they cannot contribute information to the relationships between individuals, which is the basis of predictions in G-BLUP. When TP size is increased, SNP alleles with low frequency are more likely to be adequately represented in the TP. When MAF threshold is higher, this problem is reduced by the fact that all SNP alleles have a reasonable chance of contributing to relationship even when TPS sizes are small.

Conclusions
This first look at GBS for genomic prediction in soybean suggests GBS holds good potential to enhance genetic gain in soybean. Over 16,000 SNPs were scored with less than 5% missing data. Filtering markers based on amount of missing data had little to no effect. No differences were observed among imputation methods. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy.