Skip to main content
  • Research article
  • Open access
  • Published:

Genotyping by sequencing for genomic prediction in a soybean breeding population



Advances in genotyping technology, such as genotyping by sequencing (GBS), are making genomic prediction more attractive to reduce breeding cycle times and costs associated with phenotyping. Genomic prediction and selection has been studied in several crop species, but no reports exist in soybean. The objectives of this study were (i) evaluate prospects for genomic selection using GBS in a typical soybean breeding program and (ii) evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean lines sampled from the University of Nebraska Soybean Breeding Program were genotyped using GBS and evaluated for yield and other agronomic traits at multiple Nebraska locations.


Genotyping by sequencing scored 16,502 single nucleotide polymorphisms (SNPs) with minor-allele frequency (MAF) > 0.05 and percentage of missing values ≤ 5% on 301 elite soybean breeding lines. When SNPs with up to 80% missing values were included, 52,349 SNPs were scored. Prediction accuracy for grain yield, assessed using cross validation, was estimated to be 0.64, indicating good potential for using genomic selection for grain yield in soybean. Filtering SNPs based on missing data percentage had little to no effect on prediction accuracy, especially when random forest imputation was used to impute missing values. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy. The effect of training population size on accuracy began to plateau around 100, but accuracy steadily climbed until the largest possible size was used in this analysis. Including only SNPs with MAF > 0.30 provided higher accuracies when training populations were smaller.


Using GBS for genomic prediction in soybean holds good potential to expedite genetic gain. Our results suggest that standard additive G-BLUP models can be used on unfiltered, imputed GBS data without loss in accuracy.


Marker-assisted selection (MAS) has played an important role in soybean breeding, particularly for traits that are difficult to evaluate phenotypically such as soybean cyst nematode (SCN) resistance. An early demonstration of successful MAS for SCN resistance allowed accurate identification of resistant lines using microsatellite markers [1]. Use of MAS for improving grain yield, however, has been met with limited success in soybean. A host of QTL mapping studies have reported QTL for grain yield in exotic soybean populations [25], but introgression of yield QTL has not been consistent across different genetic backgrounds [6]. Moreover, the QTL mapping – introgression approach is difficult to justify unless large effect QTL alleles are identified, which is rarely the case for grain yield.

Sebastian et al. [7] reported on a MAS approach that involved the sub-lining of existing soybean elite cultivars derived from single F3 or F4 plants. The authors called this approach context-specific MAS. Essentially, MAS is performed within narrow populations (i.e., elite cultivars with residual heterogeneity) with the goal of obtaining more precise estimates of genetic value in early field trials consisting of only a single replication. Lines were selected and advanced for further testing on the basis of marker scores calculated using significant marker effects estimated within populations. Significant superiority in grain yield for some of the selected sublines, relative to their “mother lines” (i.e., the elite cultivars with residual heterogeneity), was demonstrated using this approach [7].

This common-sense approach is ideal for increasing accuracy of preliminary yield tests: marker effects are estimated more accurately because marker alleles are highly replicated across individuals comprising a large population, whereas phenotypic values are estimated using only a single observation. Marker effects, however, can only be applied for calculating marker scores within a single bi-parental population and therefore, the current generation. It would be desirable to predict breeding values after one or more generations of recombination and selection in order to facilitate rapid cycling of parents. Furthermore, pooling genotypic and phenotypic information across populations could allow for more populations to be evaluated for grain yield within the same field space. Fewer individuals could be phenotypically evaluated in each biparental population and the breeding value of remaining (non-evaluated) individuals could be predicted using markers. Current trends strongly indicate plant breeding programs will be limited by phenotyping capacity, not genotyping capacity, thus increasing the attractiveness of this strategy through time.

Genomic selection (GS) has become the predominant method of applying molecular markers for selection of complex traits in plant breeding programs [8]. Briefly, genomic selection entails building a prediction model through associating marker information with phenotypic information in a “model training” step. Individuals that have been genotyped and phenotyped comprise the “training population” or “calibration set”. The prediction model is applied to a set of selection candidates that have been genotyped but not evaluated phenotypically. The primary difference between GS and traditional forms of MAS is that QTL mapping is not performed and markers are not chosen for inclusion in the model based on a statistical analysis, but rather all marker information is used simultaneously. The types of models used to deal with the “large p, small n” problem created by the genomic approach to prediction have been reviewed and compared elsewhere [810].

Dramatic advances in sequencing technologies are providing highly dimensional molecular marker information at low cost. Genotyping by sequencing [11] is a method well described by its name: polymorphisms are scored using next-generation sequencing technologies followed by a bioinformatics pipeline. The advantage of GBS is that it reduces cost through an enzyme-based genomic complexity reduction step and the use of barcoded adapters for multiplexing [12]. Genotyping by sequencing has been applied to investigations of genetic diversity in maize [13] as well as to studies on GS [1416]. Working in soybean, Sonah et al. [17] developed a novel GBS protocol and reliably called 10,120 high-quality SNPs among eight diverse lines. These authors called high-quality SNPs displaying only a small percentage of missing data, whereas many applications of GBS [16] have tolerated SNPs with very high frequencies of missing data, sometimes up to 80%.

Given this high rate of missing values in GBS data, imputation of marker scores is typically performed. The best imputation method, and whether imputated GBS data provides better predictions than simply selecting SNPs with low rates of missing data, however, is not known. Rutkoski et al. [18] showed a slight advantage to using imputation when markers were used with high missing data rates. In maize, however, Crossa et al. [16] failed to find any improvement in prediction accuracy using a haplotype-based imputation method on GBS data. Poland et al. [14] showed that a random forest imputation method provided the most accurate imputations, but the effect on genomic prediction accuracy was not significant.

A large number of studies on GS in multiple crops has been reported [2, 1922]. A study on GS in soybean, however, has not. Moreover, there are only a few reports on the use of GBS for GS [1416]. In light of the current research on GS and the dearth of reported research on GS and GBS in soybean breeding, the objectives of this study were (i) evaluate prospects for GS using GBS in a typical soybean breeding program and (ii) evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean experimental lines sampled from the University of Nebraska-Lincoln Soybean Breeding Program were genotyped using GBS and evaluated for grain yield and other agronomic traits at multiple Nebraska locations. Reported findings are important to the application of GBS to selection for grain yield in future soybean breeding efforts.


Germplasm and phenotyping

Three hundred and one soybean experimental lines currently in advanced stages of the University of Nebraska-Lincoln Soybean Breeding Program were sampled. Two hundred and seventy-five lines were in the F5:8 generation and twenty-six lines were in the F4:7 generation. Soybean lines belonged to maturity groups I (N = 64), II (N = 213), and III (N = 24) (Table 1) and represent 34 biparental families ranging in size from one to 28 lines per family. Median family size was 8.

Table 1 Number of lines belonging to each maturity group (MG) and grown at each Nebraska location

During the summer of 2011, soybean lines were grown in two-row plots (0.76 m apart, 2.9 m long) seeded to a density of 26 seeds per square meter. Plots were arranged in an augmented incomplete block design with two replications. Blocks consisted of 27 – 39 experimental entries and three check cultivars. Lines belonging to maturity groups I and II were evaluated at the Nebraska locations Beemer, Phillips, Cotesfield, and Mead. Lines belong to maturity group III were evaluated at the Nebraska locations Phillips, Mead, Lincoln, and Clay Center (Table 1). Grain yield (GY; Mg ha-1) was measured at all locations, plant height (PH; cm) was measured only at Mead, and days to maturity (MD) was measured at Beemer, Phillips and Mead. Grain yield was recorded as machine harvestable grain yield adjusted to 13% moisture. Plant height was measured as the distance (in centimeters) between the surface of the soil and the main-stem apical meristem. Days to maturity was defined as the number of days from planting until the R8 stage when 95% of the pods were mature and brown in color.

Phenotypes were adjusted to remove location and block effects according to the model:

y ijkl =μ+ g i + e j + r k j + b l k +g e ij + ϵ ijkl

where g i represents the effect of the ith genotype (i.e., soybean line), e j represents the effect of the jth location, r k(j) represents the effect of the kth replicate nested in location j, b l(k) represents the effect of the lth incomplete block nested within replicate k, and. Best linear unbiased estimates were calculated for soybean lines and input into the genomic prediction models described below. The genotype effect was also treated as random in order to estimate variance components for the purpose of estimating heritability. Broad-sense heritability (H2) on an entry-mean basis was calculated as H 2 = σ G 2 σ G 2 + σ GE 2 er + σ ϵ 2 r where σ G 2 is the variance among soybean lines, σ GE 2 is the genotype-by-environment interaction variance, σ ϵ 2 is the residual variation, and e and r in this context are the number of environements and replications within environments, respectively.


Leaf discs were collected from 12 random plants of each soybean line at approximately the V6 growth stage. DNA isolation was performed using the Qiagen DNeasy Plant 96 kit. Samples were sent to the Institute of Genomic Diversity at Cornell University for genotyping by sequencing as described in [11] and at Briefly, DNA samples were digested with the ApeKI restriction enzyme followed by ligation of adapters to fragment ends. Adapters consisted of Illumina sequencing primers and a barcode adapter. After adapter ligation, samples are combined into pools consisting of 96 samples. A PCR amplification is carried out to create the GBS libraries, which are submitted to a single Illumina HiSeq2000 flow cell for sequencing. Four sequenced libraries produced on average 247,255,883 reads, of which on average 219,580,690 were good, barcoded reads.

The GBS analysis pipeline as implemented in Tassel Version 3.0.156 was used to call SNPs. Briefly, 1) tag counts were generated from fastq files with the FastqToTagCountPlugin (options: -s 300000000, -c 1), 2) tag counts were merged with the MergeMultipleTagCountPlugin (options: -c 5), 3) tags were aligned to the reference genome Gmax_109_softmasked.fa.gz, which was downloaded from on 16 July 2012 and indexed for use with bwa version 0.6.1-r104. BWA version 0.7.3a-r367 was used for alignment (-n 0.04). Chromosomes were renamed for compatibility with the GBS pipeline by removing the leading 'Gm’ and 'scaffold_’, and then converted to the tags-on-physical-map format using the SAMConverterPlugin 4). Counts of tags per individual (taxa) were generated with the FastqToTBTPlugin (options: -c 1 –s 300000000, -y), 5) Counts of tags per individual were merged with the MergeTagsByTaxaFilesPlugin (options: -s 300000000, -x), 6) SNPs were called using the TagsToSNPByAlignmentPlugin (options: -mnMAF 0.01, -mnLCov 0.1, -mnMAC 10, -mxSites 1,000,000). Duplicate sites were merged with the MergeDuplicateSNPsPlugin (options: -callHets, -misMat 0.05), and duplicated taxa were merged with the MergeIdenticalTaxaPlugin (options: -hetFreq 0.8). Scaffolds were ignored for SNP calling.

Genomic prediction models

Base pair calls contained in the hapmap file obtained from the GBS analysis pipeline were converted to numerical genotype scores, x {0, 1, 2}, where x is the number of copies of the major allele.

Two genomic prediction models were studied: a standard G-BLUP model including only additive effects, and an extended version of the G-BLUP model also including additive-by-additive effects. Two different formulations of additive-by-additive effects have been presented in the literature [23, 24] and both of them were considered.

The standard G-BLUP model including additive effects only is

y i =μ+ g i + ϵ i

where y i represents the phenotype of the ith line, μ represents the intercept, g i represents the additive genetic value, and e i represents the residual. The additive genetic value can be estimated as g i = j = 1 p x ij b j , where x ij is the genotype score at the jth (j = 1,…,p) locus in the ith (i = 1,…,n) line, and b j is the allelic substitution effect (marker effect) at the jth marker locus. Marker effects were considered as IID random variables from a normal distributions such that b j ~ IID N 0 , σ b 2 . From properties of the multivariate normal distribution the vector g = Xb, (g = [g1, …, g n ]′), follows a multivariate normal distribution with null mean and covariance matrix Cov g =X X σ b 2 =G σ g 2 where G= 1 p X X and σ g 2 =p σ b 2 . Hereafter, we centered and standardized genotype scores by dividing by 2 θ j 1 - θ j , where θ j is the estimated allele frequency at the jth marker locus. The G matrix is a genomic realized relationships matrix whose entries are given by G i i = j = 1 p x ij - 2 θ j x i j - 2 θ j j = 1 p 2 θ j 1 - θ j . Summarizing above stated assumptions the model (1) becomes a mixed effects model with g ¯ ~N 0 , G σ g 2 and ϵ i ~ IID N 0 , σ ϵ 2 . Using this model, the lines are related through the off-diagonal values of G matrix, allowing the borrowing of information between lines to predict performance of lines not phenotyped.

Additive-by-additive effects were modeled using two different covariance structures among lines. Several authors [25, 26] modeled additive-by-additive epistasis through a GG matrix following Cockerham (1954) and Kempthorne (1954), where represents the Hadamard, or element-wise, multiplication operation. The first model including additive-by-additive epistasis was

y i =μ+g g i + ϵ i

with gg ¯ ~N 0 , G G σ gg 2 and ϵ i ~ IID N 0 , σ ϵ 2 .

More recently, Xu [24] proposed an alternative way to include these interaction effects using the covariance structure given by Κ aa = 1 C aa K aa * , with K aa * = j = 1 p j = k + 1 p - 1 Z k Z k Z k Z k , C aa =mean diag K aa * and Z k is the jth marker locus such that Z ij = + 1 for A 0 for H 1 for B . Using this assumption a different version of the common epistasis model is given by:

y i =μ+ k i + ϵ i

with k ¯ ~N 0 , K σ aa 2 and ϵ i ~ IID N 0 , σ ϵ 2 .

Modeling additive and additive-by-additive components was conducted to assess the proportion of the phenotypic variance accounted for by these effects and improvements in accuracy of genomic prediction. By combining models (1) and (2), a model including additive and epistatic effects was formulated:

y i =μ+ g i +g g i + ϵ i

with g ¯ ~N 0 , G σ g 2 , gg ¯ ~N 0 , G G σ gg 2 and ϵ i ~ IID N 0 , σ ϵ 2 .

The alternative additive and additive-by-additive model following Xu [24] was built combining models (1) and (3):

y i =μ+ g i + k i + ϵ i

with g ¯ ~N 0 , G σ g 2 , k ¯ ~N 0 , K σ aa 2 and ϵ i ~ IID N 0 , σ ϵ 2 .

Models (1)-(5) were fitted to the full data set using computational methods described in [27]. All the statistical analyses were implemented in the R statistical software [28].

Marker imputation

Genotyping-by-sequencing data sets typically have high rates of missing data [14, 16]. Three imputation methods were considered to impute missing values of the soybean GBS data set. (i) Naïve imputation substitutes missing values at each locus with 2θ j , where θ j is the estimated frequency of the major allele at the jth locus. This method is not expected to add information, but rather serves the purpose of ensuring unchanged allele frequencies after imputation and provides a marker matrix containing no missing data so that analytical operations can be performed. (ii) Random forest imputation is based on random forest regression introduced by Breiman [29]. Marker imputation for this study was performed using the MissForest R package according to [18]. The algorithm was performed chromosome-wise and for each PMV and MAF combination in parallel. (iii) FILLIN (HI) is a novel imputation method based on haplotypes, which is implemented in TASSEL 5.0. Default settings were used with the exception of maximum heterozygosity, which was set to 0.30 using the option -mxHet. Detailed information can be found in the TASSEL 5.0 User Guide at

Varying factors affecting genomic prediction accuracy

To evaluate the effects of GBS marker selection and imputation methods on genomic prediction accuracy, two criteria for filtering SNPs were considered. Filtering of GBS SNPs was done sequentially, first filtering based on percent missing values (PMV), and then, for minor-allele frequency (MAF). Levels for PMV (27) and MAF (12) were (l) 120, 25, 30, 40, 50, 60, 70, and 80%, and (m) 0.05-0.1, 0.15, 0.20, 0.25, 0.30, 0.35, and 0.40, respectively. Markers were filtered based on all possible combinations of PMV and MAF. After filtering, remaining missing values were imputed using each of the three imputation methods described above. This produced 972 marker datasets (e.g., 27 PMV levels × 12 MAF levels × 3 imputation methods).

All comparisons were made on the basis of the correlation between the observed phenotype and the predicted breeding value, which is referred to as predictive ability, following [30]. We reserve the term prediction accuracy (r gp ) for the correlation between the prediction and the true breeding value. Prediction accuracy can be approximated by dividing predictive ability by H 2 [8, 31]. Predictive ability ( r g ^ p ) of each marker filtering criteria was evaluated using 10-fold cross validation replicated 200 times. The mean predictive ability across the 200 replicates was calculated and bootstrap confidence intervals.

The impact of training population size on prediction accuracy was evaluated using a validation set comprised of 50 randomly selected lines and training sets of variable sizes. The validation set was formed by randomly sampling 50 lines without replacement. From the remaining 251 lines, the training population of size n was formed sequentially such that its size ranged between 2 and 251. First, two lines were sampled (i.e., n = 2) without replacement, then, from the remaining 251-n lines, additional lines were incorporated to the training set, by increments of one. Once a line was sampled, it remained in the training set. The validation set was held constant with the initial 50 lines. Two GBS marker subsets were used to evaluate training population size effect: 1) PMV ≤ 5% and MAF > 0.05; and 2) PMV ≤ 80% and MAF > 0.3. This procedure was repeated 1000 times and accuracies at each training population size were averaged across replicates.


A total of 5,770,366 unique 64-bp sequence tags were identified across all four soybean libraries, of which 68.75% aligned uniquely to the reference genome, 11.32% aligned to multiple positions and 19.92% could not be aligned. The mean (median) sequencing depth per SNP locus was 11 (6), with mean (median) proportion “missingness” per SNP locus of 0.18 (0.08).Unique tag counts and SNP density were higher towards the chromosome ends compared to pericentromeric regions (Figure 1). The GBS protocol targets gene-rich regions, such as the chromosome ends, through the use of methylation sensitive restriction enzymes. Related to the distribution of uniqe tag counts, percent missing values were lower towards chromosome ends and higher towards the pericentromeric regions. There was no apparent pattern regarding MAF with the exception of some chromosomes harboring more diversity than others (e.g., chromosomes 11 and 20 versus chromosomes 15 and 18) (Figure 1). The number of SNPs remaining after filtering by MAF and PMV is shown in Figure 2. The number of SNPs available with cuttoff values of PMV ≤ 80% and MAF > 0.05 was 52,349. There were 16,502 SNPs with PMV ≤ 5% and MAF > 0.05.

Figure 1
figure 1

Genoptyping by sequencing parameters on 301 elite soybean breeding lines. Parameters were calculated using a 100 kbp window with a 50 kbp slide. From outside to inside: 1) Unique 64-bp sequence tags per window; 2) SNP density; 3) Minor-allele frequency; 4) Percent missing values. For the 64-bp sequence tag and SNP density heatmaps, hot colors represent larger values on a log base three scale.

Figure 2
figure 2

Number of SNPs remaining after applying filtering by combinations of minor-allele frequency and percent missing values.

The high quality of the phenotypic data was reflected with relatively high heritabilities of 0.69 for GY and 0.94 for MD (Table 2). As expected, the genotype-by-environment interaction source of variation was greater for GY compared to MD. The overall r g ^ p of G-BLUP using a SNP dataset with cuttoff values of PMV ≤ 80% and MAF ≥ 0.05, and the Naïve imputation method, was 0.565 for GY, 0.374 for PH, and 0.644 for MD. Prediction accuracy estimates for GY, PH, and MD were, 0.64, 0.42 and 0.65, respectively.

Table 2 Summary of phenotypic data analysis for grain yield (GY), plant height (PH) and days to maturity (MD)

The effect of SNP filtering on r g ^ p was assessed by building a series of G-BLUP models using SNP datasets created by applying combinations of MAF and PMV filtering criteria. Number of SNPs is quickly reduced as SNPs are filtered based on MAF and PMV (Figure 2). Overall, marker filtering criteria did not have a large effect on r g ^ p for GY, but some important effects were observed for PH and MD (Figure 3). For PH, r g ^ p was greater when markers with MAF between 0.08 and 0.10 were used compared to all other MAF cutoffs. When considering jointly both filtering criteria, the r g ^ p of a trait were maximized at different combinations between PMV and MAF. For GY the maximum r g ^ p (0.59) was obtained with a marker dataset including SNPs with up to 80 PMV and MAF greater than 0.30. For PH and MD, r g ^ p was maximized when only SNPs with lower PMV were included (Figure 3).

Figure 3
figure 3

Average predictive abilities (y-axis) at each combination of minor-allele frequency (MAF) and percent missing value (PMV) for grain yield (GY), plant height (PH) and days to maturity (MD). Naïve imputation was used to fill in missing values.


No significant differences in r g ^ p were found among the imputation methods (Figure 4). When Naïve imputation was used, r g ^ p was slightly reduced by including SNPs with high levels of PMV. When random forest imputation was used, however, r g ^ p was maximized when all SNPs were included in the model. A random forest imputation with 80 PMV provided the highest r g ^ p overall (Figure 4). The random forest method provided numerically higher r g ^ p than the HI method at high PMV levels, but differences were not statistically significant.

Figure 4
figure 4

Average predictive ability and corresponding 95% bootstrap confidence intervals for multiple levels of percent missing values (PMV) and three imputation methods: Naïve, random forest imputation (RFI), and haplotype-based imputation (HI).

Model comparison

Contribution of polygenic additive-by-additive epistatic interactions to total phenotypic variation was assessed by constructing epistatic relationship matrices using the Hadamard product of the additive relationship matrix [23], as well as the marker-generated additive-by-additive relationship matrix as described by Xu [24]. For these comparisons, a marker set including SNPs with PMV ≤ 80% and MAF > 0.05 was used. Missing values were imputed using the Naïve method.

The percentages of phenotypic variation accounted for by each model term varied across traits. For GY, the realized additive relationship matrix captured 91.2% of total phenotypic variance. Since the G°G and Kaa matrices are highly collinear with G (data not shown), the epistatic relationship matrices accounted for similar amounts of variation (Table 3). When combining both additive and epistatic effects into the same model, the epistatic effects account for variable amounts of phenotypic variation. Nevertheless, the percentage of residual variation is fairly constant (Table 3), indicating that including an additive-by-additive epistasis relationship matrix provides no improvement over standard additive G-BLUP models. This was also observed using the cross-validation approach to evaluate r g ^ p . No difference was observed among the models regarding r g ^ p (Table 4).

Table 3 Percentage of phenotypic variation in grain yield (GY), plant height (PH), and days to maturity (MD) explained by additive and non-additive effects included in models 1 – 5
Table 4 Predictive abilities for grain yield (GY), plant height (PH) and days to maturity (MD) under models [1] – [5]

Training population size

For a set of SNPs with PMV ≤5% and MAF > 0.05, r g ^ p plateaued around a training population size just greater than 100 (Figure 5). Predictive ability, however, did steadily increase up until the maximum training population size possible in the cross validation strategy. Predictive ability was improved at lower TP sizes only when a MAF > 0.30 was used to construct G. For example, using a MAF > 0.05 and TP size of 50, r g ^ p was only 0.28, but when a MAF > 0.30 was used, r g ^ p was increased to 0.47. A TP consisting of at least 100 individuals was required to reach this r g ^ p using SNPs with MAF > 0.05 (Figure 5).

Figure 5
figure 5

Relationship between predictive ability and training population size for multiple levels of percent missing values (PMV) and minor-allele frequency (MAF). The trait displayed here is grain yield (GY).


The results of this study suggest that the use of GBS for genomic selection holds good potential for improving soybean grain yield. Using a cross-validation approach, genomic predictions explained ~32% of the variation in yield phenotypes. After using the phenotype heritability to correct for random environmental deviations included in the validation phenotypes, approximately 41% of the variation in genotypic values was explained by genomic predictions. Because validation phenotypes (i.e., soybean line means) include both additive and non-additive effects, and genomic predictions using the G-BLUP model include only additive effects, this estimated prediction accuracy is likely biased downward.

In order to quantify the relative benefit of genomic selection over phenotypic selection, Technow et al. [32] rearranged the formula for relative response to indirect selection to obtain the inequality L Y < r A H X L X , where LY is the cycle length of genomic selection, rA is the genomic prediction accuracy, HX is the phenotypic selection accuracy, and LX is the cycle length of phenotypic selection [32]. Substituting the values estimated herein for grain yield into this formula indicates that genomic selection is expected to be superior to phenotypic selection in terms of genetic gain per unit time if the cycle length of genomic selection is less than 77% the cycle length of phenotypic selection. It is entirely possible for a genomic selection cycle to be 66% of a phenotypic selection cycle based on the structure of the University of Nebraska Soybean Breeding Program. On top of this, the above formula assumes equal selection intensities for genomic and phenotypic selection. As genotyping costs continue to decline, selection intensity for genomic prediction could be increased compared to phenotypic selection at equal cost. Before soybean breeding programs incorporate genomic selection on a large scale, these results need to be validated through comparisons of phenotypes and genomic predictions across years, as well as by comparison of progenies from phenotypic and genomic selection programs.

The high genomic prediction accuracy observed was fairly consistent across SNP datasets with differing levels of PMV. More than 16,000 SNPs were scored with less than 5 PMV using GBS, which is surely more than is needed to ensure good SNP-QTL linkage disequilibrium among elite soybean breeding progenies [33]. Little to no sacrifice in accuracy was observed when SNPs with up to 80 PMV were included. It might be desirable to reduce the SNP numbers to ease computational requirements when predicting individual SNP effects and summing effects to calculate genomic predictions. However, more saturated SNP datasets may be more desirable for computing genomic predictions of multi-family selection schemes of more diverse germplasm. The G-BLUP approach is more computationally efficient with computational demands scaling with individual number rather than marker number. Knowing that data filtering steps are not likely needed for using GBS data for genomic prediction reduces the number of optimization steps and simplifies the process.

We failed to find significant differences among imputation methods, including differences between Naïve imputation and the other two which use covariance information between nearby SNPs. While not significantly better, the machine learning, non-parametric method called random forest performed best when SNPs with up to 80 PMV were included. This was also observed by Rutkoski et al. [18], but these authors did not compare random forest imputation with a method using marker order information. We observed that a haplotype-based method, which used marker order information from the soybean physical map, was not superior to random forest imputation. Random forest has often been used for imputing markers for genomic selection in plant breeding, especially when a reference genome is not available [10, 15, 18]. A haplotype-based method was also used for GBS data by Crossa et al. [16], but these authors also observed very little to no advantage over Naïve imputation. This general lack of benefit to imputation is likely due to the fact the genomic prediction is robust to missing marker data [18] and the number of markers with relatively low PMV provided by GBS is more than enough to cover the linkage disequilibrium space in crop breeding germplasm.

Rather than compare shrinkage models and Bayesian variable selection models for prediction accuracy as has been frequently performed previously (e.g., [10, 21]), we compared G-BLUP models including additive effects only against those also including additive-by-additive effects. Additive-by-additive interaction effects were incorporated into the model in the Cockerham-Kempthorn fashion by including a random polygenic interaction effect with a covariance structure specified as the Hadamard product of the additive genomic relationship matrix [34]. This model makes many assumptions, and the soybean population clearly violates the assumptions of linkage equilibrium between loci and randomly mating individuals. Because of this violation, another formulation of the additive-by-additive relationship matrix according to Xu [24] was used. It turned out the Kaa matrix calculated according to Xu [24] was highly collinear with the simple Cockerham-Kempthorn Hadamard product and explained similar amounts of phenotypic variation. Neither GG nor Kaa was orthogonal to the G matrix as can be seen by the variance component estimation. Similar amounts of variation were explained when any of these effects were included in the model alone or together. The amount of residual variation was actually slightly smaller when only G was modeled and genomic prediction accuracies were not enhanced by including additive-by-additive effects using either the Cockerham-Kempthorn or Xu [24] formulation.Low to moderately sized training populations could be used in a soybean breeding program to achieve adequate prediction accuracies (Figure 5). Although it’s probably not necessary to reduce TP sizes down to such a low level, training population sizes could be reduced further if only SNPs with higher MAF are included. The underlying reason for this is not clear. It is possible that SNPs with low MAF are not sampled by chance when small training populations are sampled and phenotyped. If they are not polymorphic in the TP, they cannot contribute information to the relationships between individuals, which is the basis of predictions in G-BLUP. When TP size is increased, SNP alleles with low frequency are more likely to be adequately represented in the TP. When MAF threshold is higher, this problem is reduced by the fact that all SNP alleles have a reasonable chance of contributing to relationship even when TPS sizes are small.


This first look at GBS for genomic prediction in soybean suggests GBS holds good potential to enhance genetic gain in soybean. Over 16,000 SNPs were scored with less than 5% missing data. Filtering markers based on amount of missing data had little to no effect. No differences were observed among imputation methods. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy.



Minor-allele frequency


Percent missing values


Random forest imputation


Marker-assisted selection


Genomic selection


Genomic best linear unbiased prediction

r PP :

Predictive ability


Grain yield


Plant height


Days to maturity


Soybean cyst nematode


Genotyping by sequencing


Single nucleotide polymorphism


Haplotype-based imputation.


  1. Mudge J, Cregan PB, Kenworthy JP, Kenworthy WJ, Orf JH, Young ND: Two Microsatellite Markers That Flank the Major Soybean Cyst Nematode Resistance Locus. Crop Sci. 1997, 37: 1611-10.2135/cropsci1997.0011183X003700050034x.

    Article  CAS  Google Scholar 

  2. Zhao Y, Gowda M, Liu W, Würschum T, Maurer HP, Longin FH, Ranc N, Reif JC: Accuracy of genomic selection in European maize elite breeding populations. Theor Appl Genet. 2012, 124: 769-776. 10.1007/s00122-011-1745-y.

    Article  PubMed  Google Scholar 

  3. Guzman PS, Diers BW, Neece DJ, St. Martin SK, LeRoy AR, Grau CR, Hughes TJ, Nelson RL: QTL Associated with Yield in Three Backcross-Derived Populations of Soybean. Crop Sci. 2007, 47: 111-10.2135/cropsci2006.01.0003.

    Article  CAS  Google Scholar 

  4. Orf JH, Chase K, Jarvik T, Mansur LM, Cregan PB, Adler FR, Lark KG: Genetics of soybean agronomic traits. I. Comparison of three related recombinant inbred populations. Crop Sci. 1999, 39 (6): 1642-1651. 10.2135/cropsci1999.3961642x.

    Article  Google Scholar 

  5. Yuan J, Njiti VN, Meksem K, Iqbal MJ, Triwitayakorn K, Kassem MA, Davis GT, Schmidt ME, Lightfoot DA: Quantitative trait loci in Two Soybean Recombinant Inbred Line Populations Segregating for Yield and Disease Resistance. Crop Sci. 2002, 42: 271-277. 10.2135/cropsci2002.0271.

    Article  CAS  PubMed  Google Scholar 

  6. Concibido VC, La Vallee B, McLaird P, Pineda N, Meyer J, Hummel L, Yang J, Wu K, Delannay X: Introgression of a quantitative trait locus for yield from Glycine soja into commercial soybean cultivars. Theor Appl Genet. 2003, 106: 575-582.

    CAS  PubMed  Google Scholar 

  7. Sebastian SA, Streit LG, Stephens PA, Thompson JA, Hedges BR, Fabrizius MA, Soper JF, Schmidt DH, Kallem RL, Hinds MA, Feng L, Hoeck JA: Context-Specific Marker-Assisted Selection for Improved Grain Yield in Elite Soybean Populations. Crop Sci. 2010, 50: 1196-10.2135/cropsci2009.02.0078.

    Article  Google Scholar 

  8. Lorenz AJ, Chao S, Asoro FG, Heffner EL, Hayashi T, Iwata H, Smith KP, Sorrels ME, Jannink J-L: Chapter Two – Genomic Selection in Plant Breeding: Knowledge and Prospects. Adv Agron. 2011, 110: 77-123.

    Article  Google Scholar 

  9. de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL: Whole Genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics. 2013, 193: 327-345. 10.1534/genetics.112.143313.

    Article  PubMed Central  PubMed  Google Scholar 

  10. Heslot N, Rutkoski J, Poland J, Jannink J-L, Sorrells ME: Impact of Marker Ascertainment Bias on Genomic Selection Accuracy and Estimates of Genetic Diversity. PLoS One. 2013, 8: e74612-10.1371/journal.pone.0074612.

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  11. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE: A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. PLoS One. 2011, 6: e19379-10.1371/journal.pone.0019379.

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  12. Poland JA, Rife TW: Genotyping-by-Sequencing for Plant Breeding and Genetics. Plant Gen. 2012, 5: 92-10.3835/plantgenome2012.05.0005.

    Article  CAS  Google Scholar 

  13. Romay MC, Millard MJ, Glaubitz JC, Peiffer JA, Swarts KL, Casstevens TM, Elshire RJ, Acharya CB, Mitchell SE, Flint-Garcia SA, McMullen MD, Holland JB, Buckler ES, Gardner CA: Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 2013, 14: R55-10.1186/gb-2013-14-6-r55.

    Article  PubMed Central  PubMed  Google Scholar 

  14. Poland J, Endelman J, Dawson J, Rutkoski J, Wu S, Manes Y, Dreisigacker S, Crossa J, Sánchez-Villeda H, Sorrells M, Jannink J-L: Genomic Selection in Wheat Breeding using Genotyping-by-Sequencing. Plant Gen. 2012, 5: 103-10.3835/plantgenome2012.06.0006.

    Article  CAS  Google Scholar 

  15. Ly D, Hamblin M, Rabbi I, Melaku G, Bakare M, Gauch HG, Okechukwu R, Dixon AGO, Kulakow P, Jannink J-L: Relatedness and Genotype × Environment Interaction Affect Prediction Accuracies in Genomic Selection: A Study in Cassava. Crop Sci. 2013, 53: 1312-10.2135/cropsci2012.11.0653.

    Article  Google Scholar 

  16. Crossa J, Beyene Y, Kassa S, Pérez P, Hickey JM, Chen C, de los Campos G, Burgueño J, Windhausen VS, Buckler E, Jannink J-L, Cruz MAL, Babu R: Genomic Prediction in Maize Breeding Populations with Genotyping-by-Sequencing. G3. 2013, 3: 1903-1926. 2013.

    Article  PubMed Central  PubMed  Google Scholar 

  17. Sonah H, Bastien M, Iquira E, Tardivel A, Légaré G, Boyle B, Normandeau É, Laroche J, Larose S, Jean M, Belzile F: An Improved Genotyping by Sequencing (GBS) Approach Offering Increased Versatility and Efficiency of SNP Discovery and Genotyping. PLoS One. 2013, 8: e54603-10.1371/journal.pone.0054603.

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  18. Rutkoski JE, Poland J, Jannink J-L, Sorrells ME: Imputation of unordered markers and the impact on genomic selection accuracy. G3. 2013, 3: 427-439. 2013.

    Article  PubMed Central  PubMed  Google Scholar 

  19. Burgueño J, de los Campos G, Weigel K, Crossa J: Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 2012, 52: 707-10.2135/cropsci2011.06.0299.

    Article  Google Scholar 

  20. Heslot N, Yang H-P, Sorrells ME, Jannink J-L: Genomic Selection in Plant Breeding: A Comparison of Models. Crop Sci. 2012, 52: 146-10.2135/cropsci2011.06.0297.

    Article  Google Scholar 

  21. Lorenz AJ, Smith KP, Jannink J-L: Potential and Optimization of Genomic Selection for Fusarium Head Blight Resistance in Six-Row Barley. Crop Sci. 2012, 52: 1609-10.2135/cropsci2011.09.0503.

    Article  Google Scholar 

  22. Massman JM, Jung H-JG, Bernardo R: Genomewide Selection versus Marker-assisted Recurrent Selection to Improve Grain Yield and Stover-quality Traits for Cellulosic Ethanol in Maize. Crop Sci. 2012, 53: 58-66.

    Article  Google Scholar 

  23. Cockerham CC: An Extension of the Concept of Partitioning Hereditary Variance for Analysis of Covariances among Relatives When Epistasis Is Present. Genetics. 1954, 39: 859-882.

    CAS  PubMed Central  PubMed  Google Scholar 

  24. Xu S: Mapping Quantitative Trait Loci by Controlling Polygenic Background Effects. Genetics. 2013, 195: 1209-1222. 10.1534/genetics.113.157032.

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  25. Henderson CR: Applications of Linear Models in Animal Breeding. 1984, Ontario, Canada: University of Guelph

    Google Scholar 

  26. Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, Stricker C, Gianola D, Schlather M, Mackay TFC, Simianer H: Using Whole-Genome Sequence Data to Predict Quantitative Trait Phenotypes in Drosophila melanogaster. PLoS Genet. 2012, 8: e1002685-10.1371/journal.pgen.1002685.

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  27. de los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J: Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res. 2010, 92: 295-308. 10.1017/S0016672310000285.

    Article  CAS  Google Scholar 

  28. R Core Team: A Language and Environment for Statistical Computing. 2013, Vienna: R Foundation for Statistical Computing

    Google Scholar 

  29. Breiman L: Random Forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.

    Article  Google Scholar 

  30. Riedelsheimer C, Czedik-Eysenberg A, Grieder C, Lisec J, Technow F, Sulpice R, Altmann T, Stitt M, Willmitzer L, Melchinger AE: Genomic and metabolic prediction of complex heterotic traits in hybrid maize. Nat Genet. 2012, 44: 217-220. 10.1038/ng.1033.

    Article  CAS  PubMed  Google Scholar 

  31. Legarra A, Robert-Granié C, Manfredi E, Elsen J-M: Performance of Genomic Selection in Mice. Genetics. 2008, 180: 611-618. 10.1534/genetics.108.088575.

    Article  PubMed Central  PubMed  Google Scholar 

  32. Technow F, Bürger A, Melchinger AE: Genomic prediction of northern corn leaf blight resistance in maize with combined or separated training sets for heterotic groups. G3. 2013, 3: 197-203. 2013.

    Article  PubMed Central  PubMed  Google Scholar 

  33. Hyten DL, Choi I-Y, Song Q, Shoemaker RC, Nelson RL, Costa JM, Specht JE, Cregan PB: Highly Variable Patterns of Linkage Disequilibrium in Multiple Soybean Populations. Genetics. 2007, 175: 1937-1944. 10.1534/genetics.106.069740.

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  34. Gianola D, de los Campos G: Inferring genetic values for quantitative traits non-parametrically. Genet Res. 2008, 90: 525-540. 10.1017/S0016672308009890.

    Article  CAS  Google Scholar 

Download references


The authors are grateful to the Nebraska Soybean Board for providing funding.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Aaron Lorenz.

Additional information

Competing interests

The authors declare they have no competing interests.

Authors’ contributions

DJ performed the genomic predictions, statistical analysis, and drafted the manuscript. KK participated in the collection of phenotype and genotype data, analyzed the phenotypic data, and participated in drafting the manuscript. LP participated in the collection of genotypic data and edited the manuscript. KH performed the bioinformatics, summarized the SNP data, and edited the manuscript. JJ participated in the collection of phenotype and genotype data. GG managed the data collection, participated in the design of the study, and edited the manuscript. AL participated in the design of the study, managed the statistical analysis, and drafted the manuscript. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit

The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Jarquín, D., Kocak, K., Posadas, L. et al. Genotyping by sequencing for genomic prediction in a soybean breeding population. BMC Genomics 15, 740 (2014).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: