Genome-wide association study and genomic selection for soybean chlorophyll content associated with soybean cyst nematode tolerance

Background Soybean cyst nematode (SCN), Heterodera glycines Ichinohe, has been one of the most devastating pathogens affecting soybean production. In the United States alone, SCN damage accounted for more than $1 billion loss annually. With a narrow genetic background of the currently available SCN-resistant commercial cultivars, high risk of resistance breakdown can occur. The objectives of this study were to conduct a genome-wide association study (GWAS) to identify QTL, SNP markers, and candidate genes associated with soybean leaf chlorophyll content tolerance to SCN infection, and to carry out a genomic selection (GS) study for the chlorophyll content tolerance. Results A total of 172 soybean genotypes were evaluated for the effect of SCN HG Type 1.2.3.5.6.7 (race 4) on soybean leaf chlorophyll. The soybean lines were genotyped using a total of 4089 filtered and high-quality SNPs. Results showed that (1) a large variation in SCN tolerance based on leaf chlorophyll content indices (CCI); (2) a total of 22, 14, and 16 SNPs associated with CCI of non-SCN-infected plants, SCN-infected plants, and reduction of CCI SCN, respectively; (3) a new locus of chlorophyll content tolerance to SCN mapped on chromosome 3; (4) candidate genes encoding for Leucine-rich repeat protein, plant hormone signaling molecules, and biomolecule transporters; and (5) an average GS accuracy ranging from 0.31 to 0.46 with all SNPs and varying from 0.55 to 0.76 when GWAS-derived SNP markers were used across five models. This study demonstrated the potential of using genome-wide selection to breed chlorophyll-content-tolerant soybean for managing SCN. Conclusions In this study, soybean accessions with higher CCI under SCN infestation, and molecular markers associated with chlorophyll content related to SCN were identified. In addition, a total of 15 candidate genes associated with chlorophyll content tolerance to SCN in soybean were also identified. These candidate genes will lead to a better understanding of the molecular mechanisms that control chlorophyll content tolerance to SCN in soybean. Genomic selection analysis of chlorophyll content tolerance to SCN showed that using significant SNPs obtained from GWAS could provide better GS accuracy.


Key message
To the best of our knowledge, this is the first report of QTL associated with chlorophyll content tolerance to soybean cyst nematode (SCN) in soybean.

Background
Soybean [Glycine max (L.) Merr.] is one of the most important legumes worldwide by providing oil and being a source of vegetable protein. Developing soybean-derived biofuel has been recently increasing, with an estimated value exceeding $35 billion in the United States (www. soystats.com). Soybean cyst nematode (SCN), Heterodera glycines Ichinohe, is an important pest with total annual yield losses about $1.5 billion in the U.S. alone [1]. The SCN is an obligate endoparasite, which feeds on soybean roots, depletes carbon of soybean plants and results in yield losses [2]. One pathway of SCN damage to soybean is induction or enhancement of nutritional deficiency of soybean such as iron, potassium, and/or nitrogen deficiencies that result in chlorophyll content reduction or in severe cases the typical chlorosis symptom [3,4]. Iron-deficiency chlorosis (IDC) of soybean, in particular, is common in the North Central region, the major soybean production region in the USA. It occurs in high pH soil, but many biotic and abiotic factors affect its occurrence [5][6][7][8]. The SCN is present in most soybean fields in the region, and high pH also favors reproduction of SCN and its damage to soybean plants [9]. Therefore managing SCN and nutritional deficiencies is important for soybean productivity in many fields in the North Central USA and some other regions in the world.
Use of SCN-resistant soybean cultivars and crop rotation involving a non-host crop is the best way to manage SCN [10,11]. Development of new SCN-resistant soybean cultivars requires a better understanding of the genetic mechanisms underlying SCN resistance. To date, at least 216 SCN-resistant QTL have been reported (www.soybase.org). A large number of those QTL have not been fully investigated [12]. Among the QTL conferring resistance to SCN, two loci, rhg1 and Rhg4, which are located on chromosomes 18 and 8, respectively, have been commonly used to deploy SCN resistance in soybean germplasm [13]. Both rhg1 and Rhg4 are required in the soybean cultivar 'Forest' to exhibit resistance to SCN, with Rhg4 being dominant [14]. This resistance has been known as Peking-type resistance because the source of resistance was from Peking. In contrast, the resistance in cultivars with PI 88788 source requires only rhg1, and the resistance is known as PI 88788-type [15].
Some studies of the genetic mechanism between the two aforementioned SCN-resistant loci have been reported. A gene mapped at the Rhg4 locus and conferring SCN resistance has been cloned [16]. This gene encodes for a serine hydroxymethyltransferase [16]. The SCN-resistant gene within the Rhg4 locus was derived from an artificial selection occurring during soybean domestication [17]. Resistance to SCN conferred by the rhg1 locus has been associated to copy number variation and DNA methylation, which can enhance the expression of SCN resistance genes within that locus [18]. Three genes in the rhg 1 locus encoding an amino acid transporter, an α-SNAP protein, a WI12 (wound-inducible domain) protein contribute to the SCN resistance [19,20].
The utilization of molecular markers through marker-assisted selection (MAS) in soybean breeding programs has been proven to accelerate the development of disease-resistant cultivars [21]. Recently, tools such as genome-wide association mapping (GWAS) and genomic selection (GS) have increasingly become popular in efforts towards uncovering the genetic basis of traits of interest in agriculture and identifying important new loci. GWAS has been used to identify new markers and loci associated with resistance to SCN. A total of 6 SSR markers associated with SCN resistance were identified in a set of 159 soybean lines [22]. GWAS was conducted on a total of 282 soybean genotypes to identify SNP markers associated with resistance to SCN HG type 0 [12]. Out of the 1536 SNPs used, a total of 7 SNP markers were associated with SCN resistance. Most of those significant SNP markers were located in the rhg1 locus. In addition, two genes, FGAM1 and Glyma18g46201, were located in the vicinity of two significant SNPs. A total of 19 SNP markers were reported to be associated with resistance to SCN HG type 0 and HG type 1.2.3.5.7 in an association panel consisting of 440 soybean genotypes, of which, three were mapped to loci that have not yet been reported [23]. A total of 553 soybean genotypes were evaluated for resistance to SCN HG type 0 and GWAS allowed for the discovery of 8 new loci associated with SCN on this association panel [24].
Genomic selection has been frequently used to achieve faster genetic gain in plant breeding [25]. Genomic selection has often been proven to have superior features over the traditional MAS when dealing with complex traits [12]. In the earliest genomic selection study on resistance to SCN [12], genomic selection accuracy for the SCN resistance was in the range of 0.59 to 0.67.
The objectives of this study were (i) to conduct a genome-wide association study to identify QTL associated with leaf chlorophyll content in soybean in SCN infested and non-infested soils, and the QTL associated with reduction of chlorophyll content by SCN; (ii) identify SNP markers and candidate genes associated with the traits; (iii) to carry out a genomic selection study for tolerance of soybean chlorophyll content to SCN infection.

Results
Chlorophyll content phenotyping associated with SCN Soybean leaf chlorophyll content (CCI) in non-SCNinfestation recorded at 8 weeks after planting was significantly different among the genotypes (F-value = 11.17, p-value< 0.0001) ( Table 1). The CCI was approximately normally distributed (Fig. 1) Table S1).
The distribution of CCI of soybean in the SCNinfested soil was nearly normal (Fig. 1). Significant differences in CCI in the SCN-infected plants were found among the genotypes (F-value = 9.43, p-value< 0.0001) ( Table 1) Table S1). The lowest CCI under SCN infestation was found for the genotypes PI257428 (19.  Table S1). Of the top 10 genotypes having the highest CCI under non-SCN infestation, 7 (MN1011CN, MN1106CN, M98240104, AGASSIZ, GRANDE, CHICO, and MN0502) had the highest CCI when grown in SCN-infested soils. Of the 10 genotypes grown in SCN free soils and having the lowest CCI, 4 (PI257428, MN1008SP, NORMAN, and PI437228) still showed the lowest CCI when grown in SCN-infested soils.
Tolerance to SCN based on CCI was assessed by computing the percentage reduction in CCI due to SCN infection. Percentage reduction in CCI by SCN was approximately normally distributed ( Fig. 1 Table S1) were the most affected by SCN, suggesting that the leaf chlorophyll content of these genotypes could be highly sensitive to SCN infection. Pearson's correlation coefficient between reduction in CCI and CCI without SCN was − 0.24. However, the correlation between reduction in CCI and CCI with SCN was − 0.85.

SNP profile
A total of 4089 high-quality SNPs were used for genome-wide association analysis. The average SNP number per chromosome was in the range of 144 to 269 SNPs, with an average of 204. Chromosome 11 with 144 SNPs had the lowest number of SNPs, whereas chromosome 18 with 269 SNPs had the highest number of SNPs ( Table 2). The average distance between two SNPs per chromosome varied from 119 kb to 352 kb, with an average of 251 kb. The shortest average distance between SNPs was found on chromosome 15, whereas the longest one was on chromosome 11 (Table 2). Average minor allele frequency (MAF) per chromosome ranged between 16.14 and 24.80%, with an average of 21.57% ( Table 2). Percentage of heterozygous SNPs per chromosome was in the range of 7.57 to 10.76%, and averaging 9.30% ( Table 2). Percentage of missing SNP per chromosome varied from 4.16 to 5.60%, with an average of 4.96% (Table 2).

Genome-wide association study (GWAS)
Genome-wide association study was conducted to identify SNPs associated with CCI under non-SCN infection, CCI in SCN-infected plants, and reduction in CCI by SCN. The number of significant SNPs varied among those aforementioned traits. A total of 22 SNPs were found to be significantly associated with CCI under non-infested condition. These SNPs were located on chromosomes 4, 5, 6, 7, 10, 11, 12, 13, 19, and 20 (Table 3). Of the 22 SNPs, five were found on chromosome 11 and 4 mapped on chromosome 6 ( Fig. 2a). The QQ-plot showed that the model used to assess the SNPs was robust (Fig. 2b) Table 3). Most of these high LOD value SNPs (LOD > 6) were located on chromosome 6 indicative of significant QTL associated with plant chlorophyll on this chromosome.
A total of 16 SNPs were found to be associated with reduction in CCI due to SCN. Those SNPs were located on chromosomes 2, 3,4,6,7,8,9,13,15,and 18 (Fig. 2e). Of the 16 SNPs, 4 were found on chromosome 8, suggesting significant QTL associated with tolerance to SCN in this region, based upon the reduction in CCI. The QQ-plot (Fig. 2f) (Table 3), which were found on chromosomes 13, 6, 7, 8, 4, and 6, respectively. Two of the most significant SNPs were located on chromosome 6, indicating probable QTL affecting SCN on this region.
An overlapping significant SNP, Gm19_48,074,289_A_ C, was found to be associated with both leaf chlorophyll content for non-SCN-infested and SCN-infested plants  (Table 3). Three overlapping significant SNPs, Gm06_50, 593,128_T_G, Gm13_39,378,998_G_A, and Gm15_43, 797,502_G_T, were also identified for leaf chlorophyll content of plants grown in soils with SCN and the reduction in CCI (Table 3), indicating these SNP markers may not be related to SCN tolerance. However, no overlapping SNPs were identified for the traits leaf chlorophyll content under non-SCN infestation and reduction in CCI due to SCN, suggesting that these SNP markers were associated with SCN tolerance.
A total of 13 candidate genes associated with leaf chlorophyll content for the SCN-infected plants were identified (Table 3). Of the 13 reported candidate genes, 10 had functional annotations and 2 encoded for proteins with unknown functions. These candidate genes were involved in biomolecule transporters such as importin, transcription factors such as sequencespecific DNA binding transcription factors, and plant    (Table 3). Results suggested a total of 15 candidate genes associated with chlorophyll content tolerance to SCN in soybean. Of the 15 candidate genes, 11 had functional annotations as reported in Table 3. Two genes, Glyma.13 g294200 and Glyma.15 g233100 encoding for a putative signaling peptide similar to TAX1 and a leucine-rich repeat-containing protein, respectively, were overlapping between CCI of SCNinfected plants and reduction in ICC. Most of the reported candidate genes encoded for molecule transporters within and between plant cells such as Glyma.06 g317100, Glyma.07 g191600, and Glyma.13 g294200. Candidate genes found within the most significant genomic regions containing the SNPs Gm13_39,378,998_G_A, Gm06_50,593,128_T_ G, Gm07_35,908,169_T_C, Gm08_11,501,419_A_C, and Gm06_16,315,206_A_G were Glyma.13 g294200, Glyma.06 g317100, Glyma.07 g191600, Glyma.08 g149800, and Glyma. 06 g187300, encoded for a putative signaling peptide similar to TAX1, protein transporter, secretory carrier membrane protein, Iron/ascorbate family oxidoreductases, and lipase (class 3) (Table 3).
Overall, selection efficiency and accuracy of the SNPs associated with reduction in CCI were lower than those of the SNPs associated with CCI for the non-SCNinfected plants and SCN-infected plants. For the reduction in CCI, selection accuracy was in the range of 44.07 and 68.48%, with an average of 54.56% ( Table 6). The SNP Gm15_43,797,502_G_T had the highest selection accuracy (68.18%), whereas the SNP Gm03_3,334,303_ C_A showed the lowest selection accuracy (44.07%). SNP selection efficiency varied from 29.55 to 45.45%, with an average of 35.75% ( Table 6). The SNP with the highest selection efficiency was Gm15_43,797,502_G_T (45.45%), whereas the one with the lowest selection efficiency was Gm03_3,334,303_C_A (29.55%). Favorable alleles corresponding to the most significant SNPs Gm13_39,378,998_G_A, Gm06_50,593,128_T_G, Gm07 _35,908,169_T_C, Gm08_11,501,419_A_C, and Gm04_5, 172,181_A_G were G, T, C, A, and A (Table 6).

Genomic selection (GS)
Genomic selection for CCI of non-SCN-infected plants, CCI of the SCN-infested plants, and reduction in CCI by SCN was conducted using 5 different statistical models. For the plants without SCN infection, average GS accuracy was 0.33, 0.23, 0.32, 0.38, and 0.28 for rrBLUP, gBLUP, Bayesian LASSO (BLR), Random Forest (RF), and Support Vector Machines (SVMs), respectively, when all 4089 SNPs were included in the models (Additional file 2: Table S2). Increase in GS accuracy was identified using GWAS-derived SNP markers for most of the statistical models expect rrBLUP. The highest increase was found when gBLUP was used (Fig. 3). When only significant SNPs were incorporated into the GS models, the Bayesian LASSO model provided the highest average GS accuracy (0.74), whereas the lowest one was recorded when rrBLUP was used (0.31), indicative of the GS accuracy being both SNP type and GS modelsensitive.
For CCI under SCN infestation, GS accuracy was 0.45, 0.41, 0.47, 0.51, and 0.44 for rrBLUP, gBLUP, BLR, RF, and SVMs (Additional file 2: Table S2), respectively, when all SNPs were used to estimate the genomic estimated breeding values (GEBVs). In contrast to the   results found for CCI under non-SCN infestation, GS increased by at least 39% when significant SNPs obtained from GWAS were used. Interestingly, the highest increase was found when rrBLUP was used (Fig. 4), and GS accuracy was the highest under the rrBLUP model (0.83), the second highest GS accuracy was provided by the Bayesian LASSO model (0.81), whereas SVMs showed the lowest GS accuracy (0.70). These results suggested that using significant SNPs obtained from GWAS could provide a better GS accuracy. GS accuracy for reduction in CCI by SCN was established. When all SNPs were used, the RF model exhibited the highest GS accuracy (0.41), whereas the lowest one was found under both rrBLUP and BLR (Additional file 2: Table S2). Significant increase in GS accuracy was found when only significant SNPs were considered (Fig. 5), which was similar to what was found for the two aforementioned traits. By only using GWAS-derived SNPs, GS accuracy was 0.79, 0.59, 0.77, 0.61, and 0.62 for rrBLUP, gBLUP, BLR, RF, and SVMs, respectively (Additional file 2: Table S2).

Discussion
SCN resistance has been evaluated based on female (cyst) counts as measurements of SCN reproduction in soybeaninfected plants. In this investigation, we evaluated tolerance of soybean to SCN based on leaf chlorophyll content. One pathway of SCN damage to soybean is reduction of chlorophyll content and induction of chlorotic symptoms [4]. However the molecular mechanisms involved in reduction of chlorophyll content and induction of chlorosis by SCN infection have not been studied. As far as we know, this investigation represents the first study of QTL associated with soybean chlorophyll content tolerance to SCN. Leaf chlorophyll content-based phenotyping strategy for SCN infection evaluation would allow for potential discovery of new loci associated with SCN tolerance, therefore making the genetic background broader for managing SCN, especially in the situation of the increasing SCN virulence. However, soybean tolerance to SCN should be based on yield response, and chlorophyll content can be one of factors contributing to soybean yield response [26]. Additional studies would definitely be required to establish a possible link between yield loss and reduction in chlorophyll under SCN infestation.
GWAS was performed in efforts to identify new loci conferring tolerance of soybean to SCN based on the assessment of reduction in leaf chlorophyll content, thus contributing to diversifying genes for SCN management. The use of GWAS to discover loci associated with SCN resistance has been shown to be promising in other studies [12,[22][23][24]. All previously reported GWAS investigations relied on mature female count to assess resistance to SCN and SNPs were associated with the female count. In this report, a total of 14 and 16 SNPs were found to be associated with CCI under SCN infestation and reduction in CCI by SCN infection, respectively. The significant SNP, Gm18_1,620,585_T_C, found on chromosome 18 and associated with CCI under SCN infestation was located at 88 Kbp upstream of the rhg1 locus. In addition, the SNP Gm18_1,427,298_G_T mapped on chromosome 18 and significantly associated with a reduction of CCI was located at 281 Kbp upstream from the rhg1 locus. These results indicate that this panel carries PI 88788 SCN-type resistance. Similar results were reported [12] stating that SNP markers located at even about 1 Mb from the rhg1 locus were still in high LD with that SCN-resistant locus. In addition, a SNP marker located at 23 Kbp from Gm18_1,620,585_ T_C and being tightly linked with the rhg1 locus was reported [23]. Therefore, the two aforementioned SNPs which were found at a distance less than 300 Kbp from the rhg 1 locus could be used in marker-assisted selection for SCN resistance. This finding suggested that assessing soybean tolerance to SCN based on chlorophyll reduction could provide useful result for selecting SCNtolerant genotypes. Most of the significant SNPs associated with both CCI under SCN and reduction of CCI by SCN were found within previously reported SCNresistant QTL and loci [12,23,24,[27][28][29][30][31]. In addition, the results suggested three new loci associated with chlorophyll content tolerance to SCN, of which, two were found on chromosome 3 and associated with the SNPs Gm03_3,334,303_C_A and Gm03_39,574,966_T_ C, and the third one was mapped on chromosome 6 and associated with the SNP Gm06_50,593,128_T_G. The discovery of these new loci would permit for diversifying SCN-tolerance genes for SCN management.
Selection efficiency and accuracy were computed for the most significant SNPs associated with CCI under non-SCN infestation, CCI for the SCN-infected plants, and reduction in CCI as reported in Tables 4, 5, and 6. The use of SNP selection and accuracy has been highlighted in other GWAS-related reports [32,33]. SNP selection accuracy and efficiency varied from medium to high in this study. This suggested that the significant SNPs identified from this investigation can be used for further marker-assisted selection for enhancing soybean resistance/tolerance to SCN.
Candidate genes found within a 10-kb region harboring significant SNPs have been established in Table 3. Candidate genes associated with CCI under non-SCN infestation encoded for proteins that were relevant to chlorophyll pathway. Functional annotation of the identified candidate genes consisted of chlorophyll A-B binding protein and 4-alpha-glucanotransferase found in photosynthetic leaves [34]. Proteins involved in plant development such as ROP interactive partner [35], forminrelated [36], homeobox transcription factor [37], and uridine kinase [38] were identified as well. In addition, genes encoding for proteins associated with plant nutrition such as asparagine synthetase [39] and sulfate transporter [40] were found. The results were indicative of the robustness of the significant SNPs reported in this study since they permitted the discovery of candidate genes relevant to chlorophyll pathway and plant nutrition uptake under non-SCN infestation.
The genomic region harboring the overlapping SNP (Gm19_48,074,289_A_C), which was associated with both CCI under non-SCN infestation and SCN- infestation, enclosed a gene encoding for an importin, which was responsible for biomolecule trafficking within plant cells [41]. This gene could impact the flow of nutrients during plant development even under plant stress such as SCN infection. In addition, an overlapping candidate gene encoding for a leucine-rich repeatcontaining protein (Table 3) was reported to enhance both leaf chlorophyll content under SCN infestation and tolerance to SCN in leaf chlorophyll content. Leucinerich repeat protein has been widely shown to be involved in plant resistance mechanism to pathogen attack [36]. Further investigations are required to validate the function of this gene and its involvement in SCN tolerance in soybean. In addition, a signaling peptide was found to be associated with chlorophyll content tolerance to SCN. Signaling peptides have been reported to be involved in plant development [42], suggesting that this gene could enhance plant survival under stress of SCN infestation. Protein transporters and membrane proteins were widely found to be involved in chlorophyll content tolerance to SCN in this study. These proteins enhance the flow of biomolecules and nutrients within and between cells [41], thus permitting plant survival under SCN infection. Moreover, proteins associated with plant hormone signaling such as ethylene-responsive element binding factor 13, BTB/POZ domain-containing protein, and F-box family protein were identified. These signaling proteins have been demonstrated to be directly involved in plant defense against pathogens [43].
The new locus found on chromosome 3 harbors a gene encoding cytochrome P450, which has been shown to contribute to both plant development and defense under pathogen attack [44]. Further analysis is needed to confirm the involvement of this gene in resistance/tolerance to SCN. A lipase (class 3) was also found on chromosome 6. Lipases have been demonstrated to assist with plant defense mechanism against pathogen [45]. In addition, a methyltransferase-like protein gene was identified in a genomic region belonging to chromosome 2, which was in the vicinity of a significant SNP (Gm03_3, 334,303_C_A) associated with chlorophyll content tolerance to SCN. This protein modulates gene expression [44]. Additional studies are required to understand the involvement of this methyltransferase gene in enhancing chlorophyll content tolerance to SCN in soybean plants.
Genomic selection has recently become more and more popular in modern and large-scale crop breeding programs. Previous studies highlighted that GS allowed for a more robust prediction of genotypic values compared to QTL [46]. In addition, GS has been proven to be more powerful compared to Marker-Assisted Selection (MAS) when dealing with traits controlled by genes having small effects [47]. However, little has been done with respect to investigating GS accuracy for SCN resistance/tolerance. In this study, average GS accuracy among different GS models was 0.31, 0.46, and 0.35 for CCI under non-SCN infestation, CCI in the SCNinfected plants, and reduction of CCI by SCN when all marker sets were used. When GWAS-derived SNPs was used, average GS accuracy was 0.55, 0.76, and 0.68 for the three aforementioned traits. A GS accuracy averaging 0.46 for SCN resistance based on SCN female count and using different GS models was previously reported [12]. The results from the GS involving CCI was in agreement with that of found by previous investigations [12] despite the fact that two different phenotypes, leaf chlorophyll content and female count, were used. In addition, GS accuracy involving linear models (rrBLUP, gBLUP, RF) performed almost similarly as those using more sophisticated approach (Bayesian LASSO) and machine learning strategy (Additional file 2: Table S2). Similar findings were reported [12].

Conclusions
A total of 172 soybean genotypes were evaluated for the effect of SCN HG Type 1.2.3.5.6.7 (race 4) on soybean leaf chlorophyll content in the greenhouse. The leaf chlorophyll content indices (CCI) were used as the phenotypic data and 4089 filtered and high-quality SNPs obtained from the Soy6K SNP Infinium Chips as the genotypic data for the GWAS and GS analysis. A total of 22, 14, and 16 SNP markers were associated with CCI of non-SCN-infected plants, SCN-infected plants, and reduction of CCI by SCN, respectively. The average GS accuracy ranged from 0.31 to 0.46 with all SNPs and varied from 0.55 to 0.76 when GWAS-derived SNP markers were used across five GS models. The SNP markers from this study could be used to improve the soybean leaf chlorophyll content tolerance to SCN infection through MAS and GS. Further study is needed to investigate the translation of a reduction in chlorophyll content to yield loss under SCN infestation.

Plant materials and phenotyping
A total of 172 soybean genotypes were used for this study (Additional file 1: Table S1). This panel of lines was part of the panel of 282 lines selected by Bao from the University of Minnesota soybean breeding program using PediTree for the previous study of association mapping (AM) and genomic selection (GS) of SCN resistance by Bao et al. (2015) [12]. Most of the lines were susceptible to SCN in terms of SCN female counts of HG Type 0 (race 3), but there were six resistant, six moderately resistant, and four moderately susceptible lines (Additional file 1: Table S1). The few resistant lines contained SCN resistance genes from PI 88788 [12]. The plant materials listed in the Additional file 1: Table S1 are preserved in the soybean breeding program in the University of Minnesota, and the lines with PI numbers are available in the United States Department of Agriculture Germplasm Resources Information Network (USDA GRIN).
The phenotyping experiment was conducted in the greenhouse at the University of Minnesota St. Paul campus. A soil without SCN infestation was collected from a soybean field. The soil was mixed and divided to four lots. For each replicate, a lot of soil was further thoroughly mixed, and then divided to 1-kg lots that were placed in 1-galon plastic bags. In each bag, 500 g sand were added for increase of drainage. The soil of each bag was used for one pot.
Soybean cyst nematode HG Type 1.2.3.5.6.7 (race 4) was cultured on susceptible soybean 'Sturdy'. Race 4 can reproduce well on the lines containing resistance genes from PI 88788. Eggs were collected from the soybean roots and soil. The eggs (80,000 eggs/bag ≈ 10,000 eggs/ 100 cm 3 soil) in 10 ml water were added into the soil in the plastic bag before planting. For the pots without SCN eggs, 10 ml water was added into the bags. The soil-sand with or without SCN eggs were mixed in each bag. About 83% of the soil from each bag was placed in a pot. Ten soybean seeds were placed on the surface each pot and the seeds were covered with the remaining soil. The pots were placed in the greenhouse benches. The two pots (SCN and no-SCN) of the same soybean line were placed together to minimize the environmental difference between the SCN and no-SCN treatments within a genotype. Each line was replicated four times. Due to the large number of lines, this experiment was conducted at four different times with approximately 60 lines per time in the same greenhouse. Although pots of each replicate were placed in a block, the experiment was considered randomized design because the lines were evaluated in four groups at four different times. Leaf chlorophyll content indices (CCI) were recorded using SPAD 502 DL Meter (Minolta) on the second trifoliate leaves of 8-week old soybean plants. A  Descriptive statistics were generated using the option 'Tabulate' of JMP Genomics®7 (SAS Institute Inc., Cary, NC, USA). Data were visualized through combined violin and boxplots using the packages 'ggplot2', 'labeling' and 'gridExtra' of R 3.3.0. Data were analyzed using PROC GLM of SAS®. 9.4. The statistical model for the analysis was the following.
Y ij ¼ μ þ Gi þ ε ij with i ¼ 1; 2; :::; 172 and j ¼ 1; 2; 3; 4 Y ij denoted the response of the i th genotype at the j th replication, G i represented the effect of the i th genotype (assumed to have fixed effect), and ε ij was the experimental error associated with the ij th observation.
Genotyping DNA was extracted from young leaves of each accession using DNeasy 96 Plant Kit (QIAGEN, Valencia, CA). The DNA samples were genotyped using an Illumina GoldenGate SNP assay. A total of 4252 SNPs obtained from the Soy6K SNP Infinium Chips (https://www.soybase.org/snps/download.php) were used in the genotyping. SNP data having more than 10% missing data, more than 20% heterozygous SNPs, and minor allele frequency less than 4% were removed from the analysis. After SNP filtering, a total of 4089 high-quality SNPs were used for further analysis.

Genome-wide association study (GWAS)
GWAS was performed using a Fixed and Random Model Circulating Probability Unification (FarmCPU) in R software as previously described [48]. FarmCPU was shown to have an enhanced statistical power when running for GWAS [49]. Both fixed (FEM) and random effects (REM) were included in the model and run iteratively until no new pseudo QTNs were established. The model was described as following [48].
where y i represented the phenotypic data obtained from the i th individual, M ij 's denoted the pseudo QTNs, b j 's were the effect of the j th pseudo QTN, S ik denoted the k th SNP corresponding to the i th individual, and e i was the random error for the i th observation such that e ĩ N(0, σ 2 e ).

(b)
REM : where y i was the phenotype corresponding to the i th individual, u i denoted the total genetic effect (random effect) for the i th individual with a variance-covariance matrix defined as 2Kσ 2 a , σ 2 a was an unkown genetic variance and K was the Kinship generated from the pseudo QTNs, and e i was the residual such that e iÑ (0, σ 2 e ). Estimate of the variance-covariance matrix was computed using a Singular Value Decomposition (SVD) of the pseudo QTNs based upon the FaST-LMM (Factored Spectrally Transformed Linear Mixed Model) algorithm.

Candidate gene(s) discovery
Significant SNPs (LOD > 2.0) [50] postulated from GWAS were used for candidate gene search. A 10-kb genomic region flanking each SNP was considered. Functional annotation of candidate genes was investigated in Soybase (www.soybase.org).

SNP selection accuracy and selection efficiency
SNP selection accuracy and selection efficiency were computed based on the formulas previously developed [33]. The top 57 performers and the 57 least performers for each trait were chosen.

Genomic selection (GS)
Genomic selection was conducted using all 4089 SNPs and the SNPs showing association (LOD > 2.0) [50] with the traits of interest, respectively. Genomic estimated breeding value (GEBV) was estimated using 5 different statistical models described as following.

Ridge regression best linear unbiased predictor (rrBLUP)
The rrBLUP model was y = WGβ + ε [25] where y is the vector phenotype, β was the marker effect with β~N(0, Iσ 2 β ), W was the incidence matrix relating the genotype to the phenotype, G was the genetic matrix, and ε was the random error. The solution for the rrBLUP equation was defined by β^=(Z T Z + Iλ) −1 Z T y with Z = WG. The ridge parameter was described as λ = σ 2 e /σ 2 β with σ 2 e being the residual variance and σ 2 β the marker effect variance. rrBLUP was performed using the 'rrBLUP' package of R [51].
Genomic best linear unbiased predictor (gBLUP) [52] The gBLUP model was y r = X r β + Z r μ r + ε r where the 'r' subscript referred to the genotypes involved in the reference panel, y r was the vector phenotype, β was the genetic effect being assumed to be fixed, X r was the incidence matrix relating β to y r , μ r denoted the polygene random additive effect with μ r~N (0, Kσ 2 a ) where K was the Kinship matrix and σ 2 a the additive genetic variance, ε r was the random error with ε r~N (0, Iσ 2 e ) where I was an identity matrix and σ 2 e was the residual variance.
The Kinship matrix was divided into reference and inference panel as described below.
where K rr was the variance-covariance matrix for the reference group, K ii represented the variance-covariance matrix for the inference group, and K ir = (K ri )' denoted the covariance matrix between individuals from the reference and inference groups, respectively. The predicted genetic effect in the inference panel was obtained using the following formula [53].
where u i denoted the polygene effect in the inference group, and K ir , K rr , and μ r were previously described. gBLUP was performed using GAPIT [54].
Bayesian least absolute shrinkage and selection operator (Bayesian LASSO) Bayesian LASSO was a modified version of LASSO regression. In Bayesian LASSO, posteriors related to the genetic and residual variances were Exponential and Multivariate Normal, respectively. The statistical model was described as following [55]. y = μ + Xg + ε where y was the vector phenotype, μ denoted the overall mean, X represented the SNP matrix, g was the vector of random effect due to SNPs, ε represented the vector of random residuals, the posterior distribution of g was defined by g|λ~∏ j (λ/2)exp.{− λ|g j |} with λ~Unif(0,1,000,000) being the λ prior, and the posterior distribution of ε|σ 2 e~M VN(0, Iσ 2 e ) with σ 2 e~I nvχ 2 (4) being the prior distribution for σ 2 e . Bayesian LASSO was done in R using the package 'BGLR' [56] with burn-ins and iterations of Markov-Chain Monte Carlo (MCMC) equal to 5000 and 1000, respectively [57].

Random Forest
Random forest regression was based upon on unpruned tree decision [58]. In random forest regression, a new split was obtained from a Bootstrap sample generated from the training set. Splitting at the tree node level was based upon randomly selected subsets of predictors. The prediction of a new observation x((F^r f B (x)) was the mean outcomes obtained from B trees defined by {T(x, ψ b )} 1 B . Therefore, the prediction function was described as following.
F^=(1/B)* Σ b T(x, ψ b ) with b = 1….B and ψ b denoted the b th Random Forest tree defined by the split variables, cutpoints at each node, and the terminal node. Random Forest regression was established in R using the package 'randomForest' [59]. A total of 500 trees and 4 branches were used [12].

Support vector machines
Support Vector Machines (SVMs) have been recently widely used in genomic selection-related studies. This is a kernel-based supervised machine learning approach with a regression equation described as following [60]. y = f(X|β) + ε with f(X|β) = Σ j β j K h (X, X j ) being the kernel generating function. In this study, a Gaussian kernel was used. SVMs regression was performed in R using the package 'kernlab' [61].

Cross-validation
A five-fold cross validation was performed for the genomic selection study [62]. The association panel was randomly divided into 5 disjoint groups. A total of 4 subsets were used as training set, and the remaining set was considered testing set. A total of 100 replications were conducted at each fold. Mean and standard errors corresponding to each fold were computed. Genomic selection accuracy was obtained by computing the Pearson's correlation coefficient between GEBV and the observed phenotype for the testing set as previously described by [62].