SNP Prediction and Validation
Multiple methods were initially tested for SNP prediction and filtering. We used the GMAP and Maq mapping software. The GMAP software directly compares predicted bases in short-read query and genomic reference sequences, while the Maq software additionally uses the quality scores of the short-read sequences and reports the consensus quality of each SNP call. We made three validation runs on a preliminary 4× scaffold assembly and one "production" run on the preliminary 6.5× scaffold assembly. The validation runs used GMAP with multiple short sequence reads to predict SNPs, and Maq with either multiple or single short sequence reads to predict SNPs. The production run predominantly used multiple short sequence reads to predict SNPs, with SNPs predicted from a single short sequence read included where necessary in areas of particular interest. Because only one of the programs uses quality scores, the parameter sets are not trivially comparable. However, both programs were tunable for these objectives. For each of the tested methods, we picked parameter sets with the general goals of predicting unique, polymorphic, well-supported SNPs, given the constraints of available sequence and the need to generate markers in certain marker-poor regions.
The first SNP prediction method, used for testing and validation, involved the GMAP software for alignment of unique PI 468916 short sequence reads to the Williams 82 preliminary 4× scaffold assembly, with SNPs predicted from the alignments. The preliminary 4× scaffold assembly was a "test run" produced by the sequencing consortium for an initial assessment of the assembly characteristics of the genome. We also used this early assembly to test and validate SNP prediction from next-generation sequencing reads with different software. The GMAP alignment method, with stringent match criteria (using only high-quality reads, unique mappings, and multiple-read SNP support; details in Materials and Methods), produced a total of 10,778 predicted SNPs. The validation set for the GMAP method consisted of 635 primer pairs designed to flank each candidate SNP. Of these 635 primer pairs, a total of 535 produced a sequence-tagged site with high quality sequence surrounding the predicted SNP, and 456 of the 535 produced amplicons containing the predicted SNP, which constituted an 85% validation rate.
The second SNP prediction method employed the Maq mapping and assembly software to align unique PI 468916 short sequence reads to the preliminary 4× scaffold assembly and predict SNPs from that assembly. The Maq analysis method, when used with selected parameters (minimum consensus-base quality of 20, unique read placement; details in Materials and Methods), produced a total of 25,047 predicted SNPs, each predicted using one or more short sequence reads aligned to one position within the genome. The first validation set for the Maq procedure consisted of 48 primer pairs designed to flank SNPs predicted from two or more short sequence reads. Of these 48 primer pairs, a total of 40 produced a sequence-tagged site (STS) with high quality sequence surrounding the predicted SNP, and 37 of the 40 produced amplicons containing the predicted SNP, which translated into a 92.5% validation rate.
We also tested the validation rate of predicting SNPs with a single short read. For this test, the Maq analysis method above was used, but without the requirement for multiple read support of predicted SNPs. A total of 48 primer pairs were designed to flank SNPs predicted from only one short sequence read. Of the 48 primer pairs tested, 43 produced an STS with high quality sequence surrounding the predicted SNP, and 34 of those 43 produced amplicons that contained the predicted SNP, which resulted in a 79% validation rate.
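The three validation rates reported above follow directly from the stated counts. As a minimal sketch of the arithmetic, assuming that the rate is defined as the number of amplicons containing the predicted SNP divided by the number of primer pairs that yielded a high-quality STS (the dictionary keys and function name are illustrative, not part of the original analysis):

```python
# Counts taken from the three validation runs described in the text.
validation_sets = {
    "GMAP, multiple reads": {"primer_pairs": 635, "sts": 535, "confirmed": 456},
    "Maq, multiple reads": {"primer_pairs": 48, "sts": 40, "confirmed": 37},
    "Maq, single read": {"primer_pairs": 48, "sts": 43, "confirmed": 34},
}


def validation_rate(sts, confirmed):
    """Percentage of high-quality STSs whose amplicon contained the predicted SNP."""
    return 100.0 * confirmed / sts


rates = {
    name: validation_rate(counts["sts"], counts["confirmed"])
    for name, counts in validation_sets.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

Note that the denominator is the number of successful STSs, not the number of primer pairs designed, so primer failures do not count against the SNP prediction method.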
To identify SNPs best suited for GoldenGate assay design and for anchoring and orienting the preliminary 6.5× scaffold assembly, we used a modification of the Maq method described in the validation tests above. In this "production" run, we required a consensus score of at least 27, and that the flanking 120 bases be at least 2/3 non-repetitive sequence (see Materials and Methods for implementation). Because of the high (79%) validation rate for SNPs called from single short sequence reads using Maq, these SNPs were also included in this dataset so that additional markers on more of the smaller unmapped scaffolds could be anchored to the genetic map. In total, 7,108 SNPs were predicted for use in anchoring and orienting additional scaffolds. Ultimately, 1,536 SNPs were selected from this pool of 7,108 SNPs to create the higher resolution map needed for validating scaffold ends, anchoring additional scaffolds, and resolving ambiguous orientations of anchored scaffolds, all of which required a map with markers placed in the gaps present within the soybean Consensus Map 4.0. These 1,536 SNPs were used to create an Illumina GoldenGate soybean oligo pool all (SoyOPA-4) [Additional file 1]. The SoyOPA-4 produced 1,254 successful GoldenGate assays, indicating that the predicted SNPs had a combined validation and assay conversion rate of 81.6%.
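The two production-run criteria can be illustrated with a short sketch. This is not the implementation described in Materials and Methods; it assumes that repeats are soft-masked as lowercase letters in the scaffold sequence and that the 120 flanking bases are split as 60 on each side of the SNP. Both are assumptions made for illustration, and all function and parameter names are hypothetical.

```python
def count_nonrepetitive(flank):
    """Count uppercase A/C/G/T bases, i.e. bases that are neither soft-masked nor N."""
    return sum(1 for base in flank if base in "ACGT")


def passes_production_filter(consensus_score, scaffold_seq, snp_index,
                             min_score=27, flank_each_side=60):
    """Apply the consensus-score and flank-repetitiveness criteria to one SNP.

    Assumption: lowercase letters mark repetitive sequence, and the flank is
    taken as `flank_each_side` bases on each side of the SNP (120 in total).
    """
    if consensus_score < min_score:
        return False
    left = scaffold_seq[max(0, snp_index - flank_each_side):snp_index]
    right = scaffold_seq[snp_index + 1:snp_index + 1 + flank_each_side]
    flank = left + right
    if len(flank) < 2 * flank_each_side:
        return False  # SNP too close to a scaffold end for a full flank
    # Integer comparison avoids floating-point issues at exactly 2/3.
    return 3 * count_nonrepetitive(flank) >= 2 * len(flank)


# Toy sequences (hypothetical): SNP at index 60 with 60 bases on each side.
clean = "A" * 60 + "G" + "C" * 60
half_masked = "a" * 60 + "G" + "C" * 60
borderline = "a" * 40 + "A" * 20 + "G" + "C" * 60  # exactly 80/120 unmasked

print(passes_production_filter(30, clean, 60))
print(passes_production_filter(30, half_masked, 60))
print(passes_production_filter(30, borderline, 60))
```

Rejecting SNPs whose flank runs off the scaffold end is a design choice of this sketch; a real pipeline might instead accept a shorter flank.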
High Resolution Genetic Map
SoyOPA-4 was used to genotype 470 F5-derived RILs from the Williams 82 × PI 468916 (W82 × 468) population with the 1,254 polymorphic SNPs. A total of 26 of the 470 RILs were excluded from further analysis because their marker heterozygosity levels exceeded 20%, which suggested that those 26 RILs trace to outcrosses occurring during generation advance rather than being true F5-derived lines. To tie the newly constructed high resolution genetic map to the existing Soybean Consensus Map 4.0, SoyOPA-3 was used to genotype a subset of 282 RILs from the W82 × 468 population. Of these 282 genotyped RILs, 14 were among the 26 RILs already excluded for high heterozygosity, and their genotype data were likewise eliminated. SoyOPA-3 was one of three previously designed custom soybean OPAs developed and tested by Hyten et al. [2, 7]. SoyOPA-3 contains 1,396 SNPs that had been mapped on the Soybean Consensus Map 4.0 and was developed so that every SNP included in the OPA was polymorphic within at least one of the three RIL mapping populations used to create the Soybean Consensus Map 4.0. Of the 1,396 SNPs on SoyOPA-3 that mapped to the Soybean Consensus Map 4.0, a total of 565 were polymorphic between Williams 82 and PI 468916.
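The heterozygosity-based exclusion of RILs can be sketched as follows. The genotype encoding ("A" and "B" for the two parental homozygotes, "H" for heterozygous calls, "-" for missing calls), the choice to ignore missing calls, and the toy data are all assumptions made for illustration; only the 20% threshold comes from the text.

```python
def heterozygosity(genotypes):
    """Fraction of called genotypes that are heterozygous; missing calls are ignored."""
    called = [g for g in genotypes if g != "-"]
    if not called:
        return 0.0
    return sum(1 for g in called if g == "H") / len(called)


def split_by_heterozygosity(ril_genotypes, max_het=0.20):
    """Return (kept, excluded) RIL names; a RIL is excluded when heterozygosity > max_het."""
    kept, excluded = [], []
    for ril, genotypes in ril_genotypes.items():
        if heterozygosity(genotypes) > max_het:
            excluded.append(ril)
        else:
            kept.append(ril)
    return kept, excluded


# Hypothetical genotype calls for three RILs across ten markers.
toy_rils = {
    "RIL_001": list("AABBAABBAB"),  # 0/10 heterozygous
    "RIL_002": list("AHBBAABBA-"),  # 1/9 heterozygous
    "RIL_003": list("AHHBHABHAB"),  # 4/10 heterozygous, likely an outcross
}

kept, excluded = split_by_heterozygosity(toy_rils)
print("kept:", kept)
print("excluded:", excluded)
```

Applied to the population described above, a filter of this kind would leave 470 − 26 = 444 RILs for mapping.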
A total of 550 of the 565 SNPs from SoyOPA-3 and 1,240 of the 1,254 SNPs from SoyOPA-4 were mapped using 444 RILs to create 20 linkage groups that correspond to the 20 chromosomes of soybean, with an estimated total genetic length of 2,537 cM. The remaining 29 SNPs were not linked to any of the 20 linkage groups. The average level of heterozygosity observed in the population was 6.3%, which is the expected level of heterozygosity for a RIL population in the F5 generation. Segregation distortion was observed for multiple tightly linked markers in 16 regions throughout the genome [Additional file 2].
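Both the expected heterozygosity and the marker tally in this paragraph can be checked with simple arithmetic. The sketch below assumes strict self-fertilization without selection, starting from an F1 that is heterozygous at every segregating locus, so that heterozygosity halves in each generation; the function name is illustrative.

```python
def expected_heterozygosity(generation):
    """Expected heterozygous fraction in generation F<generation> under strict selfing."""
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or later")
    return 0.5 ** (generation - 1)


f5_expected = expected_heterozygosity(5)
print(f"Expected F5 heterozygosity: {f5_expected:.2%}")  # 6.25%

# Marker tally from the counts given in the text.
mapped = 550 + 1240                     # SoyOPA-3 + SoyOPA-4 SNPs placed on the map
unlinked = (565 - 550) + (1254 - 1240)  # SNPs not linked to any linkage group
print(mapped, unlinked)                 # 1790 29
```

The expected value of 6.25% is consistent with the observed average of 6.3%.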
To determine whether mapping SoyOPA-3 on only a subset of the 444 RILs caused any significant expansions or contractions of the genetic map, a separate map containing only the SNPs from SoyOPA-4 was created [Additional file 2]. Comparing the map created using only the 1,240 SoyOPA-4 SNPs to the map created using all 1,790 SNPs from SoyOPA-3 and SoyOPA-4 revealed only one substantive change, of 22 cM, on chromosome 5. This change arose because eliminating the SoyOPA-3 SNPs left a gap of 66 cM between adjacent SoyOPA-4 SNPs. A total of 16 discrepancies between the two maps were 2 to 10 cM, and all other discrepancies were less than 2 cM in genetic distance between the SoyOPA-4 SNP markers.
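A comparison of this kind can be sketched by computing, for each pair of adjacent shared markers, the difference in interval length between the two maps and then binning those differences. This is an illustrative reconstruction, not the procedure used for the published comparison; the marker names, positions, and exact bin boundaries are hypothetical.

```python
def interval_discrepancies(map_a, map_b):
    """Absolute differences (cM) between adjacent-marker intervals in two maps.

    Each map is a dict of marker name -> position in cM on one chromosome.
    Only markers present in both maps are used, ordered by position in map_a.
    """
    shared = sorted((m for m in map_a if m in map_b), key=lambda m: map_a[m])
    diffs = []
    for left, right in zip(shared, shared[1:]):
        interval_a = abs(map_a[right] - map_a[left])
        interval_b = abs(map_b[right] - map_b[left])
        diffs.append(abs(interval_a - interval_b))
    return diffs


def bin_discrepancies(diffs):
    """Count discrepancies that are < 2 cM, 2 to 10 cM, and > 10 cM."""
    bins = {"<2 cM": 0, "2-10 cM": 0, ">10 cM": 0}
    for d in diffs:
        if d < 2:
            bins["<2 cM"] += 1
        elif d <= 10:
            bins["2-10 cM"] += 1
        else:
            bins[">10 cM"] += 1
    return bins


# Hypothetical positions (cM) for four markers on one chromosome.
opa4_only_map = {"snp1": 0.0, "snp2": 5.0, "snp3": 12.0, "snp4": 40.0}
combined_map = {"snp1": 0.0, "snp2": 5.5, "snp3": 15.5, "snp4": 21.5}

diffs = interval_discrepancies(opa4_only_map, combined_map)
print(diffs)
print(bin_discrepancies(diffs))
```

Comparing interval lengths rather than absolute positions keeps a single local expansion from being counted again at every downstream marker.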