Large polyploid genomes such as B. napus and wheat present a challenge for SNP discovery because of the presence of multiple homoeologous sequences [25, 31]. Allelic variants need to be distinguished from non-allelic (paralog) variants (nucleotide polymorphisms between paralogs/homoeologs or between the A and C genomes) which present as false SNPs. In addition, the repetitive nature of the polyploid genomes has been one of the major obstacles to SNP discovery. In this study, three conditions were utilized to identify putative “simple SNPs”.
Firstly, low-copy DNA regions were identified using uniquely aligned reads, i.e. reads that fell outside repetitive DNA regions and mapped to only one position in the B. rapa and B. oleracea reference sequences. Sequenced reads were classified into three categories: ‘uniquely aligned’, ‘repeatedly aligned’ and ‘unaligned’. Here, the ‘repeatedly aligned’ category represents duplicated loci across the allopolyploid B. napus genome. The ‘unaligned’ category may be partially derived from novel sequences induced by events such as genome rearrangements or transposon activity. Hence, only the uniquely aligned single-hit reads were retained for further analysis. Secondly, only homozygous loci were selected for subsequent analysis in each individual, because apparently heterozygous loci could be attributable to polymorphism between homoeologous chromosomes rather than to allelic heterozygosity. Thirdly, only loci with a read depth ≥ 4 were used for SNP discovery, in order to exclude SNPs generated by sequencing error. Generally speaking, the minimum recommended read depth is ≥ 3 per genotype.
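The three conditions can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Call` record, its field names, the locus labels and the rule that a locus becomes a candidate when at least two different homozygous alleles survive filtering are all assumptions made for illustration only.

```python
from dataclasses import dataclass

MIN_DEPTH = 4  # depth threshold stated in the text


@dataclass
class Call:
    """Hypothetical per-accession genotype call at one locus."""
    locus: str      # illustrative label, e.g. "A01:100"
    alignment: str  # "unique", "repeat" or "unaligned"
    genotype: str   # two characters, e.g. "AA" or "AG"
    depth: int      # number of reads covering the locus


def passes_filters(call):
    """Apply the three conditions: unique alignment, homozygosity, depth >= 4."""
    is_unique = call.alignment == "unique"
    is_homozygous = len(call.genotype) == 2 and call.genotype[0] == call.genotype[1]
    has_depth = call.depth >= MIN_DEPTH
    return is_unique and is_homozygous and has_depth


def candidate_snps(calls):
    """Return loci where the retained calls carry >= 2 different alleles.

    `calls` is an iterable of (accession, Call) pairs. The ">= 2 alleles"
    rule is an assumption of this sketch, not a detail taken from the study.
    """
    alleles = {}
    for _accession, call in calls:
        if passes_filters(call):
            alleles.setdefault(call.locus, set()).add(call.genotype[0])
    return sorted(locus for locus, seen in alleles.items() if len(seen) >= 2)


example_calls = [
    ("acc1", Call("A01:100", "unique", "AA", 6)),
    ("acc2", Call("A01:100", "unique", "GG", 5)),
    ("acc1", Call("C03:200", "unique", "AG", 9)),   # heterozygous, dropped
    ("acc2", Call("C03:200", "unique", "AA", 9)),
    ("acc1", Call("A05:300", "repeat", "CC", 8)),   # repeatedly aligned, dropped
    ("acc2", Call("A05:300", "unique", "TT", 8)),
    ("acc1", Call("C07:400", "unique", "CC", 3)),   # depth below 4, dropped
    ("acc2", Call("C07:400", "unique", "TT", 10)),
]
print(candidate_snps(example_calls))
```

In this toy input only the first locus keeps two distinct homozygous alleles after filtering, which shows how each condition removes candidates and why the resulting SNP count is conservative.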
A total of 892,803 SNPs were identified among the ten accessions of B. napus, using a stringent filtering approach that favoured high-quality SNPs over exhaustive SNP sampling, in order to provide a resource of immediate value for crop improvement. Consequently, the actual frequency of SNPs between these accessions is likely to have been underestimated, owing both to the stringent filtering and to the exclusion of duplicated DNA.
In the present study, approximately 55% of SNPs were located on the A genome and 45% on the C genome. By contrast, Bancroft et al. identified 15,559 SNPs on the A genome and 5,675 SNPs on the C genome: the bias towards A-genome SNPs was far more pronounced than in the present study. The genetic distance between the Ningyou7 and Tapidor C genomes is likely to be narrow, although these two genotypes were selected on the basis of their genetic dissimilarity, contrasting trait characteristics and different cultivation ranges. Nevertheless, the results of the present study agree that the A genome appears more variable than the C genome. Uneven distribution of SNPs throughout the genome is common, and has also been observed in the Brassica relative Arabidopsis thaliana.
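To put the two A-genome biases on the same scale, the counts reported by Bancroft et al. can be converted to proportions and compared with the approximate 55%/45% split reported here. The short Python sketch below performs only this arithmetic; the function and variable names are illustrative.

```python
def genome_shares(a_count, c_count):
    """Return (A share, C share, A:C ratio) for a pair of SNP counts or percentages."""
    total = a_count + c_count
    return a_count / total, c_count / total, a_count / c_count


# SNP counts quoted in the text for Bancroft et al.
bancroft_a, bancroft_c, bancroft_ratio = genome_shares(15559, 5675)

# Approximate percentages quoted in the text for the present study
present_a, present_c, present_ratio = genome_shares(55, 45)

print(f"Bancroft et al.: A = {bancroft_a:.1%}, C = {bancroft_c:.1%}, A:C = {bancroft_ratio:.2f}")
print(f"Present study:   A = {present_a:.1%}, C = {present_c:.1%}, A:C = {present_ratio:.2f}")
```

On these figures about 73% of the Bancroft et al. SNPs lie on the A genome (an A:C ratio of roughly 2.7), against roughly 55% (a ratio of about 1.2) in the present study.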
A total of 36,458 SNPs predicted to cause non-synonymous amino acid substitutions were identified in this study. These SNPs may represent causal genetic variation contributing to phenotypic variation. Using this SNP set for genome-wide association analysis in B. napus would be more efficient for identifying causal gene mutations than using a general SNP set. GO analysis in the present study suggested that the genes predicted to contain non-synonymous SNPs were more commonly associated with binding and catalytic activity than with other functions. This may suggest that proteins with binding and catalytic functions play a significant role in adaptive evolution.
A 96-SNP GoldenGate assay can be used successfully for SNP genotyping in B. napus, despite the high number of paralogous sequences in this polyploid species. Figure 3 shows an example of a putative simple SNP (SNP RP13) in the two mapping populations. If a SNP were a hemi-SNP with one homozygous locus, the genotyped samples of the mixed populations would cluster into four or more groups (Figure 4). In the present study, when the results from the two populations were pooled for SNP chip analysis, only 3/96 (3%) of SNPs showed four genotype clusters, suggesting that these were hemi-SNPs with genotype-specific heterozygosity in the additional amplified region. However, validating these SNPs over a wider range of accessions would be valuable in determining what proportion of SNPs are simple SNPs and what proportion are hemi-SNPs.
Although 892,803 SNPs have been developed, there are still some limitations to this work. Ten accessions proved a productive number for SNP discovery; however, increasing the number of resequenced accessions would improve the efficiency of polymorphic SNP discovery. In addition, eight of the accessions used were semi-winter-type B. napus, and therefore the effect of these SNPs in spring-type and winter-type B. napus needs to be further validated. Ten SNPs were randomly selected for sequencing and validation in ten lines: 97% of the sequenced SNP loci matched the prediction. The high validation ratio may have resulted from the stringent filtering conditions. Ninety-six SNPs were tested on a genotyping platform, and most polymorphic SNPs showed segregation distortion. The segregation distortion may have resulted from selection bias for particular alleles during the process of population construction (e.g. microspore culture to produce the DH population). It is also possible that, for some of these SNPs, genotyping using the GoldenGate assay resulted in the incorrect grouping of multiple genotype clusters together (e.g. AAAB and AABB), which would result in distorted segregation ratios. However, multi-locus genotypes are usually clearly identifiable in Genome Studio by the presence of additional separate clusters, so it is more likely that the segregation distortion observed was due to selective pressure for one or the other parental allele under population growth conditions. Future work could include validation of the genomic location of these SNPs by designing and using arrays in large mapping populations originating from diverse B. napus parent genotypes.
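Segregation distortion at a single SNP is commonly assessed with a chi-square goodness-of-fit test. The Python sketch below assumes a DH population in which the two parental homozygous genotypes are expected in a 1:1 ratio. The genotype counts are hypothetical, and the test shown is a generic illustration rather than the procedure used in this study.

```python
import math


def segregation_chi_square(count_a, count_b):
    """Chi-square test (1 degree of freedom) of a 1:1 ratio of two genotype classes.

    Returns (chi_square, p_value). For 1 degree of freedom the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(x / 2)).
    """
    total = count_a + count_b
    if total == 0:
        raise ValueError("at least one genotyped line is required")
    expected = total / 2
    chi_square = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical counts of DH lines carrying each parental allele at one SNP
chi_square, p_value = segregation_chi_square(60, 40)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

With these hypothetical counts the statistic is 4.0 and the p-value falls just below 0.05, so the marker would be flagged as distorted at the conventional 5% level. When many SNPs are tested at once, a correction for multiple testing would normally be applied.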