Comparison of different approaches for detecting population structure
Knowledge about the patterns of population structure is essential for efficient germplasm organization. Therefore, various approaches have been developed for this purpose. The method implemented in the software STRUCTURE is one of the most frequently used approaches. However, when dealing with thousands of individuals and markers, the high computational requirements of STRUCTURE analyses make it impractical [36]. Instead, PCA, PCoA, as well as LAP have the potential to extract the fundamental structure of a dataset without assuming any population genetic model [18, 19]. Furthermore, as these methods are not computationally intensive, they might be possible alternatives for detecting population structure.
These approaches, however, do not allow direct statistical inference about the number of subgroups. Furthermore, they do not define the assignment of inbreds to subgroups. MCLUST, in contrast, can determine the number of subgroups as well as the cluster membership probabilities simultaneously without genetic assumptions [21]. Nevertheless, in our study MCLUST applied directly to the raw marker data had only low power to identify population structure (data not shown). This might be because each of the many markers explains only a small part of the population structure information. To overcome this problem, MCLUST was applied in our study to principal components (PC), principal coordinates (PCo), or lapvectors.
The number of subgroups (from 1 to 15) was examined by MCLUST based on 1-150 PC, PCo, and lapvectors. Our results suggested that the number of subgroups varied between one and nine (Additional file 3). The number of subgroups showed a high variability if fewer than 20 PC, PCo, or lapvectors were used, which together explained less than 75% of the variance. However, when the number of PC was higher than 50, the number of subgroups started to vary again (Additional file 4). The explanation for this observation is unclear and requires further research. These findings suggested that determining the number of subgroups using MCLUST applied to PC, PCo, or lapvectors is not straightforward and requires careful consideration of the number of dimensions used for the analyses.
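The BIC-based choice of the number of subgroups can be illustrated with a minimal sketch. It is not the MCLUST implementation (an R package with many covariance models); it assumes a single hypothetical PC score per inbred, a one-dimensional Gaussian mixture with a shared variance fitted by EM, and the sign convention in which a larger BIC is better. The function name `gmm_bic_1d` and the toy scores are illustrative only.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)

def gmm_bic_1d(xs, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture (shared variance) by EM.

    Returns (bic, hard_labels); BIC = 2*logL - n_params*log(n), larger is better.
    """
    n = len(xs)
    ordered = sorted(xs)
    # Initialise the means from k equal-sized chunks of the sorted scores.
    mus = []
    for j in range(k):
        chunk = ordered[j * n // k:(j + 1) * n // k]
        mus.append(sum(chunk) / len(chunk))
    grand_mean = sum(xs) / n
    var = sum((x - grand_mean) ** 2 for x in xs) / n
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: membership probabilities of every score in every component.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, var) for w, m in zip(weights, mus)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and the shared variance.
        # The floor keeps an emptied component from causing a division by zero.
        sizes = [max(sum(r[j] for r in resp), 1e-12) for j in range(k)]
        weights = [s / n for s in sizes]
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sizes[j]
               for j in range(k)]
        var = sum(r[j] * (x - mus[j]) ** 2
                  for r, x in zip(resp, xs) for j in range(k)) / n
    loglik = sum(log(sum(w * normal_pdf(x, m, var)
                         for w, m in zip(weights, mus))) for x in xs)
    n_params = 2 * k  # k means + (k - 1) weights + 1 shared variance
    bic = 2 * loglik - n_params * log(n)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return bic, labels

# Toy PC scores (hypothetical): two well-separated groups of six inbreds.
offsets = [-0.5, -0.3, -0.1, 0.1, 0.3, 0.5]
scores = [-3 + d for d in offsets] + [3 + d for d in offsets]
bic_by_k = {k: gmm_bic_1d(scores, k)[0] for k in (1, 2, 3)}
```

In practice the mixture would be fitted to several PC, PCo, or lapvectors at once, which is where the sensitivity to the number of dimensions described above arises.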
When the number of subgroups was set to two according to the results of PCA, PCoA, and LAP, we observed for 10-40 PC, 10-50 PCo, and 1-100 lapvectors > 95% correspondence with the germplasm type information (Additional file 4) and > 90% correspondence with the assignment by STRUCTURE (data not shown). The assignments of the above-mentioned methods also corresponded highly (> 85%) with each other (data not shown). These findings suggested that these methods might be time-saving alternatives to STRUCTURE analyses if the assignment of genotypes to subgroups is of interest and the number of subgroups is known.
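One way to quantify such a correspondence of assignment is sketched below. Because subgroup labels are arbitrary (subgroup 1 of one method may be subgroup 2 of another), the agreement is maximised over all relabelings. This is an assumed definition for illustration; the function name `correspondence` and the example labels are hypothetical.

```python
from itertools import permutations

def correspondence(labels_a, labels_b):
    """Fraction of individuals assigned consistently by two methods,
    maximised over all one-to-one relabelings of the groups."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same individuals")
    groups_a = sorted(set(labels_a), key=str)
    groups_b = sorted(set(labels_b), key=str)
    if len(groups_b) > len(groups_a):
        # Map the assignment with fewer groups into the one with more groups.
        return correspondence(labels_b, labels_a)
    best = 0
    for perm in permutations(groups_a, len(groups_b)):
        mapping = dict(zip(groups_b, perm))
        agree = sum(1 for a, b in zip(labels_a, labels_b) if mapping[b] == a)
        best = max(best, agree)
    return best / len(labels_a)

# Hypothetical example: germplasm type vs. cluster index from a mixture model.
germplasm_type = ["sugar", "sugar", "yield", "yield", "yield"]
cluster_index = [2, 2, 1, 1, 2]
agreement = correspondence(germplasm_type, cluster_index)
```

The exhaustive search over permutations is only practical for a small number of subgroups, which is the case when the number is fixed at two.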
Population structure of the elite sugar beet germplasm
Results of earlier studies revealed that cultivated sugar beet genotypes are genetically distinct from wild beet genotypes [37, 9]. Moreover, the results of [6] indicated that the seed and pollen parent heterotic pools of cultivated sugar beet showed two distinct clusters after 40 years of recurrent reciprocal selection. Therefore, in our study, the population structure of one of these two heterotic pools, namely the pollen parent heterotic pool, was examined in further detail.
The results of the STRUCTURE analysis revealed the presence of two subgroups in the entire pollen parent germplasm set (Additional file 1). This observation was in accordance with the clustering observed in the PCA, PCoA, and LAP analyses as well as with the MCLUST analysis and with the number of examined germplasm types (Figure 2, Additional file 2). Furthermore, 99.6% of the inbreds in subgroup 1 based on the MCLUST analysis with 10 PCs were sugar types and 98.5% of the inbreds in subgroup 2 were yield types. The observed pattern of population structure might be explained by the fact that, due to a negative correlation between root yield and sugar content [7], selection on both traits in an originally undifferentiated population could lead to differentiated populations. The observation of distinct subgroups was further made possible by the occurrence of only a few recombination events between the two germplasm types [8]. Nevertheless, the average MRD among all inbreds was higher than that between the two germplasm types. This observation indicated that higher variation existed within the populations than between the populations.
Our explanation is in accordance with the observation that the Illinois long-term selection experiment for grain protein (high vs. low protein) and oil concentration (high vs. low oil) in maize had led to populations that were divergent not only phenotypically but also genotypically [38]. Because germplasm type information was in very good agreement with molecular marker information, sugar type and yield type inbreds were the basis for all further analyses.
Comparison of different numbers of SNPs for detecting population structure
As the SNP number and selection strategy are expected to affect the estimates of population structure (c.f. [14]), we examined these aspects in our study. The correspondence of assignment by MCLUST based on subsets of 9-252 SNPs vs. the whole SNP set improved with an increasing number of SNPs (Figure 3). Similarly, the CV of MRD estimates among all pairs of inbreds decreased with an increasing number of SNPs (Figure 4). This is because a high number of SNPs provides a high precision for determining population structure as well as for measuring the genetic distance between inbreds. When the number of SNPs selected at random or in a stratified fashion reached about 100, the aforementioned trends of the correspondence as well as the CV reached a plateau, and not much further improvement could be obtained by further increasing the number of SNPs. As the costs for genotyping also increase with an increasing number of SNPs, our results indicated that in the examined sugar beet germplasm about 100 SNPs would be required to determine the same population structure as the whole SNP set did and that this estimation would be done with a similar precision.
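For readers who want to reproduce the two quantities, the following sketch computes the modified Rogers' distance (MRD) between two inbreds and a coefficient of variation (CV). It assumes bi-allelic SNPs coded as the frequency of one reference allele (0 or 1 for homozygous inbreds) and the usual form of MRD, the square root of the summed squared allele frequency differences divided by twice the number of loci. How exactly the CV was formed in our resampling is not restated here, so `coefficient_of_variation` is a generic helper; all names and data are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

def modified_rogers_distance(g1, g2):
    """MRD between two genotypes given as reference-allele frequencies per SNP."""
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotypes must cover the same, non-empty SNP set")
    m = len(g1)
    # For a bi-allelic locus the two alleles contribute (a - b)^2 each,
    # because the second allele frequency is 1 minus the first.
    total = sum(2 * (a - b) ** 2 for a, b in zip(g1, g2))
    return sqrt(total / (2 * m))

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

# Hypothetical homozygous inbreds scored at four SNPs.
inbred_1 = [0, 0, 1, 1]
inbred_2 = [0, 1, 1, 0]
distance = modified_rogers_distance(inbred_1, inbred_2)
```

For homozygous inbreds and bi-allelic markers, this MRD reduces to the square root of the proportion of SNPs at which the two inbreds differ, which makes clear why few SNPs give coarse and therefore variable estimates.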
We observed a slightly higher correspondence (Figure 3) as well as lower CV of MRD (Figure 4) for the stratified than for the random resampling strategy. This observation suggested that by choosing markers that are equally distributed across the genome, it is possible to reduce their number compared to randomly distributed markers while achieving the same level of precision in assigning inbreds to subgroups as well as estimating MRD. An even higher correspondence can be obtained with the same number of markers if they were selected with respect to their PIC values (Figure 3). This observation suggested that with SNPs selected for a high PIC value, the number of SNP markers required to determine the same population structure could be further reduced.
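The three marker selection strategies can be sketched as follows. The details are assumptions made for illustration: stratification is taken to mean drawing markers as evenly as possible across linkage groups, and PIC is computed with the standard formula 1 - Σp_i² - ΣΣ 2p_i²p_j². Marker names, linkage group labels, and allele frequencies are hypothetical.

```python
import random

def pic(freqs):
    """Polymorphism information content from the allele frequencies of one marker."""
    homozygosity = sum(p * p for p in freqs)
    cross = sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(len(freqs)) for j in range(i + 1, len(freqs)))
    return 1 - homozygosity - cross

def sample_random(markers, n, rng):
    """Draw n markers without regard to their map position."""
    return rng.sample(markers, n)

def sample_stratified(markers, n, rng):
    """Draw n (name, linkage_group) markers, cycling over linkage groups."""
    if n > len(markers):
        raise ValueError("cannot draw more markers than available")
    by_group = {}
    for name, group in markers:
        by_group.setdefault(group, []).append(name)
    for names in by_group.values():
        rng.shuffle(names)
    chosen = []
    while len(chosen) < n:
        groups = [g for g in by_group if by_group[g]]
        rng.shuffle(groups)
        for g in groups:
            if len(chosen) == n:
                break
            chosen.append((by_group[g].pop(), g))
    return chosen

def select_by_pic(markers, ref_allele_freq, n):
    """Keep the n bi-allelic markers with the highest PIC."""
    def marker_pic(marker):
        p = ref_allele_freq[marker[0]]
        return pic([p, 1 - p])
    return sorted(markers, key=marker_pic, reverse=True)[:n]

# Hypothetical panel: 9 linkage groups with 20 SNPs each.
panel = [(f"snp{g}_{i}", g) for g in range(1, 10) for i in range(20)]
stratified_subset = sample_stratified(panel, 18, random.Random(1))
```

With 18 markers drawn from nine linkage groups, the stratified draw returns exactly two markers per group, whereas a random draw may leave some groups unrepresented.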
The number of SNPs predicted in our study to be required for MRD estimates is considerably lower than that calculated for maize [12]. This observation might be explained by differences in the number of genotypes studied. [12] examined three times more genotypes than we did, which increases the number of markers required to unambiguously identify each genotype. Furthermore, [12] examined 25 times more SNPs than we did, which also increases the number of markers required to achieve a precision similar to that of the whole SNP set.
Genome-wide distribution of genetic diversity
Elite sugar beet germplasm has been intensively selected since the middle of the last century [8]. Consequently, the genomic regions controlling traits of economic importance are expected to be shaped by this selection. Therefore, characterizing the genome-wide distribution of genetic diversity of elite sugar beet germplasm that has been selected for different traits, such as sugar content vs. root yield, might help to identify the genes controlling these traits. A similar approach has been successfully applied to identify a panel of known genes as well as some interesting candidate genes and QTLs in Holstein cattle [22].
We observed an average gene diversity of 0.338 for the entire germplasm set. This finding is in good accordance with results of [37] where a gene diversity of 0.31 was observed in USDA sugar beet gene bank materials assessed with RAPD markers. In contrast, the gene diversity observed in our study was lower than the values reported earlier ([26, 9, 6]), where an average gene diversity of 0.51-0.62 was observed in weed beet and sugar beet populations using SSR markers. This difference might be explained by the examined marker types. SNP and RAPD markers are typically bi-allelic, whereas SSR markers are multi-allelic, which has the potential to increase gene diversity (c.f. [12]).
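The effect of the number of alleles on gene diversity can be made concrete with a short sketch. It uses the plain form of gene diversity, one minus the sum of squared allele frequencies, without a small-sample correction; the function name and the example frequencies are illustrative.

```python
def gene_diversity(freqs):
    """Gene diversity (expected heterozygosity) of one locus: 1 - sum(p_i^2)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)

# A bi-allelic SNP is most diverse at equal allele frequencies.
snp_max = gene_diversity([0.5, 0.5])
# A hypothetical SSR with four equally frequent alleles.
ssr_example = gene_diversity([0.25, 0.25, 0.25, 0.25])
```

With k equally frequent alleles the value is 1 - 1/k, so a bi-allelic marker cannot exceed 0.5, whereas a multi-allelic SSR can. Values of 0.51-0.62 are therefore attainable only with markers that have more than two alleles.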
The average gene diversity of the sugar type inbreds was higher than that of the yield type inbreds (Additional file 5). This observation might be explained by ascertainment bias during SNP development or by a higher selection intensity applied during breeding of yield type inbreds compared to sugar type inbreds. Our explanation was supported by the fact that the effective population size Ne of the yield type inbreds was considerably lower than that of the sugar type inbreds (Table 2), which indicated stronger bottleneck effects for the yield type than for the sugar type inbreds. However, it should be noted that the calculation of Ne assumes idealized populations [34], and that where these idealizations are violated, as in selected populations or with selected SNPs, the calculated Ne will deviate from the true value. Another reason for our finding of a higher gene diversity of the sugar type inbreds compared to the yield type inbreds might be that it is more difficult to introduce new germplasm from exotic sources into the yield types than into the sugar types.
The unequal distribution of genetic diversity across the genome could be explained by ascertainment bias during SNP development. However, more likely, this observation is due to the selection history of the different genome regions. Thus, the genome-wide distribution maps of genetic diversity (Additional files 5 and 6) might be a first step to identify the target genes or regions selected during breeding history. For example, genes related to sugar content and root yield might be present in the most divergent genomic regions between these two germplasm types. Common genes under selection in the breeding programs of both germplasm types (e.g. disease resistance genes) might be present in the genomic regions showing the same level of gene diversity and low MRD (Additional files 5 and 6).
Genome-wide distribution of LD and consequences for association mapping
The power and resolution of association mapping depend greatly on the genome-wide distribution of LD assessed with a high number of markers [39]. We observed that a total of 18.97%, 31.84%, and 32.01% of the linked loci pairs in the entire germplasm set, the yield type inbreds, and the sugar type inbreds, respectively, showed r2 values higher than the significance threshold (Table 1). The percentages observed in our study were lower than those reported earlier [6]. In contrast, the values of our study were higher than those of earlier studies [26, 27, 9], where 1.1%-14.3% of the loci pairs were observed to be in significant LD. These differences might be explained by the facts that (i) different significance thresholds were used, (ii) a rather high marker density was applied in our study compared to earlier studies, (iii) different marker types were used in these studies, i.e. SNPs in our study vs. SSRs or RAPDs in other studies, and (iv) different plant materials were examined, i.e. homozygous elite inbreds of sugar beet in our study and [6] vs. random mating wild beets in other studies.
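As a minimal sketch of the LD measure, r2 between two bi-allelic loci can be computed directly from homozygous inbreds, because each inbred contributes one known two-locus haplotype. The allele coding (0/1), the function names, and the threshold passed to `fraction_above` are assumptions for illustration; the significance threshold used in Table 1 is not reproduced here.

```python
def r_squared(locus_a, locus_b):
    """r2 between two bi-allelic loci scored 0/1 on the same inbreds."""
    if len(locus_a) != len(locus_b) or not locus_a:
        raise ValueError("both loci must be scored on the same inbreds")
    n = len(locus_a)
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    p_ab = sum(1 for a, b in zip(locus_a, locus_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined for a monomorphic locus")
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    return d * d / denom

def fraction_above(r2_values, threshold):
    """Share of loci pairs whose r2 exceeds a given threshold."""
    return sum(1 for v in r2_values if v > threshold) / len(r2_values)

# Hypothetical scores of four inbreds at three SNPs.
snp_1 = [0, 0, 1, 1]
snp_2 = [0, 0, 1, 1]
snp_3 = [0, 1, 0, 1]
pairs = [r_squared(snp_1, snp_2), r_squared(snp_1, snp_3)]
```

Because the result depends on the threshold passed to `fraction_above`, percentages from studies using different significance thresholds are not directly comparable, as noted in point (i) above.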
As r2 between SNPs decayed with genetic map distance, we suggest that linkage between SNPs is an important factor influencing the patterns of LD in the studied germplasm. The r2 reached the threshold of significant LD within 7.4 cM, 45.1 cM, and 20.6 cM for the entire germplasm set, yield type inbreds, and sugar type inbreds, respectively. In addition, the r2 values at binned genetic map distances reached a plateau at 15-20 cM for the entire germplasm set and the two germplasm types. The decay distance we observed was longer than that reported by [6], where r2 declined to 0.1 at 10 cM, and that of [25], where only marker pairs < 3 cM apart showed a high extent of LD. The difference might be due to (i) the rather high density of markers examined in our study compared with earlier studies and (ii) different regression methods used to measure the decay of LD. The observation of slower LD decay for yield type inbreds than for sugar type inbreds, which might be due to the different selection history as outlined above, resulted in smaller effective population sizes Ne calculated for the yield type inbreds than for the sugar type inbreds (Table 2). The results indicated that different numbers of markers are required for genome-wide association mapping in the different types of germplasm.
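For orientation, a widely used expectation linking LD between two loci to the effective population size is given below. Whether the Ne values in Table 2 were obtained from exactly this form or from a variant with a sampling correction is not restated in this section, so the relation should be read as an assumption. Here c denotes the recombination frequency between the two loci:

```latex
E(r^2) = \frac{1}{1 + 4 N_e c}
\quad\Longrightarrow\quad
\hat{N}_e = \frac{1}{4c}\left(\frac{1}{r^2} - 1\right)
```

Under this relation, a higher r2 at a given c, i.e. a slower decay of LD with distance, corresponds to a smaller Ne, which matches the ordering of the two germplasm types described above.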
The high proportion of SNP loci pairs in significant LD as well as the decay of LD with distance suggested that association mapping is a tool applicable in the context of sugar beet breeding. However, both in the entire germplasm set and in the two germplasm types, only very few (0.74-6.22%) of the linked SNP pairs showed r2 values > 0.8 (Table 1). Such high r2 values are required in order to allow the detection of marker-phenotype associations explaining less than 1% of the phenotypic variance [32]. This in turn indicates that for genome-wide association mapping in sugar beet, the number of markers has to be dramatically increased compared to the number applied in our study.
We observed different LD levels along the linkage groups of sugar beet (Additional file 8). This observation suggests that estimating the number of markers required for genome-wide association mapping from the genome-wide average of LD is dubious. In this case, important QTL might not be detected, as locally occurring low levels of LD decrease the power to detect them. Therefore, the genome-wide distribution of LD has to be considered when designing SNP genotyping arrays in the context of genome-wide association mapping. Furthermore, the LD patterns found in the pollen parent heterotic pool might not be the right information source for designing SNP genotyping arrays for other germplasm.