Genetic diversity analysis of a flax (Linum usitatissimum L.) global collection

Background A sustainable breeding program requires a minimum level of germplasm diversity to provide varied options for the selection of new breeding lines. To maximize genetic gain of the North Dakota State University (NDSU) flax breeding program, we aimed to increase the genetic diversity of its parental stocks by incorporating diverse genotypes. For this purpose, we analyzed the genetic diversity, linkage disequilibrium, and population sub-structure of 350 globally-distributed flax genotypes with 6200 SNP markers. Results All the genotypes tested clustered into seven sub-populations (P1 to P7) based on the admixture model and the output of neighbor-joining (NJ) tree analysis and principal coordinate analysis were in line with that of structure analysis. The largest sub-population separation arose from a cluster of NDSU/American genotypes with Turkish and Asian genotypes. All sub-populations showed moderate genetic diversity (average H = 0.22 and I = 0.34). The pairwise Fst comparison revealed a great degree of divergence (Fst > 0.25) between most of the combinations. A whole collection mantel test showed significant positive correlation (r = 0.30 and p < 0.01) between genetic and geographic distances, whereas it was non-significant for all sub-populations except P4 and P5 (r = 0.251, 0.349 respectively and p < 0.05). In the entire collection, the mean linkage disequilibrium was 0.03 and it decayed to its half maximum within < 21 kb distance. Conclusions To maximize genetic gain, hybridization between NDSU stock (P5) and Asian individuals (P6) are potentially the best option as genetic differentiation between them is highest (Fst > 0.50). In contrast, low genetic differentiation between P5 and P2 may enhance the accumulation of favorable alleles for oil and fiber upon crossing to develop dual purpose varieties. As each sub-population consists of many genotypes, a Neighbor-Joining tree and kinship matrix assist to identify distantly related genotypes. These results also inform genotyping decisions for future association mapping studies to ensure the identification of a sufficient number of molecular markers to tag all linkage blocks.


Background
Flax (Linum usitatissimum L.) is an ancient crop, grown throughout the world to be sold at market. Domestication events have separated flax into two broad types: seed flax and fiber flax [1]. Seed flax is mainly grown for oil which is rich in omega-3 fatty acid. Preference of flaxseed in human diet is increasing rapidly due to its high dietary fiber, omega-3 oils, and anti-carcinogenic lignans [2]. Flaxseed oil is also used in paints and varnishes for its unique drying properties [3]. On the other hand, fiber flax is grown to harvest fiber for making linen cloth. In recent years, high value product development has been the prime target of fiber industry from flax stem [4].
Diversity is an important characteristic of a sustainable breeding program. More diversity of germplasm provides the breeder better options for selecting parents to develop need-based cultivars. Diversity is also important for association mapping as the broadest diversity is correlated with a rapid LD decay [5]. Diversity in genetic materials occurs due to variation in phenotypic appearance and genotypic background. Initially, the diversity of flax germplasm was assessed based on morphological parameters [6,7], and biochemical marker such as isozymes [8,9]. However, morphometric diversity often leads to false prediction as morphological characteristics are plant developmental stage dependent and environment sensitive [10]. Morphological characterization is also labor intensive and time consuming. In addition, isozyme markers are affected by plant developmental stage [11,12] and are available for only a limited number of loci [13,14]. The limitations of morphological and biochemical markers has led to the development of DNA based markers which are environment independent and do not require previous pedigree information [15]. Molecular marker-based diversity is more precise and economic as it allows breeders to select unrelated individuals among thousands of genotypes within a short period of time which in turn reduces field workload by evaluating only unrelated genotypes. Different molecular marker techniques such as RAPD, AFLP, ISSR, SSR and IRAP has been used to assess the genetic diversity of flax germplasm [16][17][18][19][20][21][22]. The availability of a flax reference genome [23] has created the opportunity of discovery and exploitation of SNP markers, which are abundant and well distributed throughout the genome.
North Dakota State University runs a moderate size flax breeding program to develop improved cultivars with conventional breeding methods. The program is being hampered by the narrow genetic base of parental stocks, as the same sets of parents have been crossed repeatedly in different combinations. To enrich the parental stock, the program now desires to incorporate diverse germplasm to existing parental stock. To speed up the selection procedure and increase the genetic gain per year, the program also desires to apply markerassisted and genomic selection techniques by exploring marker-trait association through genome-wide association mapping. To identify marker trait association, quantitative trait loci (QTL) and association mapping (AM) approaches are commonly used. QTL mapping is done by tracking the cosegregation of QTL and marker loci in biparental mapping populations and reveals low resolution regions due to the relatively low recombination rates of a single cross. AM reveals marker trait association by utilizing linkage disequilibrium (LD) of germplasm collections [24]. Although AM yields high resolution loci by exploiting historical recombinant events, it is affected by population structure which results in false positive association. Thus null or weak population structure and a low level of relatedness among individuals of the germplasm collection is desirable which leads to rapid LD decay and increases the power of marker detection [5].
In this study, we genotyped 350 flax germplasm accessions using 6200 informative SNP markers. The objectives were (1) to explore genetic diversity and differentiation among the genotypes, (2) to investigate the potential of the collection as parental resource and (3) to assess the suitability of the collection for marker-assisted breeding.

SNP profile
The selected 6200 SNPs were distributed across 15 Figure S3). The occurrence of transition SNPs (3532 SNPs) was more than that of transversions (2668 SNPs) with a ratio of 1.32. The frequency of C/T transitions was highest (28.61%) and C/G transversions were lowest (9.56%). Both A/G and C/T transitions occurred in similar frequencies (i.e. A/G 28.35% and C/T 28.61%), whereas the frequencies of four transversions were: A/C 11.61%, A/T 10.40%, C/G 9.56%, G/T 11.45% ( Table 2). The inbreeding coefficient within individuals (F is ), fixation index (F) and observed heterozygosity (Ho) of all the markers were 1, 1 and 0 respectively as all were homozygous. The Shannon's information index (I) of all markers ranged from 0.03 to 0.70 with a mean value of 0.34.
The expected heterozygosity (He) ranged from 0.08 to 0.53 with a mean value of 0.30. The polymorphic information content (PIC) ranged from 0.07 to 0.47 with a mean value of 0.24 (Table S2). Sub-population wise marker diversity parameters are presented in supplementary Table S3.

Population structure
The whole collection was divided into seven subpopulations based on structure analysis using the Delta K approach (Fig. 1a). The NDSU released and other American genotypes were grouped under sub-population-5 (P5) whereas European (Hungary), Turkish and Asian (India & Pakistan) genotypes were under sub-population-1 (P1), sub-population-7 (P7) and sub-population-6 (P6), respectively. Sub-population-2 (P2), sub-population-3 (P3) and sub-population-4 (P4) were composed of a mixture of genotypes of different origins (Fig. 1b). All of the subpopulations consist of oil type genotypes except subpopulation-2, which consists of mostly fiber type genotypes. Among oil types, spring type seed flax belong to P5, winter types belong to P1 and P7, short large seed Indian seed flax belong to P6, Mediterranean or Argentine seed flax belong to P3 and Ethiopian forage type seed flax belong to P4. Based on individual Q matrix, the proportion of pure (nonhybrid) and admixed (containing markers assigned to more than one sub-population) genotypes in each sub-population was calculated.
The proportion of pure accessions in each subpopulation ranged from 18 to 81% at a 0.7 cutoff value. The P5 and P6 contained highest percentage (81%) of pure accessions, whereas P4 contained the lowest percentage (18%) ( Table 3). We also performed principal coordinate analysis (PCoA) to show the genetic similarity among sub-populations. The first two axes explained 18.49% of the total observed variation (Table S4). The PCoA revealed that NDSU released and other American genotypes (P5), Turkish (P7) and Asian (P6) genotypes were well clustered and separated from rest of the genotypes (Fig. 2). In addition to that, we also constructed phylogenetic tree based on neighbor joining (NJ) criteria (Fig. 3). The output of neighbor-joining (NJ) tree analysis was in line with that of structure analysis and PCoA.

Population genetic differentiation
The AMOVA revealed that variance among subpopulations covered 28% of total variation whereas the remaining 72% of total variation accounted for variance among individuals within sub-populations (Table 6) with a F st and Nm value of 0.28 and 0.64, respectively. All pairwise F st comparisons between sub-populations were significant (p < 0.01).
Most of the combinations showed a great degree of divergence (F st > 0.25) [25] except few combinations such  (Table 7). At the loci level, the genetic differentiation, F st ranged from 0.01 to 0.95 with a mean of 0.29 (Table S5). We also performed kinship (IBS) analysis to facilitate the individual genotype selection for desirable cross combinations ( Figure S4). The IBS coefficients ranged from 1.12 to 2. The average coancestry between any two flax genotypes was 1.41. Approximately 80% of the pairwise IBS coefficients ranged from 1.12 to 1.50 (Table S8, Figure S5). Mantel test was performed to show the correlation between geographic and genetic distance among individuals within each sub-population ( Table 8).
Individuals of P4 and P5 showed significant positive correlation between geographic and genetic distance (r = 0.251, 0.349, respectively, and p < 0.05) whereas it was not significant in other sub-populations ( Figure S1). In the entire collection, significant positive correlation (r = 0.30 and p < 0.01) was revealed by mantel test.

Linkage disequilibrium pattern
The linkage disequilibrium (LD) pattern was investigated across the entire collection, each sub-population and chromosome-wise. LD = r 2 values decreased with the increase of distances. In all cases, mean LD was high (r 2 > 0.80) at short distance bin (0-1 kb) and declined with increasing bin distance (Table S6). In the entire collection, the mean linked LD, mean unlinked LD and loci pair under linked LD was 0.41, 0.02 and 2.46%, respectively. The mean linked LD was highest in P6 (r 2 = 0.50), and was lowest in P4 (r 2 = 0.39). In P6, highest proportion (28.22%) of total loci pair was linked, whereas it was very low (1.08%) in P3 (Table 9). We also calculated the LD decay rate. In the whole collection, LD decayed to its half maximum within < 21 kb distance. Each chromosome showed differential rate of LD decay.

Discussion
We used a total of 6200 homozygous SNP markers for diversity analysis of 350 genotypes. The used SNPs were well distributed throughout the genome. The transition SNPs were more frequent than transversion SNPs, indicating that transition mutations are more tolerable to natural selection [26]. A similar result was also found in  other species such as Camelina sativa [27], Camellia sinensis [28], Hevea brasiliensis [29] and Brassica napus [30,31], that may be due to synonymous mutations in protein-coding regions [32]. We also calculated PIC and expected heterozygosity (He) for each marker. The PIC determines the usefulness of any marker for linkage analysis whereas He determines the diversity of haploid markers [33]. We found all markers moderately or low informative as PIC value for all markers was less than 0.5 [34]. Other researchers also found similar results in flax [35], winter wheat [36,37], rice [38] and maize [39].
Bi-allelic nature of SNP marker and probably low mutation rate [40] restrict the PIC value within 0.5. The He value for all markers was always greater than PIC value as PIC value become closer to He with more alleles and with increasing evenness of allele frequencies [33]. Selection of diversified materials is crucial for widening the genetic base of a breeding germplasm collection. In our study, based on the identified SNP markers, the different sub-populations exhibited moderate diversity (average H = 0.22), which is in line with our expectation as flax possesses an autogamous reproduction system. A similar level of diversity was found in one study [41], though other studies revealed both low [42,43] and high [44,45] level of diversity of different group of flax germplasm. The variation in results may be due to the utilization of different markers and different genotype sets by the researchers. The great homogeneity of the diversity indices of different subpopulations in the studied collection suggests that the species is durable enough to avoid the natural loss of genetic variability by drift [46]. We also calculated the Tajima's D value to indicate the abundance or scarcity of rare alleles in different sub-population and selection mechanism behind sub-populations [47]. Sub-population P6 displayed a negative Tajima's D value indicating presence of more rare alleles in this group or recent expansion of the group as most of the individuals of this sub-population are cultivars grown in India and Pakistan. The other six (P1 -P5 and P7) sub-populations showed positive Tajima's D values indicating less rare alleles in those groups or recent population contraction. Previously, negative Tajima's D values were found in flax landraces [1,48] and fiber type flax [1] but it was positive for oil, winter and dehiscent type flax. All seven sub-populations showed significant level of relatedness (r). The negative correlation between diversity indices (H and I) and relatedness indicates that inbreeding and genetic drift play a significant role in reducing genetic variability in the studied population which results in increased differentiation among sub-populations. Similar phenomenon was also found in Arapaima gigas species [49].
The success of any breeding program usually depends on the right choice of parental groups at the inception. The NDSU flax breeding program is comparatively old. The program already has developed some high yielding and high oil content varieties as well as considerable amount of advanced breeding lines. To enrich the parental stock of the on-going program, the genetic diversity of 350 flax germplasms comprising NDSU released varieties and advanced breeding lines were analyzed in this study. We partitioned the whole collection to its maximum i.e. seven sub-populations based on structure, PCoA and NJ-tree analysis though cluster number was less [43,50,51] and more [52] than ours finding in previous studies. The genetic structure among populations is influenced by gene flow, mutation, selection, and mating strategy [53]. In the studied collection of 350 lines, we identified limited gene flow as one of the determinants of genetic differentiation as Nm value was less than one [54]. It was also supported by the relatively large separation of P6 (Indian and Pakistani genotypes) and P7 (Turkish genotypes) from other sub-populations as extensive geographic distance hinders the gene flow. Limited gene flow also led to high genetic differentiation in Calotropis procera [55], in Nelumbo lutea [56] and flax [45]. Sub-population P1, P2 and P3 contained European  [57]. The presence of fiber type genotypes in P2 is likely one of the reasons for separation of P2 with other European groups P1 and P3. The P4 is composed of genotypes from closely located African and Asian countries which indicates exchange of genetic material among those countries. As per our expectation, all NDSU released varieties and advanced breeding lines, Canadian genotypes were grouped under the same sub-population (P5) as advanced breeding lines shared ancestors and historical germplasm exchanged occurred between USA and Canada [16]. The results of the mantel test indicated non-significant correlation between genetic and geographic distances of the studied populations. This supports the sporadic presence of genotypes of different origins in same sub-population, especially in P1, P2, P3, P6, P7. A similar scenario also occurred in a previous diversity analysis study of flax due to weak passport data [22]. However, this was not true for P4 and P5 as the mantel test showed significant correlation between geographical distances and the genetic distances. The significant associations between genetic distances and geographical distances were also detected in pale flax and flax collections [58] and in Linum austriacum (Lineaceae) populations [59].
Hybridization among genotypes from divergent populations will usually produce more diversity, transgressive segregation, and heterosis resulting in higher genetic gain. Pairwise F st is a good indication of the degree of divergence among populations. Both high and low pairwise F st value is good for parent selection depending on the objectives. In the present study, we identified statistically significant large and small pairwise F st values. Similar results were also found in previous studies [52,60]. To develop high yielding and high oil content varieties we will choose breeding parents from divergent subpopulation pairs such as P5 and P6, P7 and P6 as pairwise F st between them is highest (F st > 0.50) These subpopulations also contain different released varieties. For creating dual purpose transgressive segregants, we will choose parents from pair P2 and P6 (F st > 0.50) as P2 contained mainly fiber type and P6 contained oil type genotypes. For quick fixation of both fiber and oil contributing alleles in single individuals, crosses between genotypes of P2 and P5 will be more effective as pairwise F st < 0.20. Within sub-populations, crossing among genotypes will also be useful as AMOVA reveals variance among individuals within sub-population covered a larger portion of total variation than variance among subpopulation. This result is in line with the previous findings [41,45,58,61], but reverse results were also found in recent studies [62,63]. In this case, we could utilize P3, P4 and P1 showing high diversity (h > 2.30). All subpopulations contained both pure (non-hybrid) as well as admixed genotypes. For parent selection, the pure genotypes will be prioritized. However population diversity tends to inflate the real differentiation between any two pair of individuals as it exploits the alleles that not necessarily come from the same parent or ancestor. IBS coefficients are good to decide what individuals will be crossed to combine positive alleles that historically never have been combined. The molecular kinship or coancestry in self-pollinated crops tends to be higher than that in cross-pollinated crops as heterozygosity reduces the probability of two alleles at a locus of being identical by state [64]. In our study, most of the genotypes had weak  relatedness as approximately 80% of pairwise coancestry ranged from 1.12 to 1.50. For identifying specific cross combinations within and among sub-populations, genotypes having low IBS coefficients among them will be utilized.
Most of the economically important traits are quantitative in nature. To develop markers for quantitative traits, association mapping (AM) is used and knowledge of linkage disequilibrium (LD) is useful to determine the number and density of markers and experimental design needed to perform the analysis. Although low LD requires more markers for high resolution, it increases the predictive power of each one [24]. We found that the overall LD of the entire collection was 0.03 and LD decay was not observed within short distance for the entire collection as well as each sub-population. This is because of the autogamous (self-pollination) mating mode of flax [65] and LD declines more slowly in selfpollinated crops where recombination is less effective than in cross-pollinating species [24,66]. The higher LD level was also found in flax [35] and sesame [67] because of self-pollination. We found the slowest LD decay in P6 as the level of genetic variation captured by the target population influences the extent of LD and LD decay is rapid in landraces and accessions compared to related cultivars [68]. We also analyzed chromosome-wise LD decay to select chromosome-wide marker numbers for AM. Our analysis showed that LD decay was high in chromosome Lu13 and Lu8 and low in chromosome Lu1 and Lu3 which was more rapid than LD decay rates in previous findings [50]. This is may be due to the difference in genotype sets and marker sets. This finding indicates that we need to consider more marker for chromosome Lu13 and Lu8 than other chromosomes for better resolution during AM. The overall findings reveal that for fine mapping of QTL by AM, higher markers should be used according to the population and chromosome-wide LD decay rate. Again, selection of populations having low pairwise F st with high but similar level of LD will reduce the number of required individuals and markers for AM analysis. However, population structure and cryptic relatedness also affects AM analysis by increasing the false positive rate [69,70]. To minimize the false positives, we will use a mixed linear model (MLM) with Q-matrix and kinship matrix as covariates [70,71].

Conclusions
In the present study, we used highly informative SNP markers which were developed through GBS analysis. The identified SNPs provide a clear picture of genetic structure, diversity, relatedness and linkage disequilibrium of the studied population which leads to higher precision in parent selection for a need-based future breeding program. These markers will also facilitate QTL mapping, association mapping, to allow us to utilize marker-assisted and genomic-selection breeding tools for multiple traits breeding.

Plant materials
A core collection of 350 flax germplasm accessions originated in 38 countries of 6 continents were collected from North Central Regional Plant Introduction Station (NCRPIS), Ames, Iowa, USA, North Dakota State University (NDSU) released varieties and advanced breeding lines, varieties developed by different institute of USA and Canada (Fig. 6, Table S1).

DNA extraction and sequencing
Young leaves were collected from 30 days old plants and flash-frozen in liquid nitrogen. Tubes were stored at -80 C until lyophilized. The lyophilized leaf tissue was ground in tubes with stainless beads using a plate shaker. DNA was extracted using Qiagen DNeasy Kit (Qiagen, CA, USA) from lyophilized tissue following the manufacturer's protocol. DNA concentration was measured using a NanoDrop 2000/2000c Spectrophotometer (Thermofisher Scientific). The ApekI enzyme was used for GBS library preparation [72]. Sequencing of the library was done at  [74]. After passing all the required steps of TASSEL 5 GBSv2 pipeline, 281,368 unfiltered SNPs were identified. As flax is strictly self-pollinating crop, and this material is assumed to be inbred, all the heterozygous loci were first removed. Heterozygous SNPs are most likely are due to artefactual collapse of homologous sites during alignment. Then VCFtools [75] was used to select bi-allelic SNPs considering the criteria: minor allele frequency (MAF) ≥ 0.05, missing values (max-missing) ≤25%, depth (minDP) ≥ 3 and physical distance (thin) ≤ 500. These filtering steps resulted in a total of 6200 SNP markers.

Data analysis
The collection was divided into genetic groups using STRUCTURE v2.3.4 [69] software. The admixture model, a burnin period of 10,000 and 50,000 Monte Carlo Markov Chain (MCMC) iterations with 10 replications per K (K1-K10), were used as parameters for structure analysis. The optimal number of groups was determined based on DeltaK approach [76] which was performed by Structure Harvester [77]. The individual Q matrix for the optimal K value was generated utilizing membership coefficient matrices of ten replicates from STRUCTURE analysis using CLUMPP [78]. The results of structure analysis was visualized using the Structure Plot v2 software [79]. Principal co-ordinate analysis (PCoA) was conducted based on Nei's genetic distance by covariance standardized approach in GenAlex v6.5 [80]. An unrooted neighbor-joining (NJ) phylogenetic tree was constructed using MEGAX program with 1000 bootstrap [81]. Analysis of molecular variance (AMOVA) was done to partition the genetic variance among the groups identified by STRUCTURE in Arlequin3.5 [82]. The average pairwise between sub-population Fst and relatedness (r) values were calculated using GenAlex v6.5 [80]. GenAlex v6.5 was also used to estimate percentage of polymorphic loci, number of effective alleles, Shannon's information index, expected heterozygosity and unbiased expected heterozygosity of each marker and sub-population. The SNP distribution plot was developed using R package CMplot (available at: https://github.com/YinLiLin/R-CMplot). The  polymorphism information content (PIC) of markers was calculated using software Cervus [83]. Tajima's D value of each group was calculated using MEGAX software [81]. The level of relatedness (r) was correlated with Shannon's information index (I) and diversity (H) in R v3.5.2 [84]. We performed a mantel test [85] within each subpopulation based on genetic distance and geographic distance in GenAlex v6.5 as each sub-population was composed of genotypes, collected from different locations. The kinship (IBS) matrix was calculated using software Numericware i [86] and kinship heatmap and histogram were developed using R package ComplexHeatmap [87]. Linkage disequilibrium (LD) pattern of whole collection and different sub-populations were analyzed using PopLDdecay [88].