Use of genotyping-by-sequencing to determine the genetic structure in the medicinal plant chamomile, and to identify flowering time and alpha-bisabolol associated SNP-loci by genome-wide association mapping

Background Chamomile (Matricaria recutita L.) has a long history of use in herbal medicine with various applications, and the flower heads contain numerous secondary metabolites which are medicinally active. In the major crop plants, next generation sequencing (NGS) approaches are intensely applied to exploit genetic resources, to develop genomic resources and to enhance breeding. Here, genotyping-by-sequencing (GBS) has been used in the non-model medicinal plant chamomile to evaluate the genetic structure of the cultivated varieties/populations, and to perform genome wide association study (GWAS) focusing on genes with large effect on flowering time and the medicinally important alpha-bisabolol content. Results GBS analysis allowed the identification of 6495 high-quality SNP-markers in our panel of 91 M. recutita plants from 33 origins (2–4 genotypes each) and 4 M. discoidea plants as outgroup, grown in the greenhouse in Gatersleben, Germany. M. recutita proved to be clearly distinct from the outgroup, as was demonstrated by different cluster and principal coordinate analyses using the SNP-markers. Chamomile genotypes from the same origin were mostly genetically similar. Model-based cluster analysis revealed one large group of tetraploid genotypes with low genetic differentiation including 39 plants from 14 origins. Tetraploids tended to display lower genetic diversity than diploids, probably reflecting their origin by artificial polyploidisation from only a limited set of genetic backgrounds. Analyses of flowering time demonstrated that diploids generally flowered earlier than tetraploids, and the analysis of alpha-bisabolol identified several tetraploid genotypes with a high content. GWAS identified highly significant (P < 0.01) SNPs for flowering time (9) and alpha-bisabolol (71). One sequence harbouring SNPs associated with flowering time was described to play a role in self-pollination in Arabidopsis thaliana, whereas four sequences harbouring SNPs associated with alpha-bisabolol were identified to be involved in plant biotic and abiotic stress response in various plants species. Conclusions The first genomic resource for future applications to enhance breeding in chamomile was created, andanalyses of diversity will facilitate the exploitation of these genetic resources. The GWAS data pave the way for future research towards the genetics underlying important traits in chamomile, the identification of marker-trait associations, and development of reliable markers for practical breeding. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3991-0) contains supplementary material, which is available to authorized users.

Results: GBS analysis allowed the identification of 6495 high-quality SNP-markers in our panel of 91 M. recutita plants from 33 origins (2-4 genotypes each) and 4 M. discoidea plants as outgroup, grown in the greenhouse in Gatersleben, Germany. M. recutita proved to be clearly distinct from the outgroup, as was demonstrated by different cluster and principal coordinate analyses using the SNP-markers. Chamomile genotypes from the same origin were mostly genetically similar. Model-based cluster analysis revealed one large group of tetraploid genotypes with low genetic differentiation including 39 plants from 14 origins. Tetraploids tended to display lower genetic diversity than diploids, probably reflecting their origin by artificial polyploidisation from only a limited set of genetic backgrounds. Analyses of flowering time demonstrated that diploids generally flowered earlier than tetraploids, and the analysis of alpha-bisabolol identified several tetraploid genotypes with a high content. GWAS identified highly significant (P < 0.01) SNPs for flowering time (9) and alpha-bisabolol (71). One sequence harbouring SNPs associated with flowering time was described to play a role in self-pollination in Arabidopsis thaliana, whereas four sequences harbouring SNPs associated with alpha-bisabolol were identified to be involved in plant biotic and abiotic stress response in various plants species.
(Continued on next page) (Continued from previous page) Conclusions: The first genomic resource for future applications to enhance breeding in chamomile was created, andanalyses of diversity will facilitate the exploitation of these genetic resources. The GWAS data pave the way for future research towards the genetics underlying important traits in chamomile, the identification of marker-trait associations, and development of reliable markers for practical breeding.
Keywords: Matricaria recutita, Chamomile, Medicinal and aromatic plant (MAP), Genetic diversity, Genome-wide association study (GWAS), Bisabolol, Genetic resources, Genotyping by sequencing (GBS), Single nucleotide polymorphism (SNP), Structure Background German Chamomile (Matricaria recutita L. syn. Chamomilla recutita (L.) Rauschert; from hereon chamomile) is described to be native to the Near East and South-Southeast Europe [1] and is grown in many temperate regions of the world. As already mentioned by Hippocrates (fifth century BC, [2]), it has a long history of use in herbal medicine, with applications in gastrointestinal diseases, treatment of infections and inflammatory diseases of the skin, and respiratory problems. The flower heads contain medicinal compounds, and are either used dried (e.g. tea) or the essential oils are extracted. Chamomile has numerous secondary metabolites which are medicinally active, including the terpenes alpha-bisabolol (syn. Levomenol) and its derivates (bisaboloids), matricine/ chamazulene, further terpenes, and several flavonoids [2].
Chamomile is mainly outcrossing [3,4], and as most chamomile varieties are rather landraces, they display a high degree of phenotypic variability reflecting that of natural populations [5], from which they were isolated. Whereas wild populations of chamomile are diploid, agronomically cultivated varieties are either diploid or artificially generated tetraploid, the latter of which have larger flower heads and higher 1000-seed weight [6]. Genetic diversity in chamomile was evaluated in a few studies using a variety of different genetic markers (summarized in [7]), including AFLP [8], RAPD [8,9] and ISSR [10]. Nextgeneration sequencing (NGS) approaches have been successfully used for the analysis and exploitation of genetic and genomic resources in crop plant breeding, including rapeseed, soybean, barley, or maize, e.g. [11,12]. One NGS approach, genotyping by sequencing (GBS), has been widely used in many crop plants (e.g. [13][14][15][16][17][18]), is highly reproducible [13] and reduces genome complexity with restriction enzymes, thus, lowering costs. In contrast, the use of NGS / GBS for the amelioration of medicinal and aromatic plant breeding is relatively underexploited. An NGS approach has not yet been described for the nonmodel plant species chamomile.
In our work, we applied GBS to a set of chamomile origins (varieties, populations, accessions) to identify genome-wide, high quality single nucleotide polymorphisms (SNPs), to estimate genetic diversity and to reveal genetic structure within our cultivated chamomile germplasm collection. Therefore this is the first study reporting the generation and use of SNP markers in chamomile. We furthermore phenotyped the plants for flowering time (FT) and the content of bisaboloids by gas chromatography-mass spectrometry (GC-MS). Using phenotyping and genotyping data, an association study was performed to identify SNPs associated with FT and bisaboloids content, and to search for possible candidate genes underlying FT and the production of the medicinally-important compound alpha-bisabolol. In comparison to other marker platforms like AFLP or RAPD, the high marker density of genome-wide SNPs generated by GBS increases the probability of identifying markers closely linked to genes of interest. The resulting data provide a basis to enhance breeding in chamomile.

Plant material
Ninety one genotypes, representing 2-4 genotypes each from 33 origins of M. recutita L., and an additional 4 genotypes from 2 M. discoidea DC accessions as outgroup, were included in the analysis ( Table 1). The focus of the study on cultivated chamomile, mainly encompassing origins from all over Europe, but also including some locations outside of Europe.
For the purpose of this study, we define an origin as either a registered variety, a gene bank accession, a cultivated or wild population of chamomile. The term population applies to a wild natural population as well as to a cultivated one selected for breeding or agriculture. Genotype refers to a genetically individual plant.

Plant cultivation and ploidy analysis
Seeds were sown in the greenhouse on the 16th July 2014, and seedlings were individually transplanted into single pots (14 cm diameter) after reaching the 20-leaf stage. Between 2 to 4 genotypes per origin were selected for the subsequent analysis, with a focus on elite varieties/ populations (4 genotypes each). Plant ploidy was determined by flow-cytometry according to Otto et al. [19] at the 10-to 20-leaf stage.
Plants were randomized in the greenhouse, organized in a block of plants of 6   Analysis of chemical compounds by gas chromatographymass spectrometry (GC-MS) The content (peak areas) of bisaboloids (terpenoids), namely bisabolol oxide B, bisabolon oxide A, alphabisabolol, and bisabolol oxide A, were analysed by gas chromatography for 80 samples (76 from M. recutita and 4 from M. discoidea). The content of the chemical compounds was determined for each genotype in a sample of pooled flower heads, where flower heads were harvested after the first 2 circles of florets started flowering, scored as the flowering time until maximum half of the florets were flowering. All plants were grown and processed equally. Samples were immediately frozen in liquid nitrogen and stored at −80°C until further analysis to prevent the oxidation of alpha-bisabolol. The analysis by GC-MS was done with modifications according to Köllner et al. [20] and Irmisch et al. [21]. For each sample, the finely ground powder on liquid nitrogen of 80 to 100 mg material from 5 flower heads was used. For each pulverized sample 200-400 μl of n-hexane was added, followed by overnight incubation with shaking at 25°C. A total of 100 μl was transferred to a clean microtube fitted for GC vials. Terpene analysis was carried out with a gas chromatograph (GC 2010, Shimadzu, Duisburg, Germany) equipped with a splitless injector (injector temperature, 220°C; injection volume, 1 μl) and coupled to a quadrupole mass selective detector (Shimadzu). H 2 was used as carrier gas at a flow rate of 1 ml min-1 . Samples were analysed on a Supreme-5 ms column (30 m length × 0.25 mm inner diameter × 0.25 μm film thickness, Chromatographie Service GmbH, Germany). The coupled mass spectrometer was operated with a transfer line temperature of 230°C, a source temperature of 230°C, a quadrupole temperature of 150°C, an ionization potential of 70 eV and a scan range of 50-350 amu.
A flame ionization detector (FID) was used for the quantification of the compounds (μg/g of fresh weight), and operated at 250°C. Compounds were identified by comparison of retention times and mass spectra to those of authentic reference substances obtained from Sigma-Aldrich (Steinheim, Germany), Roth (Karlsruhe, Germany), or by reference spectra in the Wiley and Shimadzu libraries [22].

DNA extraction and genotyping-by-sequencing (GBS)
DNA was isolated from the 95 samples according to the manufacturer's instruction using the Agencourt Chloropure kit (Agencourt Bioscience Corp., Beverly, Massachusetts, USA). After isolation the DNA was analysed on a 1.3% Agarose gel (Bio&SELL Universal-Agarose), and the DNA concentration was determined relative to a DNA standard. DNA dissolved in TE buffer was shipped to the Cornell University Biotechnology Resource Center (BRC, Cornell, USA) for GBS analysis.
Sample preparation and sequencing was performed at the BRC using a protocol modified from Elshire et al. [3] (http://www.biotech.cornell.edu/brc/genomics-facility) using the enzyme ApeKI for digestion. GBS libraries constructed were sequenced in the BRC Genomics Facility on the Illumina HiSeq 2000/2500 (100 bp, single-end reads).

Sequencing data analysis and SNP detection
In the absence of a reference genome, the assembly was done de novo. The 156 × 10 6 barcoded reads from the 95 samples (genotypes) were demultiplexed, trimmed for restriction site and barcode, filtered and clustered using the software pipeline pyRAD v. 3.0 [23]. If not described differently, the default settings were used. The 4 samples from M. discoidea were assigned as outgroup, with the 91 remaining samples also defined as threshold (n = 91) for the maximum number of individuals with a shared heterozygous site to account for the high proportion of polyploid samples and the outcrossing nature of M. recutita. The minimal number of samples per locus was set to 75, i.e. 75 from 91 samples having data at a particular locus, the parameter for the clustering threshold of reads within and between individuals was set to 0.88, and filtering for barcodes, adapters and cut sites was done strictly setting value "2". The maximum ploidy level was appointed as tetraploid.
Using the above criteria, the pyRAD pipeline filtered and called 6495 SNPs from the 91 M. recutita genotypes.

Analysis of the genetic diversity and genetic structure
The genetic diversity and structure of cultivated chamomile was analysed with the filtered 6495 high-quality SNPs using the admixture model of STRUCTURE 2.3.4 [24][25][26]. The outgroup (4 genotypes from M. discoidea) was excluded from the analysis, since M. discoidea was clearly different from M. recutita. Ten independent analyses with K (number of clusters) ranging from 1 to 15 with 15,000 iterations as burn-in were performed followed by 50,000 supplementary Markov Chain Monte Carlo (MCMC) iterations. The optimal K was obtained according to Evanno et al. [27] using STRUCTURE Harvester [28]. For K = 3 and K = 7, an additional 10 independent analyses with 25,000 iterations as burn-in followed by 100,000 supplementary MCMC iterations were done and used for further analysis. The burn-in length was assessed by monitoring the parameters. CLUMPP v 1.1.2 [29] was used to align the results of individual runs (Greedy algorithm).
The allele frequencies were assumed to be correlated between the origins of chamomile rather than independent [25], as chamomile is a mainly outcrossing species whose seeds are exchanged for cultivation and breeding between regions, and thus the investigated populations / origins can share similar allele frequencies [25]. The admixture model was used, since it is more general and does not assume each individual to belong to a single cluster [26]. The value for lambda (allele frequencies parameter) was set to 0.3 after lambda was estimated in STRUCTURE. Otherwise, the default settings of STRUCTURE were used.
Additionally, the supermatrix of concatenated sequences, 682,699 bp long, was used to compute a phylogenetic tree. The neighbor joining method [30] with 100 bootstrap and keeping nodes above 75% was used to obtain a consensus phylogenetic tree with Geneious v.10.0.9 (https:// www.geneious.com, [31]). To validate the results from the neighbor-joining and STRUCTURE analysis, principal coordinate analysis (PCoA) was done.

Analysis of phenotype-genotype associations
To obtain markers for GWAS analysis (genome wide association studies), the full sequences (contigs) identified by the pyRAD pipeline as harbouring all the 6495 SNPs of the STRUCTURE analysis were used to detect 44,468 polymorphic markers within a set incorporating all 95 genotypes. Due to the lack of genetic or physical map, the minor number of missing profiles were imputed by means of Random Forest algorithm [32] through R package "missForest" [33]. Quality control was applied such that markers with call rates less than 90% were excluded [34,35]. In the end, out of the 44,468 polymorphic markers 16,972 markers, including 16,333 biallelic and 639 multiallelic, were available for subsequent GWAS. Since the single results in GWAS were subsequently evaluated (including consideration of the p-values and r 2 -values), different criteria for SNP filtering were applied than for the genetic structure analysis so that more SNPs could be used as basic data for the GWAS.
Considering the lack of phenotyping replicates here, estimating the heritability was not feasible. Nevertheless, genomic heritability [36,37] is a well-suited surrogate for the heritability. We estimated genomic heritability via a replicated (20×) 5-fold cross validation, as the square of average correlation from the 20 replications comparing predicted genotypic values and phenotypic values of respective traits.
GWAS was performed with the R package "rrBLUP" [38] using phenotypic data and the 16,972 markers obtained. The kinship matrix was estimated with a G matrix [39] using 5768 markers (out of the total 16,972 markers) whose minor allele frequency was above 0.05. The final significance test for each marker/allele was accomplished using F-test in "rrBLUP". The percentage of phenotypic variance explaining each significant marker was estimated via R 2 by fitting a regression between phenotypes and marker profiles, as implemented in R using the function "lm".
The DNA sequences containing highly significant markers as identified by GWAS were annotated using firstly BLAST (blastn and tblastx; [40,41], http://blast. ncbi.nlm.nih.gov/Blast.cgi) against the NCBI plant genomes nucleotide collection of flowering plants data using the discontiguous megablast algorithm. Mapping and annotation of the BLAST-results were performed using Blast2GO 4.0.2 and default settings [42][43][44][45]. Additionally, blastn against the Helianthus annuus (sunflower) XRQ genome assembly was done at https://www.heliagene.org/HanXRQ-SUN-RISE/ [46], since sunflower is the closest related species with a sequenced genome available, belonging together with chamomile in the Asteraceae. Local BLAST analysis was done with CLC Genomics Workbench 8.5 (QIAGEN Aarhus A/S).

Results
Analysis of genetic structure reveals less genetic diversity for tetraploid than for diploid chamomile origins As expected, the 4 genotypes of M. discoidea (outgroup) were clearly genetically separated from M. recutita (see Fig. 1, Additional file 1: Fig. S1 and Additional file 2: Fig. S2), and were thus excluded from further analyses. Using STRUCTURE, the optimal numbers of clusters (k) that described the population structure of the investigated chamomile were determined as 3 and 7 (Additional file 3: Fig. S3) for the main (Additional file 4: Fig. S4) and more detailed substructures (Fig. 2, Additional file 5: Table S1), respectively. Although the allele frequencies were assumed to be correlated between the origins of chamomile, STRUCTURE was nonetheless additionally run with the setting of independent allele frequencies, which lead to far less population structure differentiation (as expected and described by Falush et al. [25]), but did not significantly differ from the outcome of the analysis with correlated allele frequencies (Additional file 6: Fig. S5). In general, the STRUCTURE analyses demonstrated that the single plants of an origin were genetically similar, with moderate exceptions occurring (Fig. 2, e.g. 005 = 'Lutea' or 010 = ' Argenmilla').
The content of the bisaboloids was highly variable between the genotypes Large differences in the content of single bisaboloids were measured between the genotypes and origins, with single genotypes tending to be rich in only one of the four bisaboloids, i.e. plants showing a high content of alpha-bisabolol often had a low content of bisabolol oxide A and bisabolon oxide A (Fig. 4, Table 2). The plants from the outgroup M. discoidea had low contents of bisaboloids, only bisabolon oxide A was present to some extent (Fig. 4).   Fig. 4) were the 4 genotypes from 'Manzana' (002_02, 002_04, 717_03, 717_04) and the genetically defined group (see above and Fig. 2) consisting of 'Camoflora' , 'Lutea' and one genotype from 'Goral' (007_01). The alpha-bisabolol rich genotypes were characterized by two main STRUCTURE clusters each (> = 9.5% share of each genetic cluster, compare Table 2 and Fig. 2).
Genome-wide association study (GWAS) identified significantly associated SNP-markers for the traits flowering time and alpha-bisabolol The estimated genomic heritabilities for the traits flowering time and content of alpha-bisabolol were above 0.4 (0.48 and 0.40 respectively, Fig. 5, Additional file 11: Table S3), while that for the content of bisabololoxide B, bisabolone oxide A and bisabololoxide A were clearly lower (0.01; 0.01; 0.14, Fig. 5, Additional file 11: Table S3). Thus, genome-wide association studies were performed  Table S1). The top row indicates the geographical origin (upper) and ploidy (lower) for each sample according to   for flowering time and alpha-bisabolol only, focusing on the two traits for which a high portion of the phenotypic variability can be explained genetically. GWAS resulted in the identification of 9 SNPs located in 5 different sequences for flowering time (FT) and 72 SNPs located in 60 different sequences for alpha-bisabolol content, all of which were above the defined significance threshold 0.01 under Bonferroni correction (Additional file 12: Table S4). Three sequences (2445, 3029, 4585) harboured more than one SNP associated with FT, and ten sequences (256, 1379, 1998, 2208, 2380, 2489, 5681, 6623, 6440, 6670) more than one SNP associated with alpha-bisabolol content.

Flowering time (FT)
The BLAST alignment [40,41] of the 5 sequences harbouring significantly associated SNPs (Additional file 12: Table S4) against the NCBI nucleotide collection of flowering plants identified 2 sequences with significant hits to potential candidate genes (BLAST-score of > = 80 with a sequence identity of > = 80% and an E-value <1 × 10 −13 ) in multiple plant species. Both sequences harboured more than one significantly associated SNP: (1) "seq. 2445" with 2 SNPs: xyloglucan endotransglucosylase/hydrolase protein with significant alignments to 30 plant species; and (2) "seq. 4585" with 3 SNPs: uncharacterized protein with significant alignments to 9 plant species, and protein kinase / salt-induced ABC1 kinase below the BLAST aligment results threshold in 2 plant species (Additional file 13: Table S5). Xyloglucan endotransglucosylase/hydrolase is described to be involved in self-pollination in Arabidopsis thaliana [47].
BLAST aligment (megablast) of the FT associated sequences to the sunflower genome [46] yielded hits for 4 of the 5 sequences on different chromosomes (multiple hits on 6 chromosomes for "2445", single hits all in different chromosomes for "441", "3029" and "4585").

Alpha-bisabolol content
Out of the 60 sequences (contigs) associated with alpha-bisabolol content in our study (Additional file 12: Table S4), 31 could be aligned by BLAST against the NCBI nucleotide collection of flowering plants, of which 13 sequences achieved matches above the selected threshold (listed in Additional file 13: Table S5 together with the suggested functional role of the putative gene: BLAST-score of > = 80 with a sequence identity of > = 75% and an E-value <1.33 × 10 −13 ). For 12 of these 13 sequences specific gene products were described in multiple plant species (Additional file 13: Table S5).
Whereas 8 (sequences 3218, 3223, 3249, 4194, 6598, 6694) of the 12 sequences are involved in synthesis or regulation within universal pathways and associated ubiquitous proteins in many eukaryotes or plants (Additional file 13: Table S5), 4 sequences are described to play a role in plant biotic and abiotic stress response. Interestingly, one sequence (3223) is described to be involved both in basic function and in abiotic stress response ( [48]: armadillo/beta-catenin repeat family proteins). Ascorbate peroxidase (722) responses are directly involved in the protection of plant cells against adverse environmental conditions [49], Glutamate receptor-like channels (1139) in plants are discussed to fulfill a function in response to an attack from pests and pathogens [50], and the chaperone protein ClpB1 (5721) is described to be a heat shock protein required for acclimation to high temperatures [51].   The last sequence (1998) is predicted to encode a nuclear fusion defective 4-like protein, which is described to be required for karyogamy during female gametophyte development and fertilization in Arabidopsis thaliana [52]. The BLAST aligment (megablast) of the 60 sequences associated with alpha-bisabolol content against the sunflower genome [46] yielded 35 matches above the threshold for the E-value of 10 −3 . 12 of the 13 sequences from the alignment to different plant genomes (Additional file 13: Table S5, except the microsatellite sequence "1379") were above the threshold value for the alignment against sunflower genome.
The 35 sequences with E-value above the threshold, were aligned to all 17 chromosomes of the sunflower genome, with minimum of 1 (chromosome 17) and maximum of 7 to 8 (chromosomes 8, 10 and 13) sequences  Table S6). Among these 35 sequences, 14 were aligned to only 1 sunflower chromosome, 10 to 2 different sunflower chromosomes, and the remaining 11 to multiple chromosomes, with sequences "4194" and "5751" both being aligned to 6 chromosomes. Ten sequences were aligned to multiple loci on the same chromosome. Tblastx gave more hits against the sunflower genome, but with higher E-value and only consisting of more multiple hits against the same chromosomes.
The sequences for which specific gene products in multiple plant species were identified (see above) were not concentrated on any of the sunflower chromosomes (Additional file 14: Table S6: sequence name written bold). Based on these data, no clustering of these 35 highly alpha-bisabolol associated sequences to single sunflower chromosomes could be observed.
Mapping by Blast2GO of the BLAST results to retrieve gene ontology terms for FT and alpha-bisabolol content achieved no matches. Thus, no functional annotation could be done.

Discussion
Rare autopolyploidization events are a potential cause for low genetic diversity of the cultivated tetraploid chamomile Tetraploid chamomile varieties were generated a very few times artificially by polyploidisation of diploid plants, as evidenced by the STRUCTURE analyses (Fig. 2). The large, genetically homogenous group of 14 tetraploid orgins, obtained from all over Europe, is likely to originate from the same polyploidised diploid genetic background, i.e. similar parental material, with minor genetic differences arising via selection and introgression during breeding. These data indicate that most of the high-performing tetraploid chamomile origins are genetically highly similar, and justify the demand to further broaden the diversity by including different germplasm in the future breeding process. Two smaller, genetically different groups of tetraploid origins with "extended" genetic background include plants with the highest alpha-bisabolol content (see below): (1) The tetraploid variety 'Manzana' , which was generated from diploid 'Degumille' [53] and (2) the group containing 'Lutea' , 'Zloty Lan' and partly 'Goral'.
Two tetraploid genotypes (003_03 from 'Bodegold' , 004_04 from 'Camoflora') were genetically diverse, each consisting of several STRUCTURE-clusters, although both were similar in "STRUCTURE fingerprints" to their corresponding diploid genotypes. Considering 'Bodegold' , registered in 1962 as a tetraploid variety [54], which consisted of di-and tetraploid plants, we hypothesize that contamination with closely related wild growing diploid chamomile through seeds or (unreduced) pollen has led to variable ploidy [19]. Contrarily, 'Camoflora' (a diploid variety) might have experienced fertilization with both unreduced male and female gametes or unreduced egg cells crossed with cultivated tetraploid chamomile, which both could have led to the detection of tetraploid genotypes in our study. However, frequent gene flow between di-and tetraploid chamomile is not expected due to the different ploidy of the gametes, which would typically lead to highly sterile triploid F 1 -progeny [55,19]. Consequently, an increase in the genetic diversity of tetraploid chamomile origins by crosses with diploids is not likely to happen spontaneously, and hence the inclusion of different new origins in breeding programmes would likely be advantageous.
Geographic origin showed no congruence with genetic structure In our analysis, there was no clear congruence between the genetic structure (clusters) and geographic origin ( Fig. 2 and Additional file 9: Fig. S8b), i.e. no clustering due to geographic origin could be found. Even though the origins from Argentina (010) and North Korea (516) indeed seem to be genetically distinct from the European  Table S3) ones (Additional file 9: Fig. S8b), the samples from Russia (23), Egypt (26) and USA (27) showed strong similarity to diverse European origins. An explanation could be that regular exchange of plant material between the regions took place for this mainly outcrossing species, and that in our panel of plants were mainly cultivated chamomile origins with good agricultural performance. Franke and Schilcher [9] discriminated the origin of chamomile flowers by their chemotype, but only from wild and not cultivated collections. For the crop plant Brassica rapa, Tanhuanpää et al. [56] analysed 61 accessions with SNP markers and found grouping corresponding to morphotype and flowering habit, but not to geographic origin. Their results are comparable to our data, indicating the limited value of the geographic origin alone (without correlation to other traits) to genetically structure cultivated outcrossing plants for which the exchange of genetic material is not inhibited. Consequently, only with numerous wild collected samples from the natural regions of origin (e.g. Iran) specific geographic footprints in the genetic patterns might be revealed.

Flowering started earlier for diploid than for tetraploid chamomile plants
Diploid plants flowered in average significantly earlier than tetraploid ones (Fig. 3, Additional file 10: Table S2), as has been shown for Chamerion angustifolium [57,58]. It is hypothesized that (auto-) tetraploid plants develop slower and flower later due to the longer time necessary for replication of a doubled genome [59,60], which consequently may lead to slowing of metabolism and growth rates in polyploids [61,62]. Since tetraploid chamomile is autotetraploid, possible effects of allopolyploidy like allelic interactions or epigenetic modifications of homoeologous loci leading to nonadditive gene regulation [63,64], are not considered. The high genomic heritability observed for the flowering time in our study (Fig. 5) additionally indicates a strong genetic influence on the flowering time in chamomile, as has been described for Arabidopsis and Brassica. Mayfield et al. [64] hypothesized that late flowering in these genera is a dominant trait, as flowering time probably depends on the expression of a repressor gene. Thus, late flowering in tetraploids compared to diploids could be additionally influenced by higher chances for dominant alleles to be present. Bao et al. [65] identified factors regulating flowering and endopolyploidization in Arabidopsis and suggest an involvement of cell cycle control in the timing of reproductive transition.

Identification of the flowering time (FT) associated sequences in chamomile can enhance breeding
The 5 sequences containing the SNPs significantly associated with flowering time (Additional file 12: Table S4) may serve as a starting point to develop molecular markers to be used in marker assisted selection (MAS) for FT in chamomile. Early flowering (=early harvesting) could help circumvent negative effects on yield due to summer drought, whereas late flowering might be of advantage in cold regions with late frost, especially when sowing was already done in autumn. Additionally to the effect of the ploidy on FT, molecular markers could facilitate the breeding for early or late flowering varieties by selecting genotypes with genetic factors for these traits.
The BLAST alignment demonstrated that the orthologue of one of these 5 sequences ("2445"; Additional file 12: Table S4) codes for protein connected to flowering (i.e. self-pollination) in A. thaliana [47]. Further investigations would be required to unravel the function of the identified sequences in chamomile.
High alpha-bisabolol content detected in tetraploid plants from two genetically defined groups indicates its limited occurence in cultivated chamomile Only 9 chamomile genotypes (origins 'Manzana' , 'Lutea' , 'Goral' and 'Zloty Lan') from 2 STRUCTURE groups (Fig. 2) showed a high content of alpha-bisabolol, as has been described for 'Manzana' , 'Lutea' , and 'Goral' Here, only one out of three analysed 'Goral' genotypes showed a high bisabolol level, while two of three 'Zloty Lan' genotypes were alpha-bisabolol-rich with the third genotype being rich in alpha-bisabolol oxide B, in contrast to the data of Franke and Schilcher [2] and Das [7]. Our data for 'Lazur' also differ from the data of Franke and Schilcher [2] as we detected high levels of bisabolol oxide A and B but no bisabolon. These differences could be explained by genotypic variability in the polymorphic varieties of this outcrossing species, potentially increased by introgression with other chamomile populations during cultivation. This potential variability should be taken into account when maintaining a variety characterized by high-levels of specific compounds serving as medicinal drug, as well as when breeding new varieties.
The pharmaceutically important sesquiterpene alphabisabolol was detected in 28 of the analysed plants. It was previously suggested that an oxidative biosynthetic pathway converts the alpha-bisabolol to bisabolol oxide B or to bisabolol oxide A, and from the latter further to bisabolone oxide A [66,67].
The 9 alpha-bisabolol rich genotypes belonged to 4 varieties (see above) from 2 different tetraploid groups, both possessing 2 main genetic clusters as defined by STRUCTURE (compare Table 2 and Fig. 2). However, a general conclusion that heterotic effects lead to the higher content of alpha-bisabolol cannot be deduced from our data ( Fig. 4 and Fig. 2), as many origins with several main genetic clusters each had no detectable alpha-bisabolol content. Since only 2 tetraploid groups showed plants with high alpha-bisabolol content in our study, their membership in two main genetic clusters could also be a random effect explained by a breeding process of choosing different genetic materials as parents, but not necessarily leading to significant heterotic effects. This view is also supported by the analysis of the average heterozygosity. Some alpha-bisabolol rich genotypes displayed an elevated level of heterozygosity over the whole DNA-sequences, but more genotypes with the same level of heterozygosity contained no alpha-bisabolol (Additional file 15: Fig. S9, Additional file 16: Table S7). Wagner et al. [16] describe the inheritance of the ability to produce alpha-bisabolol as monogenic, which does not support the involvement of a heterosis effect in its content. The gene most likely responsible for alpha-bisabolol formation is the alphabisabolol synthase McTPS2 identified in chamomile [68]. The expression of such a terpene synthase was often found to be dominant in many plants including maize [69,70]. In a genome wide association study of maize, several SNPs with significance for terpene production were located in terpene synthase genes [71]. However, in chamomile, none of the SNPs associated with alpha-bisabolol content were located in the sequences of alpha-bisabolol synthase or other genes with an apparent involvement in terpene biosynthesis [21]. Some of the SNPs controlling alphabisabolol biosynthesis may affect regulatory factors that also cause a positive introgression/heterosis effect.
It cannot be concluded that tetraploidy leads to an enhanced potential for elevated levels of alpha-bisabolol, since only some tetraploids had high levels of it, and for 'Manzana' the diploid starting material ('Degumille': [53]) is already described to contain high levels of alphabisabolol.
Connection between high alpha-bisabolol content with genes for plant defense and stress response Thirteen out of the 60 sequences harbouring the 72 SNPs significantly associated with alpha-bisabolol content could be annotated, 9 of which had orthologues in multiple plant species, of which 4 are described with roles in plant biotic and abiotic stress response (Additional file 13: Table S5).
The association of genes for plant defense and stress response with alpha-bisabolol content is consistent with the antifungal and antibacterial effects of alpha-bisabolol and its oxides [10,72], which are part of their medical value. Secondary metabolites often function in plant defense and stress response, and in particular, terpenoids (a class of substances to which the bisaboloids belong) play vital roles in plant defense [73]. Furthermore, synthesis and accumulation of natural products (i.e. secondary metabolites) is enhanced in drought-stressed plants, or by application of signal transducers [74,75]. Hence, we hypothesize that moderate stress might enhance the production and thus concentration of alpha-bisabolol in chamomile, and yield as long as the overall growth of the plant is not hampered.

Conclusion
Our analyses of high quality SNPs enabled us to identify genetic clustering in chamomile, with some genotypes belonging mainly to one genetic cluster, while others demonstrate evidence for introgression beween different genetic clusters. Tetraploid varieties tended to be less diverse than the diploid ones. Several origins were identified as being genetically different from most other origins, suggesting their possible use for exploiting heterosis. We identified several genotypes with high levels of the medicinal important substance alpha-bisabolol. Single genotypes tended to be enriched in only one of the four bisaboloids. GWAS identified highly significant (P < 0.01) SNPs for flowering time (9) and alpha-bisabolol content (71). The data from GWAS pave the way for future research towards the identification of marker trait associations, their underlying genetics and their use in marker assisted selection (MAS) for practical breeding in German chamomile.