Characteristics of the completed chloroplast genome sequence of Xanthium spinosum: comparative analyses, identification of mutational hotspots and phylogenetic implications

Background The invasive species Xanthium spinosum has been used as a traditional Chinese medicine for many years. Unfortunately, no extensive molecular studies of this plant have been conducted. Results Here, the complete chloroplast (cp) genome sequence of X. spinosum was assembled and analyzed. The cp genome of X. spinosum was 152,422 base pairs (bp) in length, with a quadripartite circular structure. The cp genome contained 115 unique genes, including 80 PCGs, 31 tRNA genes, and 4 rRNA genes. Comparative analyses revealed that X. spinosum contains a large number of repeats (999 repeats) and 701 SSRs in its cp genome. Fourteen divergences (Π > 0.03) were found in the intergenic spacer regions. Phylogenetic analyses revealed that Parthenium is a sister clade to both Xanthium and Ambrosia and an early-diverging lineage of subtribe Ambrosiinae, although this finding was supported with a very weak bootstrap value. Conclusion The identified hotspot regions could be used as molecular markers for resolving phylogenetic relationships and species identification in the genus Xanthium.

various traditional medicinal treatments in multiple countries [11]. Parts of the X. spinosum plant are used for the treatment of cancer and diarrhea [12,13], intermittent fever related to hydrophobia and rabies [14], and rheumatoid arthritis [15], and have antibacterial [14] and antiviral properties [14,[16][17][18]. Although several antimicrobial substances and their functions have been studied in X. spinosum over the past five decades, no exclusive genetic or genomic studies have been conducted to date.
Universal molecular markers such as the plastid genes rbcL and psbA and nuclear internal transcribed spacer (ITS) have been widely used for the rapid and precise identification of plant species but have proved unsuccessful for distinguishing very closely related species [19][20][21]. The genus Xanthium is commonly known as cocklebur, and is a close relative of the genus Ambrosia. The number of species in the genus Xanthium remains under debate, and this genus may include 5 to more than 20 species [22][23][24][25]. Phylogenetic analyses of several plastid and nuclear DNA markers have shown conflicting results for Xanthium and its relatives [11]. By contrast, Somaratne et al. (2019) used 46 cp protein-coding genes (PCGs) to resolve the phylogenetic positions of Xanthium and Parthenium and revealed that Parthenium is not an earlydiverging lineage of the subtribe Ambrosiinae. However, most plant cp genomes contain highly conserved structures that are useful molecular markers for the identification of plant species in genome-wide evolutionary studies; such structures provide significant quantities of genetic information and can resolve taxonomic and phylogenetic relationships [26,27].
In the present study, we examined both plastome evolution and the phylogenetic relationships within Heliantheae. For this purpose, we first sequenced and characterized the X. spinosum cp genome and compared it with the X. sibiricum cp genome as well as those of closely related species of Heliantheae. In addition, we identified hotspot regions of sequence variation and clarified the evolutionary dynamics among Xanthium species.

General features of the cp genome and its organization
The complete cp genome of X. spinosum was 152,422 bp in length. The cp genome showed a typical quadripartite structure containing two short inverted repeats (IRa and IRb) (25,075 bp) separated by a small single-copy (SSC) region (18,083 bp) and a large single-copy (LSC) region (84,189 bp) (Fig. 1). The cp genome encodes 115 unique genes, including 80 PCGs, 31 transfer RNA (tRNA) genes, and 4 ribosomal RNA (rRNA) genes. Six proteincoding, six tRNA, and four rRNA genes were duplicated in the IR regions. The overall GC content of the cp genome was 37.4%, while those of LSC, SSC, and IR regions were 35.4, 31.2, and 43%, respectively (Table 1).

Comparative analyses of Xanthium species
The borders of LSC-IRb and SSC-IRa in the cp genome of X. spinosum were compared to three other closely related species of Heliantheae, namely, X. sibiricum, Ambrosia artemisiifolia, and Parthenium argentatum [28,29] (Fig. 2). An intact copy of the rps19 gene was present in the LSC/IRb borders of X. spinosum, A. artemisiifolia, and P. argentatum, as well as a shared 95 bp to 119 bp sequence in the IRb region adjacent to the rpl2 gene. By contrast, the X. sibiricum rps19 gene was completely shifted to the LSC region, 71 bp away from the IRb region, despite the rpl2 gene being present at the LSC/IRb border. In addition, 154-175 bp of the fragmented rps19 gene in all four species was present at the IRa/LSC, LSC/IRa regions or its border. On the other hand, ѱycf1 was present in the IRa/SSC border of X. spinosum, whereas it was located in the IRb or silenced in the SSC region of X. sibiricum and A. artemisiifolia, and was situated in the SSC region of the P. argentatum cp genome. The entire ndhF gene was present in the SSC region of all four cp genomes. Similarly, an intact ycf1 gene was present in the SSC/IRa region of all of the cp genomes analyzed, except P. argentatum, which has a 565 to 583 bp fragment of ycf1 in the IRa region. However, P. argentatum encodes two copies of ycf1 in its genome. The trnH gene sequences are located in the LSC region 0 to 118 bp from the IRa/LSC border in all cp genomes.
The cp genomic sequences of four Heliantheae species were analyzed using mVISTA software to detect variation among the sequences (Fig. 3). The sequence divergence differed markedly among regions. The data revealed that the non-coding region was more divergent than its coding counterparts. Relative to the LSC and SSC regions, IR regions of all cp genomes were less divergent.

Repeat structure and SSR analyses
The presence of repeat sequences in the X. spinosum and X. sibiricum cp genomes was analyzed and the species were compared. Repeats in the X. spinosum cp genome consist of 264 forward, 256 palindromic, 251 reverse, and 228 complement. By contrast, X. sibiricum contained 18 forward, 15 palindromic, 6 reverse, and 2 complement repeats (Fig. 4a). In total, X. spinosum and X. sibiricum contain 999 repeats and 41 repeats, respectively. Among the 999 repeats identified in X. spinosum, repeats of 30-39 bp in length (983) were predominant in the cp genome; the longest repeat was 115 bp and was a palindrome sequence. Similarly, in X. sibiricum, 34 repeats were 30-39 bp in length, and the longest was a palindromic sequence of 177 bp (Fig. 4b).

Synonymous (K S ) and nonsynonymous (K A ) substitution rate analyses
The synonymous and nonsynonymous substitution rates were evaluated for 79 PCGs in the X. spinosum and X. sibiricum cp genomes. The K A /K S ratios of nearly all genes were less than 1, except for the PCG accD (1.56) (Fig. 7).
Positive selection analyses of the accD gene  (Table 3); the 2ΔLnL value was 25.91 and the p-value of LRT was 0 (Table 4).

Phylogenetic analyses
In all, 71 PCGs from 21 cp genome sequences were selected for inferring phylogenetic relationships among closely related species of Heliantheae, and Ligularia fischeri (MG729822) was selected as an outgroup. A maximum likelihood tree was constructed using 71 concatenated PCGs in the cp genomes. The genus Xanthium was closely related to the genus Ambrosia (Fig. 8). Our analyses showed that Parthenium was a sister clade to both Xanthium and Ambrosia, and also an early-diverging lineage of the subtribe Ambrosiinae with a weak bootstrap value (57%).

Discussion
The single circular cp genome structure of X. spinosum was similar to that of X. sibiricum with a typical quadripartite structure and equal GC content (37.45%) unevenly distributed across the cp genome. Relative to the LSC and SSC regions, the GC content is greater in IR regions across both cp genomes, possibly due to the presence of four extremely conserved rRNA genes with high GC content in these regions. The expansion and contraction of IR regions was the main cause of variation in cp genome size, and assessing these differences could shed light on the evolution of related taxa [30,31]. The cp IR boundary regions of X. spinosum were compared to those of closely related species, and little difference was found, except for position changes in ycf1. The sizes of the four cp genomes (X. spinosum, X. sibiricum, A. artemisiifolia, and P. argentatum) were not affected. Moreover, the length of each region and the total genome size were similar to those of most plant cp genomes reported previously [32].
Repeat units, which are dispersed in cp genomes at high frequency, play a significant role in genome evolution [33][34][35][36]. Our comparative analyses of X. spinosum and S. sibiricum cp genomes showed a 24.4-fold higher level of repeats in X. spinosum. An earlier study reported that variation in the number and type of repeats may play a major role in plastome organization; however, we found no correlation between these large repeat regions and rearrangement endpoints [37]. SSRs, also known as microsatellite repeats [38,39], are common in the cp genome, and these sequences display a high level of polymorphism, supporting their use as a genetic marker in previous investigations [40,41]. The contents of different types of SSRs and their distributions among cp regions were similar in X. spinosum and X. sibiricum. Multiple definitions of repeat motifs and repeat number within motifs have been used in the literature; our SSR definition aligns with those of Bilgen et al. [42] and Karaca et al. [43].
The cp genomes of Xanthium showed less variation in non-coding regions than in their coding counterparts. The LSC region exhibited higher divergence levels than the IR and SSC regions (Fig. 6c). Specifically, the two IR regions were least divergent, perhaps due to the presence of four highly conserved rRNA sequences in those regions. The average nucleotide diversity (π) of intergenic regions was 0.0170, almost four times as high as that of PCGs (π = 0.004195), revealing that intergenic regions show greater divergence (Fig. 6d).
Not all PCGs are phylogenetically useful for determining taxonomic discrepancies [44]. In previous studies, several plastid and nuclear DNA markers from noncoding regions have been used to resolve the phylogenetic position of Xanthium species, leading to inconsistent results [11]. Hence, the use of the additional markers and broader taxonomic sampling are required to achieve greater phylogenetic resolution at low taxonomic levels [11,45]. Therefore, in the present study, we proposed a set of 14 divergent regions between X. spinosum and X.  sibiricum to resolve taxonomic discrepancies and provide a genetic barcode for the genus Xanthium. All of these regions are intergenic spacer regions, which might be useful for the development of molecular markers to use in phylogenetic and phylogeographic studies. The 14 sequences identified in the present study are extremely polymorphic compared to the sequences used in previous studies [6,11,45]. Based on our data, molecular markers can be developed for these intergenic regions that may be used for phylogenetic, phylogeographic, and barcoding studies of Xanthium. Moreover, this is the first report of the development of genetic markers based on these regions and their use to distinguish among Xanthium species. In addition, the nucleotide substitution rate and BEB analyses revealed that the accD gene may be under positive selection, and other positively selected sites detected in the present study may drive the accD PCG, supporting the occupation of various habitats [46,47]. The earlier studies indicated that the gene accD encoded plastid beta carboxyl transferase subunit of acetyl-CoA carboxylase (ACCase) which is important for the proper chloroplast and as all stages of leaf growth [48], leaf longevity [49], fatty acid biosynthesis [50,51] and embryo development [52]. Hence, the accD gene Fig. 6 The genetic diversity based on Kimura's two-parameter model. a The P-distance value of protein-coding genes (b) the P-distance value of intron and intergenic regions (c) Boxplots of P-distance value difference among LSC, IR and SSC regions (d) Boxplots of P-distance value differences between protein-coding genes and intron and intergenic regions may have been involved in adaptation to specific ecological niches during the radiation of dicotyledonous plants [53].
Over the past few years, numerous plastid genome databases have been reported, offering an important foundation for resolving evolutionary, taxonomic, and phylogenetic questions in plants [54][55][56][57][58][59][60]. Our phylogenetic analyses showed that the genus Xanthium is most closely related to the genus Ambrosia. Several previous studies have used various methods including cladistic analyses [61,62], cp restriction site variation assessments [63], and sequence analyses [11,64] to understand the position of Xanthium, and these have shown that it is most closely related to Ambrosia species. Previous phylogenetic studies have shown that the genus Parthenium is an early-diverging lineage of the subtribe Ambrosiinae based on three plastid and two nuclear markers. We obtained consistent results, but with weak bootstrap support (57%). Somaratne et al. [6] suggested that Parthenium is not an early-diverging lineage of the subtribe Ambrosiinae, however, their phylogenetic analysis included only 46 cp PCGs. By contrast, we analyzed 71 PCGs in the present study, and the results suggest that Parthenium is an early-diverging lineage of subtribe Ambrosiinae.

Conclusion
We aimed to expand the molecular genetic resources available for the species X. spinosum through highthroughput sequencing and cp genome assembly. The structural characteristics of the X. spinosum cp genome is similar to other angiosperms. However, fourteen highly variable regions were detected and suggested as potential markers for future barcoding and phylogenetic studies of Xanthium species. Hence, the sequence data for the complete X. spinosum cp genome could be used as to distinguish among Xanthium species and resolve the phylogenetic relationships within the Ambrosiinae lineage.

DNA extraction and sequencing of Xanthium spinosum
Leaf material of Xanthium spinosum was obtained from Dr. George A Yatskievych, Curator, Plant Resources Center, University of Texas Herbarium (19-056), Austin, Texas, USA. Total genomic DNA was extracted using a modified cetyltrimethylammonium bromide method [65]. Illumina sequencing was carried out by LabGenomics, Seongnam, South Korea, using the Illumina HiSeq 2500 sequencing system. A paired-end library (150 × 2) was constructed with an insert size of 350 base pairs (bp). Read quality was analyzed with FastQC v0.11.9 [66] and low-quality reads were removed with Trimmomatic 0.39 [67]. The resultant clean reads were filtered using the GetOrganelle v1.6.0 pipeline (https://github.com/Kinggerm/GetOrganelle) to obtain plastid-like reads, and then the filtered reads were assembled de novo using SPAdes v3.12.0 [68]. The complete cp genome sequence of X. spinosum and its gene annotation were submitted to GenBank (MT668935).

Annotation of X. spinosum cp genome
The online program Dual Organellar GenoMe Annotator (DOGMA) was used to annotate the cp genome sequence of X. spinosum [69]. The initial annotation, putative starts, stops, and intron positions were fine-tuned through comparison with homologous genes in the closely related species X. sibiricum [6]. Transfer RNA genes were validated using tRNAscan-SE v1.21 with the default settings [70]. The program OGDRAW v1.3.1 was employed to draw a circular map of the X. spinosum cp genome [71].

Comparative cp genome analyses
The mVISTA program, which uses the Shuffle-LAGAN model, was employed to compare the cp genome of X. spinosum with three closely related cp genomes from X. sibiricum, Ambrosia artemisiifolia, and Parthenium argentatum using the X. spinosum annotation as a reference [72]. The boundaries between IR and SC regions of these species were also compared and investigated.  Repeat sequence and simple sequence repeats (SSRs) analyses The program REPuter was used to predict the presence of repeat sequences in the X. spinosum and X. sibiricum cp genomes, including forward, reverse, palindromic, and complementary repeats [73]. The following parameters were used to identify repeats with REPuter: Hamming distance 3, minimum sequence identity of 90%, and repeat size > 30 bp. Phobos software v1.0.6 was employed to identify SSRs in the X. spinosum and X. sibiricum cp genomes; the match, mismatch, gap, and N positions parameters were set to 1, − 5, − 5, and 0, respectively [74]. For repeat and SSR marker analyses, only one IR region was used.

Anaglyses of genetic divergence
To analyze genetic divergence, the PCGs, intergenic, and intron-containing regions of the X. spinosum and X. sibiricum cp genomes were extracted and aligned independently using Geneious Prime v2020.1.2 (Biomatters, New Zealand). Genetic divergence between these Xanthium species was calculated based on nucleotide diversity (π) and the total number of polymorphic sites using DnaSP v5.10.01 [75]. For this analysis, gaps and missing data were excluded.

Characterization of substitution rates
To calculate the synonymous (K S ) and nonsynonymous (K A ) substitution rates, the cp genome of X. spinosum was compared to that of X. sibiricum. Corresponding single-functional PCG exons were extracted from both genomes and aligned independently using Geneious Prime v2020.1.2 (Biomatters, New Zealand). The aligned sequences were translated into protein sequences and analyzed using DnaSP v5.10.01 to obtain K A and K S substitution rates without stop codons.

Positive selection analyses
Positive selection (M2a and M8) and control (M1a, M7, and M8a) models provided in EasyCodeML software v1.21 [76] were used to identify the occurrence of positive selection (ω > 1) on the accD locus in Heliantheae cp genomes. The sequence of the accD gene was aligned using the program MAFFT v1.4.0 [77], and the maximum likelihood phylogenetic tree was constructed using   [78]. The site-specific model was used to calculate nonsynonymous (K A ) and synonymous substitution (K S ) rates using EasyCodeML. The codon substitution models M0, M1a, M2a, M3, M7, M8, and M8a were analyzed. The likelihood ratio test was used to identify positively selected sites in comparisons of M0 (one-ratio) vs. M3 (discrete), M1a (neutral) vs. M2a (positive selection), M7 (β) vs. M8 (β and ω > 1) and M8a ((β and ω = 1) vs. M8 using a site-specific model [76]. The likelihood ratio test (LRT) for these comparisons was used to evaluate the selection strength and pvalues of less than 0.05 from the chi-square (χ 2 ) test were considered significant. If the LRT p-values were significant (< 0.05), the Bayes Empirical Bayes (BEB) method was implemented to identify codons under positive selection. BEB values higher than 0.95 and 0.99 indicate sites that are potentially under positive selection and highly positive selection, respectively.

Phylogenetic tree analyses
A phylogenetic tree was constructed using 71 PCGs from 21 Asteroideae cp genomes, with L. fischeri as the outgroup. A total of 20 complete cp genome sequences were downloaded from the NCBI Organelle Genome Resource database. The aligned PCG sequences were saved in PHYLIP format using Clustal X v2.1 [79], and phylogenetic analyses were conducted based on the maximum likelihood (ML) method and the GTRI model using RAxML v7.2.6 with 1000 bootstrap replications [78].