Chloroplast phylogenomics and divergence times of Lagerstroemia (Lythraceae)

Crape myrtles, belonging to the genus Lagerstroemia L., have beautiful paniculate inflorescences and are cultivated as important ornamental tree species for landscaping and gardening. However, the phylogenetic relationships within Lagerstroemia have remained unresolved likely caused by limited sampling and the insufficient number of informative sites used in previous studies. In this study, we sequenced 20 Lagerstroemia chloroplast genomes and combined with 15 existing chloroplast genomes from the genus to investigate the phylogenetic relationships and divergence times within Lagerstroemia. The phylogenetic results indicated that this genus is a monophyletic group containing four clades. Our dating analysis suggested that Lagerstroemia originated in the late Paleocene (~ 60 Ma) and started to diversify in the middle Miocene. The diversification of most species occurred during the Pleistocene. Four variable loci, trnD-trnY-trnE, rrn16-trnI, ndhF-rpl32-trnL and ycf1, were discovered in the Lagerstroemia chloroplast genomes. The chloroplast genome information was successfully utilized for molecular characterization of diverse crape myrtle samples. Our results are valuable for the global genetic diversity assessment, conservation and utilization of Lagerstroemia.

botanical surveys, several new crape myrtle taxa (species and variety) were found in Thailand, Vietnam, Cambodia and Laos [2,[5][6][7]. However, several plants are still known only from herbarium specimens. There are 115 Lagerstroemia name records in the Plant List database (http://www.theplantlist.org/), and half of the taxonomic status of the name remains unresolved.
A few phylogenetic studies have been conducted on Lagerstroemia, but the interspecific relationships in this group remain controversial. Phylogenetic relationships within Lythraceae based on chloroplast genic regions (rbcL, trnL-F, psaA-ycf3) plus the ITS region showed Lagerstroemia was sister to Duabanga and strongly supported the monophyly of the genus [8,9]. The phylogenic relationships within Lagerstroemia have been poorly defined overall using several chloroplast markers and/or the ITS and gene regions of the ubiquitinproteasome system [10,11]. The poor phylogenetic resolution in previous studies resulted from limited amounts of DNA sequence data available and the low genetic variation in the chosen molecular markers, likely due to this group's recent origin and rapid radiation.
Chloroplast genomes have proven to be powerful tools for studying phylogenetic relationships in related species because of their small size, high copy number, uniparental inheritance, and conserved gene content and arrangement [12][13][14]. In recent years, the chloroplast genomes have been sequenced and characterized for species identification and phylogenetic study [15][16][17]. However, due to sparse taxon sampling in previous studies, the phylogenetic relationships within Lagerstroemia are still unclear.
A robust phylogeny of Lagerstroemia, including more representative species and a large amount of genetic markers, is essential for understanding the evolutionary history, breeding of new cultivars and conservation of crape myrtle germplasm resources. In this study, we sequenced 20 chloroplast genomes of Lagerstroemia samples using next-generation sequencing (NGS). The aims of this study were: (i) to deepen our understanding of chloroplast genome evolution of Lagerstroemia, (ii) to reconstruct the robust phylogenetic relationship of Lagerstroemia, and (iii) to reveal the divergence times involving this genus.

Characteristics of Lagerstroemia chloroplast genomes
The complete chloroplast genomes of the 20 newly sequenced Lagerstroemia species ranged in length from 151,968 bp (L. guilinensis) to 152,629 bp (L. speciosa) ( Table 1). All chloroplast genomes had the four typical conjoined structures, including the LSC and SSC regions separated by two IR regions (Fig. 1) To identify the sequence divergence hotspots, the nucleotide diversity (π) value within the slide window of 600 bp was calculated (Fig. 2). The π values varied from 0 to 0.0318, the average pi value was 0.00474, the IR region exhibited the least nucleotide diversity (0.00285), and the SSC exhibited high divergence (0.01006). Four highly variable regions (pi > 0.02), including trnD-trnY-trnE, rrn16-trnI, ndhF-rpl32-trnL and ycf1, were detected in the Lagerstroemia chloroplast genomes (Fig. 2). Among these regions, trnD-trnY-trnE was located in the LSC region, rrn16-trnI was located in the IR region, and ndhF-rpl32-trnL and ycf1 were located in the SSC region. We compared the four hypervariable markers and the universal DNA barcodes (rbcL, matK, and trnH-psbA) in more detail ( Table 2). The number of variable sites of the four markers ranged from 38 (trnD-trnY-trnE) to 56 (rrn16-trnI and ndhF-rpl32-trnL), whereas the universal DNA barcodes had lower divergence. The average nucleotide diversity of the four rapidly evolving regions was 0.01941, which was 2.5 times higher than that of the universal DNA barcodes. The identified variable markers had higher resolution compared with the three universal markers, based on the ML tree ( Figure S1).

Phylogenetic analyses
Characteristics of the six different datasets used in this study are shown in Table 3. Dataset-3 possesses the most variable and parsimony-information sites, followed by dataset-2 and dataset-4. As expected, dataset-5 (IR region) had the fewest variable and parsimony-informative sites. Dataset-1 and Dataset-2 strongly supported the monophyly of Lagerstroemia (BS = 100/PP = 1.0). In this study, analyses based on each dataset revealed four clades in the genus Lagerstroemia. Clade I was sister to Clade II, and Clade III was sister to Clade IV. Clade I included four taxa, namely, L. siamica, L. intermedia, L. speciosa, and L. venusta. Only slight differences were found between L. speciosa and L. venusta. L. siamica was sister to L. intermedia. Clade II consists of six taxa: L. villosa, L. floribunda, L. tomentosa, L. calyculata, L. sp. 01, and L. sp. 02. L. villosa was the first divergent species in this clade. Clade III contained three taxa: L. fauriei, L. subcostata and L. limii. These three taxa had longer branch on the phylogenetic tree, indicating significant divergence between each other (Fig. 4). Seven taxa are in Clade IV: L. caudata, L. anhuiensis, L. glabra, L. excelsa, L. guilinensis, L. indica, and L. sp. 03. L. anhuiensis and L. glabra formed a clade and showed short branch in the trees. The topology of the Lagerstroemia samples with high resolution was achieved based on the whole chloroplast genome sequence data (Fig. 4). Figures S2, S3, and S4 show the general decrease in resolution capacity of the topology when either the LSC, IR, or SSC region was used due to the insufficient information.

Divergence time estimate
Different fossil calibration combinations were computed to investigate the variation of estimation values of the divergence times (Table 4). We focused on the Lagerstroemia stem and crown nodes. The estimated age of stemgroup Lagerstroemia showed a different pattern with younger age estimates when the fossil calibration of Lagerstroemia patelii (> 56 Ma, Fig. 5, Note 6) was not included. The Lagerstroemia stem node was 56.34 ± 4.78 Ma, and the Lagerstroemia crown node was 31.06 ± 2.82 Ma, obtained from the 12 fossil-calibrated analyses ( Table 4).
According to the fossil records, Lagerstroemia first appeared in the late Paleocene/early Eocene of the Indian subcontinent [18]. We consider the scenario including all the eight fossil calibrations as the final result (Fig. 5). The stem node of the Lagerstroemia was dated to 60.12 Ma (95 % highest posterior density, HPD: 56.2 −

Informative indicated chloroplast markers for Lagerstroemia
Our results indicate that the mutation patterns of the chloroplast genomes were not uniform. As a whole, the single-copy region possesses a higher divergence than the IR region, and the mutation events of SNPs and indels were not random, but instead were clustered as "mutation hotspots" or "highly variable regions". These results are generally consistent with those from other studies involving chloroplast genomes. Previous phylogenetic studies of Lagerstroemia mainly used the universal chloroplast loci (rbcL, matK, and trnH-psbA) and the ITS, but these did not provide a good resolution of the phylogenetic relationship in this genus [11]. Our results showed that the universal chloroplast markers have low divergence (Table 2), explaining the low resolution in previous studies and highlighting the importance of developing highly divergent markers. In this study, we have identified four highly variable loci: trnD-trnY-trnE, rrn16-trnI, ndhF-rpl32-trnL and ycf1 (Fig. 2). Of these, rrn16-trnI and ycf1 have been considered divergence hotspots by Xu et al. [15], which compared six Lagerstroemia chloroplast genomes and identified 12 highly variable markers. Previously, trnD-trnY-trnE was less used in plant phylogeny. rrn16-trnI is located in IR regions, which are specific to the Lagerstroemia chloroplast genome. In general, mutation hotspots are rare in the IR region. ndhF-rpl32-trnL included two intergenic regions (ndhF-rpl32 and rpl32-trnL), which showed the highest percentage of variable sites and the highest number of information sites (Table 2). However, there was poly A/T structure in this region, which may be regarded as low sequence quality [19,20]. The ycf1 locus was the most divergent marker in the Lagerstroemia chloroplast genome (Fig. 2) and has been broadly used for reconstructing plant phylogeny and species identification [21]. Therefore, the lineage-specific, highly variable markers developed in this study will facilitate further phylogeny reconstruction and DNA barcoding of crape myrtle species ( Figure S1).

Phylogenetics of Lagerstroemia
Lagerstroemia was a monophyletic group based on the morphology [1,3], several chloroplast markers [22] and ITS locus [8]. De Wilde and Duyfjes [5] classified Lagerstroemia into four sections on the basis of the monograph by Furtado & Srisuko [1]. Several morphological features used for morphological classification of Lagerstroemia in previous reports, such as (1) the number of the ridges on the calyx tube, (2) the number of the ridges is the same as or twice the number of sepals, and (3) glabrous or hairy within the calyx lobes, may be observed in the same clade generated based on the molecular classification. For example, in Clade I, the 6-7 ridges on the calyx tube outside in L. venusta is the same as the sepal number, but each of the other two taxa (L. speciosa and L. siamica) has 12 ridges on the calyx tube outside, which is twice the number of sepals. Not ridged (L. calyculata), 5-6 ridges (L. villosa), and 12 ridges (L. tomentosa) are observed in Clade II. It is difficult to satisfactorily quantify the relationship between the ridge number and the sepal number when no ridge is observed. In Clade IV, L. anhuiensis has hairs within calyx lobes, but it is glabrous within calyx lobes in L. guilingensis, L. caudata, L. glabra and L. indica.
Molecular markers, such as AFLP, SSRs [23], were used to distinguish the cultivars of Lagerstroemia species, such as L. indica, L. subcostata, L. limii and L. fauriei. However, the genetic background of the cultivars was unclear, and these markers were not informative to infer the relationship of those species. The chloroplast genome has become an efficient option for increasing plant phylogenomics at multiple taxonomic levels during the past years [24][25][26][27][28][29]. We had used the chloroplast genome data to infer phylogenetic relationships of six Lagerstroemia species, and discovered that the chloroplast genome sequences had effective information to infer the phylogeny of this genus [15].
In this study, we recovered a well-supported and species-level relationship of Lagerstroemia using six different chloroplast genome datasets. It provided strong support for the monophyly of Lagerstroemia, sister to Duabanga, and recovered four major clades (Figs. 3 and  4). However, the four clade classifications were different from the morphological classification of the genus [1]. For example, L. speciosa, L. limii, and L. glabra were in the section Adambea, the molecular results showed L. speciosa was in the clade 1, L. limii in the clade 3, and L. glabra in the clade 4, respectively. In clade I, L. venusta was a hexaploid species [11] and fell within the L. speciosa phylogenetically (Figs. 3 and 4). We inferred that L. venusta might be an allohexaploid species and its female parent was L. speciosa. The branch length was short in most terminal nodes, which showed Lagerstroemia may be undergone a rapid radiation [30,31]. The phylogenomics of Myrtales based on 66 protein-coding genes showed the 14 Lagerstroemia species formed four clades [17]. However, the relationship of Lagerstroemia was inconsistent with this study. The difference might be caused by the longer branch length of L. intermedia [17] which affected the topology of the phylogenetic tree. We used the same dataset to infer a similar tree as this study. Further investigations, including extended sampling, more morphological analysis and additional nuclear markers, are needed to insight the evolution of Lagerstroemia.

Divergence time of Lagerstroemia
The fossil record of the Lagerstroemia consists of leaf impressions, wood, and pollen [18]. According to the fossil record, the oldest confirmed evidence of the Lagerstroemia is a leaf impression of L. patelii from India, which was dated as early Eocene or late Paleocene/Thanetian in age (~56 Ma) [32,33]. The oldest occurrence of accepted Lagerstroemia pollen is from the middle Eocene of Central Java [34]. Those records indicated the origin time of Lagerstroemia was earlier than 56 Ma. Our data also support a late Paleocene origin (~60 Ma, Fig. 5; Table 2).
There were a number of putative fossil Lagerstroemia leaves and wood in the middle Miocene [18]. For example, the leaf species of L. mioparviflora, L. eomicrocarpa and L. siwalica were described from Nepal [35,36], and L. jamraniensis was from the Kathgodam area [37]. The wood fossil record of Lagerstroemia is used as the form genus Lagerstroemioxylon Mädler. The wood is recorded from Sumatra (Lagerstroemioxylon eoflosreginum) [38] and Myanmar (Lagerstroemioxylon irrawaddiensis) [39] and is widely encountered in India at several localities (Lagerstroemioxylon arcotense, Lagerstroemioxylon deomaliensis, Lagerstroemioxylon eoflosreginum) [18,40]. Those fossil records suggest that Lagerstroemia was common and somewhat diverse in the wet subtropical forests of the Indian subcontinent in the middle Miocene. The phylogeny and dating analyses demonstrate a similar pattern of this genus divergence into four clades during the Miocene~20 Ma. Diversification with Lagerstroemia occurred in the Pleistocene5 . 3 Ma, and at this time, this genus is present and persists in Japan [18,41].

Conclusions
In this study, we report 20 newly sequenced chloroplast genomes of the genus Lagerstroemia. The overall genomic structure, including gene number and gene order, was well-conserved. The relationship and divergence times of Lagerstroemia were revealed using complete chloroplast genome sequence data. Four clades were found in this genus. Greater taxon sampling is necessary to determine the number of species, morphological characteristics, evolution and biogeography. Our study showed that the chloroplast genome data will provide adequate information for resolving the phylogenetic relationships in this difficult-to-characterize genus.

Plant materials, genomic DNA extraction and sequencing
According to the morphological classification, the Lagerstroemia was classified into four sections and eight subsections [1]. In order to infer the framework of the phylogenetic relationship, we sampled 20 individuals of 17 described species, which represented all the four sections and six of eight subsections. The materials were obtained from the field, botanical gardens and the herbarium of the Institute of Botany, Chinese Academy of Sciences (PE , Table S1). Three crape myrtle samples could not be accurately identified morphologically because of the lack of morphological characters. In addition to the newly collected material for DNA sequencing, publicly available complete chloroplast genome sequences (15 accessions, Table S1) of Lagerstroemia were also included in this analysis.
Total genomic DNA was extracted from silica-dried leave tissues of living plants and herbarium specimens of this genus following the modified CTAB DNA extraction protocol [42]. The DNA from silica-dried tissue was fragmented to construct 350-bp insert libraries, and the DNA from the herbarium material was constructed using 150-bp insert libraries according to the manufacturer's manual (Illumina Inc., San Diego, CA, USA) and was then used for sequencing. Paired-end sequencing was performed on an Illumina HiSeq X-ten at Novogene in Tianjin, China, yielding Chloroplast genome assembly, annotation, and comparative analyses A four-step approach was employed to assemble the chloroplast genome. First, adaptors were removed, and low-quality sequences were trimmed using Trimmomatic 0.39 [43] with the following parameters: LEAD ING = 20, TRAILING = 20, SLIDINGWINDOW = 4:15, MINLEN = 36 and AVGQUAL = 20. Second, remaining high-quality reads were assembled de novo into contigs using SPAdes 3.6.1 [44]. Third, chloroplast genome sequence contigs were selected from the initial assembly by performing a BLAST search using the L. subcostata chloroplast genome sequence as a reference (GenBank accession number: KF572029). The selected contigs from chloroplast genomes were further assembled using Sequencher 5.4.5 (http://www.genecodes.com). Fourth, Geneious 11.1.2 was used to map all reads to the assembled chloroplast genome sequence to check the four junctions between the inverted repeats (IRs) and the small single-copy (SSC)/large single-copy (LSC) regions.
Chloroplast genome sequences were annotated using Plann [45] and, missing or incorrect genes were checked in Sequin. Physical maps of the circular chloroplast genomes were visualized with OGDRAW [46]. To assess sequence divergence and to explore highly variable chloroplast markers, nucleotide diversity (π) was calculated by sliding window analysis using DnaSP v6 [47], and nucleotide substitutions and p-distance were calculated using MEGA 7.0 [48].

Alignment and data matrix construction
The sequence alignments were constructed with MAFF T v7 [49]. All alignments were visually inspected with MEGA 7.0 [48] and manually adjusted where needed. To access the phylogenetic effects of the different regions in the chloroplast genome, we created six datasets based on different chloroplast genome regions or using different outgroups. All 78 protein-coding genes and four rRNA genes were extracted from the GenBankformatted files containing all chloroplast genomes using Python scripts. Those 82 genes were combined into a concatenated dataset as dataset-1. Dataset-2 included 35 whole chloroplast genome sequences of Lagerstroemia and five other species of Lythraceae as outgroups (Lythrum salicaria, Lawsonia inermis, Rotala rotundifolia, Sonneratia alba, and Duabanga grandiflora). Ambiguous alignment regions were trimmed using Gblocks 0.91b [50] implemented in Phylosuite v1.1 [51]. In addition, the third to sixth datasets only included 35 samples of Lagerstroemia, which were from the complete chloroplast genomes, LSC region, IR region, and SSC region, respectively.

Phylogenetic analyses
We used maximum likelihood (ML) and Bayesian inference (BI) methods for phylogenetic analyses. The datasets were unpartitioned, and the best-fit model was determined by ModelFinder [52]. Maximum likelihood analyses were run with RAxML v.8.1.24 [53]. RAxML searches were made with 500 randomized maximum parsimony starting trees, and RAxML was run again under the same conditions executing 1,000 nonparametric bootstrap replicates to assess the branch support.
BI was run with Mrbayes v3.2 [54]. Two independent Markov Chain Monte Carlo (MCMC) analyses were performed, each with four chains (three heated and one cold) for 20 million generations with sampling of every 100th tree. Each chain started with a random tree, and the first 25 % sampled generations were discarded as burn-in to construct a majority-rule consensus tree and to estimate posterior probabilities (PP). Stationarity was considered to be reached when the average standard deviation of split frequencies was < 0.01.

Fossil priors and BEAST analyses
We used BEAST v2.5.1 [55] to estimate the divergence times using dataset-1 and added seven Lythraceae species and three Onagraceae species to accommodate all available fossil calibrations. This dataset was calibrated using five reliably dated fossils.  [18]. This fossilized pollen was used to offset for the crown of the two lineages. Sonneratiaoxylon preapetalum Awasthi was fossil wood of Sonneratia [56] from the early Paleocene of India (Danian, 67.3 − 63. 8 Ma) and was used to calibrate the most recent common ancestor (TMRCA) of Sonneratia and Trapa to > 63. 8 Ma. We also used the oldest fossil accepted as Punica, which was wood of Punicoxylon eocenicum Privé-Gill from the middle Eocene (48.6 − 40.4 Ma) of Paris [18], and the seed of Lawsonia lawsonioides (Menzel) Mai. [57] from the middle Miocene (16 Ma ago) as conservative offsets on the stem nodes of Punica and Lawsonia, respectively. The oldest confirmed fossil of Lagerstroemia patelii Lakhanpal & Guleria, from the late Paleocene/Eocene (ca. 56 Ma) was used to calibrate the stem age of this genus to > 56 Ma [18,58]. Each of the five fossil priors (Lythrum elkensis/Peplis eaglensis, Sonneratiaoxylon preapetalum, Punicoxylon eocenicum, Lawsonia lawsonioides, and Lagerstroemia patelii) was given a lognormal distribution with offset values as specified (i.e., 81.0, 63.8, 40.4, 16.0, and 56.0 Ma, respectively), and with a mean of 1.5 and a standard deviation of 1, allowing for the possibility that these nodes are considerably older than the fossils themselves. In addition to these fossil priors, we also used three secondary priors. Based on the average value obtained by Berger et al. [59] in a calibrated analysis, three priors were used: (1) the average age of TMRCA of Lythraceae and Onagraceae (the root of the tree) was 104.6 Ma; (2) the crown age of Onagraceae was 85.4 Ma; and (3) the crown age of Lythraceae was 95.5 Ma. Each secondary prior was placed under normal distribution with a standard deviation of 1.
To assess possible calibration incongruence, we ran twelve analyses with calibration combinations ( Table 2). The twelve analyses were run with uncorrelated lognormal distribution (UCLD) relaxed molecular clock models to account for rate variability among lineages, the Yule speciation model and 100,000,000 generations with the MCMC method, sampling trees every 10,000 generations. The stationary phase was examined through Tracer 1.6 [60] to evaluate convergence and to ensure sufficient and effective sample size (ESS) for all parameters surpassing 200. A burnin of 10 % generations was discarded, and TreeAnnotator v2.4.7 was used to produce a Maximum Clade Credibility tree.