The complete chloroplast genome of Stauntonia chinensis and compared analysis revealed adaptive evolution of subfamily Lardizabaloideae species in China

Background Stauntonia chinensis DC. belongs to subfamily Lardizabaloideae, which is widely grown throughout southern China. It has been used as a traditional herbal medicinal plant, which could synthesize a number of triterpenoid saponins with anticancer and anti-inflammatory activities. However, the wild resources of this species and its relatives were threatened by over-exploitation before the genetic diversity and evolutionary analysis were uncovered. Thus, the complete chloroplast genome sequences of Stauntonia chinensis and comparative analysis of chloroplast genomes of Lardizabaloideae species are necessary and crucial to understand the plastome evolution of this subfamily. Results A series of analyses including genome structure, GC content, repeat structure, SSR component, nucleotide diversity and codon usage were performed by comparing chloroplast genomes of Stauntonia chinensis and its relatives. Although the chloroplast genomes of eight Lardizabaloideae plants were evolutionary conserved, the comparative analysis also showed several variation hotspots, which were considered as highly variable regions. Additionally, pairwise Ka/Ks analysis showed that most of the chloroplast genes of Lardizabaloideae species underwent purifying selection, whereas 25 chloroplast protein coding genes were identified with positive selection in this subfamily species by using branch-site model. Bayesian and ML phylogeny on CCG (complete chloroplast genome) and CDs (coding DNA sequences) produced a well-resolved phylogeny of Lardizabaloideae plastid lineages. Conclusions This study enhanced the understanding of the evolution of Lardizabaloideae and its relatives. All the obtained genetic resources will facilitate future studies in DNA barcode, species discrimination, the intraspecific and interspecific variability and the phylogenetic relationships of subfamily Lardizabaloideae. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07484-7.


Background
Herbal medicine has been used as complementary and alternative treatments to augment existing therapies all over the world. The bioactive natural compounds extracted in herbal medicine may have the potential to form new drugs to treat a disease or other health conditions [1]. However, the wild resources of these plant species were on the verge of exhaustion by plundering exploitation with the increasing demand for herbal medicine with significant economic value [2]. Previous studies of herbal medicine species mainly concentrated on the cultivation and phytochemical studies. Whereas, few studies have described the genetic diversity and phylogenetic analysis. The germplasm, genetic and genomic resources need to be developed as potential tools to better exploit and utilize these herbal medicine species [3]. In addition, a good knowledge of genomic information of these species could provide insights for conservation and restoration efforts. Therefore, the molecular techniques are required to analyze the genetic diversity and phylogenetic relationship of these plants.
Chloroplasts contain their own genome, composing of approximately 130 genes, which has a typical quadripartite structure consisting of one large single copy region (LSC), one small single copy region (SSC) and a pair of inverted repeats (IRs) in most plants [4][5][6]. Unlike nuclear genomes, the chloroplast genome is a highly conserved circular DNA with stable genome, gene content, gene order, and much lower substitution rates [7][8][9][10]. Recently, with the development of next generation sequencing, it has become relatively easy to obtain the complete chloroplast genome of non-model taxa [11][12][13]. Thus, complete chloroplast genome has been shown to be useful in inferring evolutionary relationships at different taxonomic levels as an accessible genetic resource [14,15]. On the other hand, although the chloroplast genome is often regarded as highly conserved, some mutation events and accelerated rates of evolution have been widely identified in particular genes or intergenic regions at taxonomic levels [7,[16][17][18]. The complete chloroplast genome has been considered to be informative for phylogenetic reconstruction and testing lineagespecific adaptive evolution of plants.
Lardizabaloideae (Lardizabalaceae) comprising approximately 50 species in nine genera [19]. It's a core component of Ranunculales and belongs to the basal eudicots. Most species of Lardizabaloideae were considered as herbal medicinal plants, which were widespread in China, except tribe Lardizabaleae (including genus Boquila and genus Lardizabala). Stauntonia chinensis DC., belonging to the subfamily Lardizabaloideae, is widely grown throughout southern China, including Jiangxi, Guangdong, and Guangxi provinces [20]. It has been frequently utilized in traditional Chinese medicine known as "Ye Mu Gua" due to its anti-nociceptive, antiinflammatory, and anti-hyperglycemic characteristics [21][22][23]. In this study, we reported and characterized the complete chloroplast genome sequence of Stauntonia chinensis and compared it with another 38 chloroplast genomes of Ranunculales taxa previously published (including species from Berberidaceae, Circaeasteraceae, Eupteleaceae, Lardizabalaceae, Menispermaceae, Papaveraceae, and Ranunculaceae). Our results will be useful as a resource for marker development, species discrimination, and the inference of phylogenetic relationships for family Lardizabalaceae based on these comprehensive analyses of chloroplast genomes.

Results
The chloroplast genome of Stauntonia chinensis We obtained 6.73 Gb of Illumina paired-end sequencing data from genomic DNA of Stauntonia chinensis. A total of 44,897,908 paired-end reads were retrieved with a sequence length of 150 bp, while a total of 41,809,601 of high-quality reads were used for mapping. The complete chloroplast DNA of Stauntonia chinensis. Was a circular molecule of 157,819 bp with typical quadripartite structure of angiosperms, which was composed of a pair of inverted repeats (IRA and IRB) of 26,143 bp each, separated by a large single copy (LSC) region of 86,545 bp and a small single copy (SSC) region of 18,988 bp ( Fig. 1 and Table 1). The genome contained a total of 113 genes, including 79 unique protein-coding genes, 30 unique tRNA genes and 4 unique rRNA genes (Table 1). Of 113 genes, six protein-coding genes (rpl2, rpl23, ycf2, ndhB, rps7, and rps12), seven tRNA genes ((trnI-CAU, trnL-CAA, trnV-GAC, trnI-GAU, trnA-UGC, trnR-ACG, trnN-GUU) and 4 rRNA genes (rrn16, rrn23, rrn4.5, rrn5) were duplicated in the IR regions. The Stauntonia chinensis chloroplast genes encoded a variety of proteins, which were mostly involved in photosynthesis and other metabolic processes, including large rubisco subunit, thylakoid proteins and subunits of cytochrome b/f complex (Table 2). Among the Stauntonia chinensis chloroplast genes, fifteen distinctive genes, including atpF, ndhA, ndhB, petB, petD, rpl2, rpl16, rpoC1, rps16, trnA-UGC, trnG-GCC, trnI-GAU, trnK-UUU, trnL-UAA, and trnV-UAC harbored a single intron, and three genes (clpP, rps12 and ycf3) contained two introns ( Table 3). The gene rps12 had trans-splicing, with the 5′-end exon 1 located in the LSC region and the 3′-exons 2 and 3 and intron located in the IR regions. The overall G/C content was 38.67%, whereas the corresponding values of LSC, SSC, and IR regions were 37.1, 33.68, and 43.08%, respectively.

Codon usage bias pattern
It is generally acknowledged that codon usage frequencies varied among genomes, among genes, and within genes [24]. Codon preferences was often explained by a balance between mutational biases and natural selection for translational optimization [25][26][27]. Optimal codons help to increase both the efficiency and accuracy of translation [28]. The codon usage and relative synonymous codon usage (RSCU) values in the Stauntonia chinensis chloroplast genome was calculated based on protein-coding genes (Table 4). In total, 85 proteincoding genes in the Stauntonia chinensis chloroplast genome were encoded by 26,246 codons. Among the codons, the most frequent amino acid was leucine (2701 codons, 10.29%), while cysteine (310 codons, 1.18%) was the least abundant amino acid excluding the stop codons. Similar to other angiosperm chloroplast genome, codon usage in the Stauntonia chinensis chloroplast genome was biased towards A and U at the third codon position, according to RSCU values (with a threshold of RSCU > 1) [29]. Further, the pattern of codon usage bias in the subfamily Lardizabaloideae and other species in Ranunculales were investigated (Fig. 2, Additional file 1). We found that two parameters (codon bias index, CBI and frequency of optimal codons, Fop) involved in codon usage bias were higher in Lardizabaloideae species than other species in Ranunculales.

Repeats and microsatellites analyses
Five type of repeat structures, including tandem, forward, palindromic, complement, and reverse repeats were identified using REPuter software in eight sequenced chloroplast genomes of Lardizabaloideae species. Overall, 23-40 repeat sequences were identified in each chloroplast genome, of which 3-9 tandem repeats, 7-17 forward repeats, and 11-17 palindromic repeats were separately detected, while few complement and reverse repeats were screened, for instance, only one complement repeat was predicted in Holboellia angustifolia (Fig. 3a). More than half of these repeats (72.5% at least)  had a repeat length between 30 and 50 bp (Fig. 3b), and majority of the repeats were distributed in non-coding regions, including the intergenic regions and introns.

Genome comparison
The border regions and adjacent genes of chloroplast genomes were compared to analyze the expansion and contraction variation in junction regions, which were common phenomenons in the evolutionary history of land plants. To evaluate the potential impact of the junction changes, we compared the IR boundaries of the Lardizabaloideae species (Fig. 4). Although the majority of genomic structure, such as gene order and gene number were conserved, the eight chloroplast genomes of Lardizabaloideae species showed visible divergences at the IRA/LSC and IRB/SSC borders. Some differences in the IR expansions and contractions still existed. For example, the IRB region expanded into the gene rps19 with 87 and 250 bp in the IRB regions of Decaisnea insignis and Sinofranchetia chinensis chloroplast genomes, respectively, although the IRB regions of other six chloroplast genomes were conserved. Thus, we found that the IR regions of the eight chloroplast genomes were conserved, except the chloroplast genomes of Decaisnea  insignis and Sinofranchetia chinensis, which were slightly expanded compared with that of the other species.
To further investigate the divergence of chloroplast genomes among Lardizabaloideae species, a global sequence alignment of eight chloroplast genomes were compared using the annotated chloroplast genome of A. trifoliata as a reference. These closely related species had little difference in genome size, ranging from 157, 797 bp to 158,683 bp. Although sequence similarities were very high in IR regions, the chloroplast genomes exhibited less conserved in LSC and SSC regions (Fig. 5). A sliding window analyses of the whole chloroplast genomes of eight Lardizabaloideae species indicated that most of the variation occurred in the LSC and SSC regions, which exhibited higher nucleotide variability (Pi) in comparison to IR regions (Fig. 6a). As shown in Fig.  6a, the nucleotide diversity values in the LSC and SSC regions ranged from 0.00173 to 0.08625 and from 0.0044 to 0.05637, respectively, while the value was from 0.00 to 0.01131 in the IRs regions. Expectedly, the divergence in intergenic regions was higher than in genic regions, but the ycf1 gene exhibited a higher variability. The most divergent non-coding regions among the eight Lardizabaloideae chloroplast genomes were trnH-psbA, trnK-rps16, rps16-trnQ, trnC-petN, trnT-psbD, ycf3-trnS-rps4, trnT-trnL, accD-psaI, petA-psbJ, ndhF-rpl32, and rpl32-trnL. Although coding regions were conserved, minor sequence variation was observed among the eight chloroplast genomes in the trnK, matK, psaJ, rpl16, ndhF, ccsA, ndhA, and ycf1 gene as shown in Fig. 6b (Pi value > 0.015). Similarly, mauve alignment results revealed that no large structural changes such as gene order rearrangements was detected across these eight chloroplast genomes of Lardizabaloideae species (Additional file 2), although some inversions were present in LSC and SSC regions in other Ranunculales species, such as Pulsatilla chinensis, Anemone trullifolia, and Anemoclema glaucifolium.

Estimating rates of chloroplast evolution and positive selection analyses
Most of Ka/Ks values of these Ranunculales species were less than or close to 1, providing the evidence that these chloroplast genes experienced purifying or no selection pressures ( Fig. 7 and Additional file 3). Furthermore, in Lardizabaloideae species, the Ka/Ks ratios were far less than 1 among Akebia trifoliata, Akebia quinata, Stauntonia chinensis, and Archakebia apetala. However, the Ka/Ks ratio between Holboellia angustifolia and Holboellia latifolia was greater than 1, implying some To further identify chloroplast protein-coding genes that might have undergone positive selection in Lardizabaloideae species, branch-site model analysis was employed by defining Lardizabaloideae species as foreground branch. A total number of 55 single-copy coding genes were considered for the positive selection analysis (Table 5). Although the likelihood ratio test showed that most of p-values were not significant in each gene range, two protein coding genes (rbcL and accD) indicated rejection of a null model (p < 0.05), corroborating the hypothesis that some amino acid sites in these two proteins in clade Lardizabaloideae species have been under positive selection (Table 5). Further analysis using a Bayes empirical Bayes (BEB) procedure identified 25 protein coding genes (accD, atpA, atpE, atpI, ccsA, clpP, ndhD, ndhF, ndhH, ndhI, ndhJ, ndhK, psaA, psaB, psaI, psbA, psbZ, rbcL, rpl33, rpoA, rpoB, rpoC1, rps14, rps2, and ycf3) with significant posterior probabilities suggesting some sites in these genes were under positive selection (Table 5, Fig. 8 and Additional file 4). Among them, 11 genes only had one positively selected site, whereas accD gene contained the largest number of positively selected sites (16 sites). Notably, most of ndh family genes possessed at least one positively selected site, implying this family members were potentially under positive selective pressure in Lardizabaloideae species (Fig. 8).

Phylogenetic analyses
Bayesian and ML trees reconstructed based on the CCG dataset were highly congruent in identifying the phylogenetic position of these seven families in the order Ranunculales (Fig. 9). All nodes of these phylogenetic trees were strongly supported by bootstrap values (BS) in ML analysis and posterior probabilities (PP) in Bayesian analysis. The 39 taxa were classified into five major clades, of which Berberidaceae, Menispermaceae, and Ranunculaceae species clustered into a clade showed a close genetic relationship, while other family species constituted a monophyly. However, the family Circaeasteraceae species showed different position relative to other six families in Bayesian and ML reconstructed trees based on the protein-coding genes CDs dataset. The family Circaeasteraceae species were clustered into a clade with family Ranunculaceae species in phylogenetic tree based on CDs dataset, indicating that Circaeasteraceae had strong support to be a sister to the Ranunculaceae.

Architecture of chloroplast genomes in subfamily Lardizabaloideae
Recently, chloroplast genomes have become to be useful tools to evaluate the genetic divergence among related species [30,31]. Here we present the complete chloroplast genome of Stauntonia chinensis. The organization of the chloroplast genomes among eight Lardizabaloideae species exhibited a high degree of synteny, implying that these genomes were evolutionary conserved at the genome-scale level ( Table 1 On the contrary, there were still a few diverged coding genes, including matK, accD, psaJ, rpl16, ndhF, ycf1, and so on. The matK and ycf1 coding regions had been observed to be highly divergent and could serve as markers for DNA barcoding and phylogenetic analysis [32][33][34][35]. Similarly, nucleotide diversity analysis showed that eight genes (trnK, matK, psaJ, rpl16, ndhF, ccsA, ndhA, and ycf1) among eight Lardizabaloideae species had higher divergence values (Pi > 0.015), implying that they contained more variations than other coding genes (Fig. 6b). Among these genes, matK, ndhF, ccsA, and ycf1 have been  previously detected as highly variable regions in different plants, and some of those were served as DNA barcode [36][37][38][39]. However, previous studies confirmed that both introns and intergenic regions exhibited higher divergence levels than coding regions [40]. In our study, both genome-scale level alignments and nucleotide diversity analyses of the eight Lardizabaloideae chloroplast genomes revealed common variable sites, including eleven intergenic regions and eight coding genes (Figs. 5 and 6).  Previous studies supported that repetitive sequences were considered to play crucial roles in chloroplast genome arrangement and sequence divergence, even those were generally rare among angiosperm plastomes [41][42][43]. Generally, Lardizabaloideae species exhibited a significant difference in number and length of repeats within their chloroplast genomes. Most of the repeats were distributed in non-coding regions, including the intergenic regions and introns, reflecting the fact that the evolution of non-coding regions was higher than that of coding regions (Fig. 3) [44]. However, several repeats occurred in the same gene (ycf2) or paralogs (pasA/psaB and trnS-GCU/trnS-UGA/trnS-GGA), which might be caused by replication slippage, generating improper sequence recombination [45,46]. Because of analytical and highly polymorphic nature, SSRs were considered to be well suited to assessment of genetic diversity within species and their relatives [47,48]. In summary, repetitive sequences present in chloroplast genomes could facilitate the species discrimination and act as tools for investigating levels of genetic diversity in subfamily Lardizabaloideae.

The adaptive evolution and positive selection
The Ka/Ks ratios were important to deduce the evolutionary rates and understand the adaptive developments among species [49]. The pairwise Ka/Ks ratios among Akebia trifoliata, Akebia quinata, Stauntonia chinensis, and Archakebia apetala were far less than 1, suggesting more intense purifying selection in these species, for both conservative and radical nonsynonymous substitutions (Fig. 7, Additional file 3). The lower Ka/Ks ratios might be explained that most genes in these species were likely to undergo deleterious nonsynonymous substitutions, and the purifying selection with stronger selective constraints for nonsynonymous substitutions than for synonymous ones [50,51]. However, the Ka/Ks ratio between H. angustifolia and H. latifolia was greater than 1, implying some chloroplast coding sites of these two species were under positive selection. It is possible that more unknown selective forces might have contributed to the elevated Ka/Ks ratios, and resulted in species divergence [52].
It was suggested that codon sites with higher posterior probability could be also considered as positively selected sites, and genes containing positively selected sites might be evolving under divergent selective pressures [53,54]. Although pairwise Ka/Ks ratios showed most of the chloroplast genes of Ranunculales species experienced purifying or no selection pressures, at least 25 chloroplast protein coding genes were identified with significant posterior probabilities suggesting sites with positive selection in Lardizabaloideae species, which indicated these genes might have evolved to adapt to environmental conditions (Table 5). Notably, we found that five of these 25 genes were associated with photosystem I and II subunits (psaA, psaB, psaI, psbA, and psbZ), while six of ten NADH-dehydrogenase subunit genes (ndhD, ndhF, ndhH, ndhI, ndhJ, and ndhK) possessed at least one positively selected site, implying these family members were potentially under positive selective pressure in Lardizabaloideae species (Fig. 8). Photosystem subunits and NADH-dehydrogenase subunits were essential in light energy utilization and electron transport chain for generation of ATP, which were all important components for photosynthesis of plants [55,56]. Therefore, all these genes, which were involved in important process for plant growth and development, might evolve results of more frequent substitutions to adapt to different environmental conditions. Among all positively selected genes, we found that the accD gene possessed the maximum number of sites under positive selection in Lardizabaloideae species, suggesting that the accD gene may play a pivotal role in the adaptive evolution of these species [57]. In addition, the likelihood ratio tests (LRTs) results showed that p-value of rbcL gene was less than 0.05, corroborating that sites in rubisco large subunit protein in clade Lardizabaloideae species have been under positive selection. As an important modulator of photosynthetic electron transport, recent study has revealed that positive selection of the rbcL gene was fairly common in all the main lineages of land plants [58,59]. Thus, the rbcL gene was widely used to establish the diverse phylogenetic relationships of land plants [18,60]. In summary, positive selection would possibly contribute to subfamily Lardizabaloideae diversification and adaptation.

The phylogenetic analysis in order Ranunculales
Chloroplast genome sequences which contained sufficient information have been widely used to reconstruct phylogenetic relationships among angiosperms even at lower taxonomic levels [61][62][63][64]. The phylogenetic relationships based on CCG dataset were consistent with the Angiosperm Phylogeny Group (APG) IV system of classification [19]. Unexpectedly, the phylogenetic relationships based on both CCG and concatenated proteincoding genes CDs datasets were inconsistent. The phylogenetic tree based on CDs dataset showed that the family Circaeasteraceae species were clustered into a clade with family Ranunculaceae species. This result indicates that species Kingdnia uniflora and Circaeaster agrestis in family Circaeasteraceae had strong support to be a sister to the Pulsatilla chinensis in family Ranunculaceae based on chloroplast protein-coding genes, which was inconsistent with the APG IV classification system. The inconsistent phylogenetic relationships implied a different rate of evolution in coding regions and non-coding regions, which might due to the nucleotide substitutions of non-coding regions were noisy than those.

Conclusions
This is the first report of the complete chloroplast genome sequence of Stauntonia chinensis. The architectural and the phylogenomic analysis of complete chloroplast genomes of eight Lardizabaloideae plants and relevant species could provide valuable genomic resource of this subfamily and its relatives. Meanwhile, several variation hotspots detected as highly variable regions could be served as the specific DNA barcodes. Our genomics analysis of these complete chloroplast genomes will lead to potential applications in the understanding of evolution and adaptation of species in the subfamily Lardizabaloideae.

Plant materials and DNA extraction
Stauntonia chinensis, which was identified by Prof. Liao Liang according to Flora of China, was sampled from Xianyan Mountain in Nanping city (118.10E, 26.73 N), Fujian Province, China. The voucher specimen deposited in Jiujiang University (accession number JJU130801). Approximately 5 g of fresh leaves was harvested for genomic DNA isolation using an improved extraction method [65].

Chloroplast genome sequencing, assembly and annotation
A library with the insertion size of 430 bp was constructed, and all genome data were sequenced using an Illumina Hiseq 4000 platform at BIOZERON Co., Ltd. (Shanghai, China) [66]. The filtered reads were aligned with the Akebia trifoliata chloroplast genome (GenBank accession KU204898), and mapped to the reference chloroplast genomes [67,68]. The chloroplast genes were annotated using an online DOGMA tool, using default parameters to predict protein-coding genes, transfer RNA (tRNA) genes, and ribosome RNA (rRNA) genes, coupled with manual check and adjustment [69].

Codon usage, and repeat structure
Codon usage was determined for all protein-coding genes using the program Codon W 1.44 [70]. The relative synonymous codon usage (RSCU) was calculated to examine the deviation in synonymous codon usage. Six values were used to estimate the extent of bias toward codons: the codon adaptation index (CAI), codon bias index (CBI), frequency of optimal codons (Fop), the effective number of codons (ENC), GC content (GC), and GC content of synonymous third codons positions (GC3s).

Genome comparison and nucleotide divergence
Comparative chloroplast genomes of eight Lardizabaloideae species were carried out and visualized by using mVISTA online software (http://genome.lbl.gov/vista/ index.shtml) [74] with the A. trifoliata as a reference [67]. The large structural changes such as gene order rearrangements, inversions, and insertions were identified using Mauve v2.4.0 with default settings [75]. The chloroplast genome borders were also analyzed to show the IR expansions and contractions. DNAsp v5.10 software was used to analyze the nucleotide diversity (Pi) and sequence polymorphism of Lardizabaloideae species [76].

Species pairwise Ka/Ks ratios and positive selected analyses
The concatenated single-copy gene coding sequences (CDs) of all 39 taxa were extracted and aligned with ClustalW [77]. Pairwise Ka/Ks ratios of all species were calculated using KaKs Calculator v2.0 [78]. The positive selected analyses were performed by an optimized branch-site model and Bayesian Empirical Bayes (BEB) method [53,54]. The single-copy CDs of protein-coding genes of all 39 taxa were extracted and their amino acid sequences were aligned with ClustalW. The branch-site model was performed to test for potential positive selection using the CODEML algorithm implemented in EasyCodeML [79,80]. The ratio (ω) of nonsynonymous to synonymous substitution rates was used to determine the selective pressure. The positive selection, no selection and negative selection were indicated when the ratio ω > 1, ω = 1, and ω < 1, respectively [80][81][82]. The likelihood-ratio tests (LRT) were performed according to Lan et al. [83]. The BEB method was used to compute the posterior probabilities of amino acid residues to identify whether these residue sites had potentially evolved under selection [53].

Phylogenetic analyses
The complete chloroplast genome (CCG) sequences and concatenated single-copy protein coding genes CDs of all 39 taxa were aligned using ClustalW. The phylogenetic analyses were carried out through maximum likelihood (ML) and Bayesian inference (BI) performed in IQ-TREE v1.6.1 and MrBayes 3.1.2, respectively [84,85]. The best-fit models for both datasets were selected by MrModeltest v2.3. The Maximum likelihood analyses were conducted using IQ-TREE with 1000 bootstrap replicates. The BI analysis was run for 100,000 generations and sampled every 100 generations. The first 25% of the trees were discarded as burn-in, and the remaining trees were used to build a 50% majority-rule consensus tree.