Complete chloroplast genomes provide insights into evolution and phylogeny of Zingiber (Zingiberaceae)

Background The genus Zingiber of the Zingiberaceae is distributed in tropical, subtropical, and in Far East Asia. This genus contains about 100–150 species, with many species valued as important agricultural, medicinal and horticultural resources. However, genomic resources and suitable molecular markers for species identification are currently sparse. Results We conducted comparative genomics and phylogenetic analyses on Zingiber species. The Zingiber chloroplast genome (size range 162,507–163,711 bp) possess typical quadripartite structures that consist of a large single copy (LSC, 86,986–88,200 bp), a small single copy (SSC, 15,498–15,891 bp) and a pair of inverted repeats (IRs, 29,765–29,934 bp). The genomes contain 113 unique genes, including 79 protein coding genes, 30 tRNA and 4 rRNA genes. The genome structures, gene contents, amino acid frequencies, codon usage patterns, RNA editing sites, simple sequence repeats and long repeats are conservative in the genomes of Zingiber. The analysis of sequence divergence indicates that the following genes undergo positive selection (ccsA, ndhA, ndhB, petD, psbA, psbB, psbC, rbcL, rpl12, rpl20, rpl23, rpl33, rpoC2, rps7, rps12 and ycf3). Eight highly variable regions are identified including seven intergenic regions (petA-pabJ, rbcL-accD, rpl32-trnL-UAG, rps16-trnQ-UUG, trnC-GCA-psbM, psbC-trnS-UGA and ndhF-rpl32) and one genic regions (ycf1). The phylogenetic analysis revealed that the sect. Zingiber was sister to sect. Cryptanthium rather than sect. Pleuranthesis. Conclusions This study reports 14 complete chloroplast genomes of Zingiber species. Overall, this study provided a solid backbone phylogeny of Zingiber. The polymorphisms we have uncovered in the sequencing of the genome offer a rare possibility (for Zingiber) of the generation of DNA markers. These results provide a foundation for future studies that seek to understand the molecular evolutionary dynamics or individual population variation in the genus Zingiber. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-023-09115-9.


Background
Zingiber Boehm. is a diverse genus of the family Zingiberaceae and consists of approximately 100-150 species that are widely distributed in the tropical and subtropical regions of Asia and Far East Asia [1,2]. Zingiber contains many economically important species. Some species have long-lasting inflorescences and an assemblage of tightly clasped, brightly colored bracts and floral that often highly showy. They are widely used as landscaping and cut-flower in floral arrangements including chocolate pinecone ginger (Z. montanum) and Chiang Mai Princess (Z. citriodorum) [1][2][3]. In addition, some Zingiber species are widely cultivated as edible crop and among the best-known nonprescription drugs in traditional medicinal systems such as myoga ginger (Z. mioga), shampoo ginger (Z. zerumbet) and ginger (Z. officinale) [4][5][6]. Ginger have the pharmacological and biological potential effects of analgesic and anti-inflammatory, antibacterial, antitumor and antidiabetic [7][8][9]. In recent years, ginger was even considered as an alternative therapeutic agent for COVID-19 treatment based on its anti-viral activity [10][11][12].
The genus Zingiber could be distinguished based on nutritional and floral characteristics [1,2]. Previous studies have shown that, species of Zingber can be divided into four groups, namely sect. Zingiber, sect. Dymczewiczia, sect. Pleuranthesis and sect. Cryptanthium based on the habit of inflorescences [13][14][15]. However, sect. Dymczewiczia was amalgamated with Sects. Zingiber and resolved as sister to sect. Pleuranthesis with weak support value according to the phylogenetic analysis of internal transcribed spacer (ITS) sequence of 23 Zingiber species and pollen morphology [16]. Zingiber species share similar characteristics of leaves and other vegetative organs, which makes it extremely difficult to identify species in the non-flowering stage [1][2][3]. Recently years, efforts have been made to explore the phylogenetic relationships among Zingiber species based on molecular data [16][17][18][19]. Kerss, et al. [17] found low resolution in identifying six Zingiber species using ITS and chloroplast matK regions. According to the analyses of amplified fragment length polymorphism (AFLP) DNA markers, Z. montanum was closely related to Z. zerumbet other than to Z. officinale [18]. These results were also revealed by Li, et al. [19] based on the complete chloroplast genome data. Overall, these previous studies have succeeded in clarifying the phylogenetic relationships of some Zingiber species, however, only small number of samples were used and the relationships among many species within the genus Zingiber are still unclear.
Chloroplast genomes have been used to address the chloroplast genome evolution, patterns and rates of nucleotide substitutions and phylogenetic relationships among land plants [20]. Chloroplast is a kind of vital organelle that can transform light energy into chemical energy in green plants [21,22]. The chloroplast genome usually has a typical quadripartite structure consisting of a large single copy (LSC) region, a small single copy (SSC) region, and two copies of inverted repeats (IRs) shows and encodes 110-130 genes with a size range of 120-180 kb and [23][24][25]. In compare with mitochondrial and nuclear genome, chloroplast genome is typically inherited maternally and non-recombining [26]. Although the chloroplast genome structure is usually conserved in angiosperms, variations in genome size, genome structure, and gene substitution rate have been identified [27,28]. In recent years, more than 40 complete chloroplast genomes have been sequenced in the family Zingiberaceae and divergent hotspots, which could be used for phylogenetic analyses, have been identified [25,[29][30][31]. However, only seven chloroplast genomes of Zingiber have been reported, which hindering the molecular plant identification and phylogenetic relationship clarification of Zingiber species. High throughput sequencing technology has made obtaining chloroplast genome sequences more practical and provides a unique opportunity to study the evolution of the chloroplast genome and the phylogeny of the genus Zingiber.
In this study, to characterize the genome structures, gene content, phylogeny and other characteristics of Zingiber, we sequenced chloroplast genomes of fourteen Zingiber species (Table 1). Then, we explored the molecular features of each genome and compared them with six other published chloroplast genomes within the Zingiber. Finally, we determined the chloroplast genome sequence variation, molecular evolution and phylogenetic relationships among 20 within the Zingiber.

Features of the Zingiber chloroplast genomes
All fourteen sequenced chloroplast genomes of Zingiber have a typical quadripartite structure containing one large single copy (LSC), one small single copy (SSC) and two inverted repeat regions (IRA and IRB) (Fig. 1 genes, including 79 protein coding genes, 30 tRNA genes, and 4 rRNA genes (Tables 1 and 2). Among the different protein coding genes in our fourteen sequenced chloroplast genomes, 61 genes are located in the LSC regions, 12 genes are located in the SSC regions, and 8 genes are duplicated in the IR regions (Table 1). There were 18 genes containing introns, most of them have only a single intron, whereas ycf3 and clpP genes contain two introns ( Table 2).

Codon usage and RNA editing sites
Codon usage patterns and nucleotide composition help to lay a theoretical foundation for genetic modifications of the chloroplast genome [32]. is the least common, with a frequency of 1.14-1.18% (Fig. 2a). Because of the value of relative synonymous codon usage (RSCU) > 1.00, thirty codons show codon usage bias in protein coding genes of the 14 sequenced  (Table S1). In the 14 identified chloroplast genomes that we sequenced, the ndhB gene has the highest number of potential editing sites (11 sites), followed by the ndhD gene (7 sites) ( Table S1). All of these editing sites are C-to-T transitions that occur at the first or second positions of the codons.

Contraction and expansion of inverted repeats (IRs)
A comprehensive comparison at the LSC/IRs/SSC boundaries was performed among the 20 Zingiber species (Fig. 5). Although the inverted repeat regions (IRA and IRB) are the most conserved regions of the chloroplast genome, shrinkage and expansion of the IR boundaries are hypothesized to help explain size differences between chloroplast genomes beyond genus. The length of the IR region in the 14 chloroplast genomes exhibited a modest expansion, ranging from 29,765 bp to 29,957 bp. Within the 20 chloroplast genomes of Zingiber species, the rpl22 and rps19 genes were located in the boundaries of the LSC/ IRB regions (Fig. 5). There were 20-115 bp between rpl22 and the LSC/IRB borders, and the distance between rps19 and the LSC/IRB boundary ranged from 108 bp to 157 bp (Fig. 5). The SSC/IRA boundary was situated in the ycf1 coding region, which crossed into the IRA region in all 20 Zingiber species. However, the length of ycf1 in the IRA region varied among the 20 Zingiber species from 309 bp to 3922 bp (Fig. 5).
The rps19 and psbA genes were situated in the boundaries of the IRA/LSC regions in all 20 Zingiber species, in which the distances between rps19 and the IRA/LSC border ranged from 108 bp to 157 bp (Fig. 5). For all 20 Zingiber species, a 95-156 bp distance was observed between the psbA gene and the IRA/LSC border (Fig. 5).

Phylogenetic analyses
The phylogeny of 55 Zingiberaceae species were well resolved (Fig. S1). Zingiber is monophyletic (BS = 100%) and was well resolved as sister to Kaempferia with strong support (BS = 100%). Based on the chloroplast genome dataset, we generated a well-resolved phylogeny of Zingiber (Fig. 8). The support values of all the branches in both ML and BI trees were robust (BI = 1.0, BS = 100%). Thus, we will not include the support values in the text below. Zingiber was divided into three sections: sect. Crytanthium, sect. Zingiber, and sect. Pleuranthesis. Sect. Crytanthium is resolved as sister to sect. Zingiber. There are four major clades of the sect. Crytanthium. The first branch was well supported and comprised Z. flavomaculosum + Z. densissimum as sister to Z. yingjiangense + Z. orbiculatum. The second clade was Z. recurvatum + (Z. koshunense + Z. cochleariforme). Within the rest of the sect. Cryanthium, two subclades were recovered: Z. teres + Z. smilesianum and Z. mioga + (Z. leptorrhizum + Z. striolatum). In sect. Zingiber, Z. xishuangbannaense, subsequently followed by Z. officinale, Z. neotruncatum, Z. zerumbet, Z. montanum was sister to Z. corallinum + Z. purpureum. As for sect. Pleuranthesis, which contains only one species (Z. ellipticum).

Discussion
In this study, 14 Zingiber chloroplast genomes were newly reported. Their genome size (162,481 bp-163,711 bp), GC content (35.89-36.18%), genome quadripartite structure, gene composition, all of the protein-coding genes, tRNA and rRNA showed high similarity, which were in consistent with other Zingiberoideae chloroplast genomes [25,[29][30][31]. The conservation of plastomes had been observed in various angiosperms such as Malvaceae, Araceae in which the same gene content and gene order had been reported [34][35][36][37]. Nevertheless, plastome rearrangement, gene duplication, gene loss and intron loss are reported in a number of plant lineages [22,25,38]. Although structure variations occurred in some Zingiberoideae plants for example, both trnS-GGA and trnT-GGU were lost in the chloroplast genome of Globba schomburgkii. The lhbA gene were lost in both Hedychium coccineum and Hedychium neocarneum [25]. However, the chloroplast genomes of Zingiber species were highly conserved in current study, which is in agreement with previous studies at genus level Camellia [39], Sinosenecio [40], and Chrysosplenium [41]. Plastomes are very conservative which was maintained by multiple molecular mechanisms including uniparental inheritance, rarity of plastid fusion, and the presence of an active repair mechanism [21,35]. Hence, the typically conservative nature of the Zingiber plastomes is linked to a certain molecular mechanism. The expansion and contraction at the borders of the IR regions of chloroplast genomes is common in angiosperms, which may cause size variations, gene duplication or the reduction, and the origination of pseudogenes [20,42,43]. Abnormal expansion of IR regions had been observed in some taxon e.g., Pilea [44], Erodium [45] and Pelargonium [46], which transferred numerous genes from the SC regions into the IR. In this study, we found that expansion and contraction of IRs showed much similarity among the species of the genus Zingiber, and the distribution and locations of gene types in these regions were highly consistent. These results are in agreement with previous report of Zingiberoideae [25]. The IR/SSC boundary shifts always cause the increased length in the IR regions. Here, we found the IR/SC boundary of Zingiber is relatively stable. The pseudogene of ycf1 originated at the junction of IR in Zingiber plants which was also observed in other angiosperms [25,35]. Compared with the chloroplast genomes of six Zingiber species published in NCBI, the length of IR region of all species assembled by ourselves was basically the same, and no gene loss was detected. Overall, the conservation of the IR of the Zingiber plants may be one of the reasons for its stability in length and structure.

Table 3 Positive selective amino acid loci and estimation of parameters
Highly variable regions are always used as DNA barcode markers for the studies on species identification and phylogenetic analyses. The high similarity of the vegetative characteristics has made it extremely difficult to distinguish Zingiber plants [16]. Since some classical DNA barcodes are insufficient for species identification and phylogeny of Zingiber, it is very important to find more highly variable regions at genus level that could be developed as representing potential markers for future variety identification research. Based on the results of mVISTA and nucleotide diversity, eight highly variable regions among 20 Zingiber species are identified including seven intergenic regions (petA-pabJ, rbcL-accD, rpl32-trnL-UAG , rps16-trnQ-UUG , trnC-GCA-psbM, psbC-trnS-UGA and ndhF-rpl32) and one genic region (ycf1). These highly variable regions could be used as potential DNA barcode for species identification and phylogenetic analysis for the Zingiber species. Among them, ndhF-rpl32, rpl32-trnL-UAG and ycf1 were reported as suitable for species identification at subfamily and genus level in Zingiberoideae [25]. The ycf1 gene is the most variable site in the chloroplast genome, showing greater variability than existing chloroplast candidate barcodes such as matK and rbcL [47]. Other five intergenic regions that identified in the present study are also reported in other plants at species level. For example, petA-pabJ was demonstrated well utilization as DNA barcodes for Lindera plant [48] and rbcL-accD was identified to be an effective marker for Rumex species [49]. Sun, et al. [50] suggested that petA-psbJ, ndhF-rpl32 and rpl32-trnL potentially be used as molecular genetic markers for population genetics and phylogenetic studies of Magnolia polytepala. And rps16-trnQ-UUG , trnC-GCA-psbM, psbC-trnS-UGA are also reported in previous studies [51,52]. Generally, although several candidate barcoding regions were identified, further research is still necessary to determine whether these highly divergent markers could be used in the identification and phylogenetic analyses of Zingiber species.
Positive selection is assumed to play key roles in the adaptation of organisms to diverse environments [53], while negative (purifying) selection is a ubiquitous evolutionary force responsible for genomic sequence conservation across long evolutionary timescales [54]. In this study, 19 genes with positive selection sites are identified in Zingiber. Among these genes containing amino acid positive sites, we found that ycf1 and ycf2 genes possess higher number (52, 24, respectively) of positive amino acid sites within Zingiber species, suggesting that the ycf1 gene may play important roles in the adaptive evolution of Zingiber species. Six genes (rpl12, rpl20, rpl23, rpl33, rps7 and rps12) encoding ribosomal subunit proteins are under positive selection, and these genes are considered to be essential for chloroplast biogenesis and function, suggesting that Zingiber plants may increase the adaptability of evolution by regulating encoding ribosomal subunit proteins in chloroplasts [55]. Moreover, eleven genes, namely ccsA, clpP, ndhA, ndhB, petD, psaA, psbB, psbC, rbcL, rpoC2 and ycf3, have also been identified with positive selection sites in current study. Recent studies have indicated that these nineteen genes with positive selection in some angiosperms are common. For examples, ccsA, rbcL, rpoC2 have been identified under positive selection in Orchidaceae, Euterpe, and Pterocarpus [24,56,57]; In Zingiberoideae, ccsA, ndhA, ndhB, psbJ, rbcL, rpl20, rpoC1, rpoC2, rps12, rps18, ycf1, ycf2 and ycf4 have also been identified under positive selection [25]. Zingiber species mainly inhabited warm, humid, semi-shaded environment and maintain a high level of plant diversity [1,3]. Therefore, based on our analyses, we believe that positive selection of these chloroplast genes may be promote the adaptation of Zingiber plants to semi-shaded environment, but the detailed adaptation mechanism needs further in-depth research.
The phylogenetic analysis of 55 Zingiberaceae species showed that Zingiber was well resolved as sister to Kaempferia with strong support (Fig. S1), which is consistent with previous studies [17,19,25,[58][59][60]. Previously, the classification of Zingiber species was usually based on the type of inflorescence and pollen morphology, which generally solved the classification problems of Zingiber plants [61]. Zingiber was classified into three sections based on ITS sequences analyses together with similarity in pollen morphology and inflorescence habit [16,17]. Our species-level phylogenetic tree of Zingiber showed that three traditionally accepted sections were monophyletic with strong support. In different with the result of Theerakulpisut [16] based on the ITS analyses, our results strongly supported sect. Cryptanthium as sister to sect Zingiber rather than sect. Pleuranthesis. Conflicts between phylogenetic trees delineated by chloroplast genomes and nuclear genes are also common in some angiosperms, such as Asteraceae and Zingiberaceae [62][63][64][65][66][67][68]. The conflict phenomenon may be due to reticulate evolution in the events of rapid diversification or uniparental inheritance of the plastome [35,62]. However, the mechanism that leads to the conflict in Zingiber require further in-deep research. Additionally, the phylogeny indicated strong support for interspecies relationships. In sect. Zingiber, Z. purpureum was well resolved as sister to Z. corallinum. Z. xishuangbannaense, a species endemic to china, was resolved as the first lineage split from Zingiber in this study. The reminder Zingiber species formed a monophyletic clade with strong support, which is consistent with previous studies [16,25]. The rest of the sect. Zingiber formed a strong supported clade. Although Theerakulpisut, et al. [16] recognized this clade, but the bootstrap value is below 50% and relationships among a number of lineages of this clade are uncertain. Our results demonstrated that Z. neotruncatum subsequently followed by Z. zerumbet was sister to Z. montanum + (Z. corallinum + Z. purpureum). For sect. Cryptanthium, 12 species, including 9 newly sequenced species in this study, were sampled, which is the mostly densest sampling to date. The relationships among lineages of sect. Cryptanthium were well resolved with robust support and provided a back bone for further classification at the infrageneric level and for investigating the biogeography of this group.

Conclusions
In this study, fourteen complete chloroplast genomes of Zingiber species have been sequenced, assembled and annotated for the first time. The structural characteristics of these fourteen chloroplast genomes are shown to be conservative, which are similar to those reported chloroplast genomes of Zingiberoideae species. Meanwhile, comparative analyses of 20 Zingiber chloroplast genomes have generated 8 highly variable regions, which may be used as a potential source of molecular markers for species identification. Based on whole chloroplast genomes data, phylogenetic relationships among 20 Zingiber species have been clearly resolved. We found sect. Cryptanthium as sister to sect Zingiber rather than to sect. Pleuranthesis. The conflict phenomenon may be due to reticulate evolution in the events of rapid diversification or uniparental inheritance of the plastome. In addition, 19 genes are under positive selection with high posterior probabilities, which may play important roles in Zingiber species adaption to semi-shaded environment. Overall, our research has greatly enriched the genome resources of Zingiber, which will help to further analyze the phylogeny of Zingiber and resolve the genetic relationships within Zingiber in the future.

Plant material, DNA extraction, and sequencing
A total of 21 chloroplast genomes were used for this study, including seven chloroplast genomes obtained from Gen-Bank (www. ncbi. nlm. nih. gov/ genba nk) and fourteen newly generated in this study (Table 1). Genomic DNA was isolated from silica-gel dried leaf tissue or herbarium specimens (Table S2)

Codon usage and RNA editing sites
Codon usage patterns and nucleotide composition could help to lay a theoretical foundation for genetic modifications of the chloroplast genome [32]. Here, to examine the deviation in synonymous codon usage, the relative synonymous codon usage (RSCU) was calculated using the software CodonW (University of Texas, Houston, TX, USA) with the RSCU value (Fig. 2a). When the RSCU value > 1.00, it means that the use of a codon is more frequent than expected, and vice versa. The clustered heat map of RSCU values of fourteen sequenced Zingiber chloroplast genomes was conducted by R v3.6.3 (https:// www.R-proje ct. org.) (Fig. 2b). To predict possible RNA editing sites in the twenty chloroplast genomes, protein coding genes were used to predict potential RNA editing sites using the online program Predictive RNA Editor for Plants (PREP) suite (http:// prep. unl. edu/) with a cut of value of 0.8.

Analyses of SSRs and long repeats
Chloroplast SSR has high variation level within the same species and is an important source for developing molecular markers, which are widely used in phylogenetic and population genetic analysis [71]. MIcroSAtellite (MISA) (http:// pgrc. ipk-gater sleben. de/ misa/) was used to detect the simple sequence repeat (SSRs or microsatellites) motifs in fourteen sequenced chloroplast genomes with the settings as follows: 8 for mono-, 5 for di-, 4 for tri-, and 3 for tetra-, pena-, and hexa-nucleotide SSRs (Fig. 3). The REPuter software was employed to identify long repeats such as forward, palindrome, reverse and complement repeats. The criteria for determining long repeats were as follows: (1) a minimal repeat size of more than 30 bp; (2) a repeat identity of more than 90%; and (3) a hamming distance equal to 3 (Fig. 4).

Genome comparison and nucleotide variation analysis
To detect the contractions and expansions of the IR regions in the chloroplast genomes of the Zingiber, 20 whole genomes within Zingiber were compared (Fig. 5). The online software mVISTA tool with the Shufe-LAGAN mode [72] was used to make pairwise alignments among these 20 whole chloroplast genomes with the annotated chloroplast genome of Z. cochleariforme as reference (Fig. 6). The 20 chloroplast genomes of Zingiber were first aligned using MAFFT v7 [73] and then manually adjusted using BioEdit v7.0.9 [74]. DnaSP v5.10 software [75] was used to calculate the nucleotide variability (Pi) of the 20 chloroplast genomes within the Zingiber, with a sliding window analysis with the step size and window length set as 200 bp and 800 bp (Fig. 7).

Positive selection analysis
To identify the genes under selection, we scanned the chloroplast genomes of fourteen species within Zingiber using the software EasyCondeML [76]. The software was used for calculating the non-synonymous (dN) and synonymous (dS) substitution rates, along with their ratios (ω = dN/dS). The analyses of selective pressures were conducted along the ML tree of these fourteen species in Newick format. Each single-copy CDS sequences was aligned according to their amino acid sequence. The sitespecific model with five site models (M0, M1a & M2a, M7 & M8) were employed to identify the signatures of adaptation across chloroplast genomes. This model allowed the ω ratio to vary among sites, with a fixed ω ratio in all the branches. The site-specific model, M1a (nearly neutral) vs. M2a (positive selection) and M7 (β) vs. M8 (β & ω) were calculated in order to detect positive selection [77]. Likelihood ratio test (LRT) of the comparison (M1a vs. M2a and M7 vs. M8) was used to evaluate of the selection strength respectively and the p value of