Transcriptome analysis of Clinopodium gracile (Benth.) Matsum and identification of genes related to Triterpenoid Saponin biosynthesis

Background Clinopodium gracile (Benth.) Matsum (C. gracile) is an annual herb with pharmacological properties effective in the treatment of various diseases, including hepatic carcinoma. Triterpenoid saponins are crucial bioactive compounds in C. gracile. However, the molecular understanding of the triterpenoid saponin biosynthesis pathway remains unclear. Results In this study, we performed RNA sequencing (RNA-Seq) analysis of the flowers, leaves, roots, and stems of C. gracile plants using the BGISEQ-500 platform. The assembly of transcripts from all four types of tissues generated 128,856 unigenes, of which 99,020 were mapped to several public databases for functional annotation. Differentially expressed genes (DEGs) were identified via the comparison of gene expression levels between leaves and other tissues (flowers, roots, and stems). Multiple genes encoding pivotal enzymes, such as squalene synthase (SS), or transcription factors (TFs) related to triterpenoid saponin biosynthesis were identified and further analyzed. The expression levels of unigenes encoding important enzymes were verified by quantitative real-time PCR (qRT-PCR). Different chemical constituents of triterpenoid saponins were identified by Ultra-Performance Liquid Chromatography coupled with quadrupole time-of-flight mass spectrometry (UPLC/Q-TOF-MS). Conclusions Our results greatly extend the public transcriptome dataset of C. gracile and provide valuable information for the identification of candidate genes involved in the biosynthesis of triterpenoid saponins and other important secondary metabolites.

Squalene synthase (SS; EC 2.5.1.21) is a key bifunctional enzyme in the terpenoid biosynthesis pathway; SS first catalyzes the formation of presqualene diphosphate (PSPP) from two molecules of FPP and then converts PSPP to squalene with NADPH and Mg 2+ ions [13,14]. The level of SS gene expression shows a positive correlation with the amount of triterpenes [15]. Although SS cDNAs have been cloned and analyzed from a number of herbal plant species, such as Glycyrrhiza glabra [16], Siraitia grosvenorii [17], and Lotus japonicas [18], no sequence or structural information is available on the SS gene in C. gracile.
RNA sequencing (RNA-Seq) analysis has been used to capture both coding and non-coding sequences and to quantify gene expression not only in a heterogeneous mixture of cells, tissues, and organs but also at the whole organism level [19]. Thus, RNA-Seq is a powerful tool for acquiring gene expression information and is of great significance in the mining of functional genes, analysis of gene expression profiles, and discovery of genetic metabolic networks [20]. At present, transcriptome sequencing has been applied to medicinal plants of the Labiatae family, including Scutellaria baicalensis Georgi [21], Ocimum sanctum and Ocimum basilicum [22], Mentha piperita and Mentha arvensis [23], and Clinopodium chinense [24]; however, the use of RNA-Seq analysis has not been reported in C. gracile to date.
In this study, we performed deep transcriptome analysis of four different tissues (flowers, leaves, roots, and stems) of C. gracile plants and identified genes potentially involved in triterpenoid saponin biosynthesis. This work lays a foundation for further exploration of the molecular mechanism of triterpenoid saponin biosynthesis in C. gracile.

Results
Total saponin content in different tissues of C. gracile Saponins were extracted from approximately 0.1 g dried powder of flowers, leaves, stems, and roots of C. gracile. Total saponin content was the highest in leaves (0.29%), followed by stems (0.23%), flowers (0.21%), and roots (0.18%), with a standard error of 0.0005, 0.0004, 0.0033, and 0.0011, respectively. The results of analysis of variance (ANOVA) of C. gracile data showed that the differences among the total saponin content of the four tissues were statistically significant (Fig. 1, Additional file 1: Table S1).

Analysis of triterpenoid saponins in C. gracile by UPLC/Q-TOF-MS
Qualitative analysis of triterpenoid saponin metabolites of C. gracile was conducted by UPLC/Q-TOF-MS. The results confirmed the presence of Buddlejasaponin IV in C. gracile. Additionally, Saikosaponin a, Clinoposaponin III, and Clinoposaponin V probably also exist in C. gracile, according to the retention time (t R ), maximum ultraviolet absorption wavelength (λmax), molecular ion peak, and ESI-MS data (Fig. 2, Additional file 2: Table  S2).
RNA sequencing and the de novo assembly of C. gracile transcriptome RNA-Seq analysis of C. gracile flowers, leaves, stems, and roots generated 43.81 Gb of high quality reads. Fulllength transcripts were reconstructed, and a total of 128, 856 unigenes, which refer to a uniquely assembled transcript or a cluster of genes that perform a particular function, were generated using Trinity and TGICL, with a mean length of 1354 bp and an N50 value of 2268 bp. Of the 128,856 unigenes, 47.37% (61,043 unigenes) were longer than 1000 bp, and 34.87% (44,929 unigenes) were longer than 1500 bp (Additional file 3: Figure S1).  Fig. 1 The total saponin content of leaves, stems, flowers, and roots of Clinopodium gracile (Benth.) Matsum plants. The X-axis indicates the type of C. gracile tissues, and the Y-axis indicates the total saponin content. Significant differences were determined by analysis of variance (ANOVA) in at least one public database, while a total of 25,975 (20.16%) unigenes were co-annotated in five databases (Table 1). Species distribution analysis showed that C. gracile unigenes showed the highest homology to Sesamum indicum sequences (40,961 unigenes; 43.81%), followed by Erythranthe guttata (14,690 unigenes; 15.71%) and Daucus carota subsp. sativus (8866 unigenes; 9.48%) (Additional file 4: Figure S2).

Functional classification and expressed overview of unigenes
The GO database divided 57.72% of unigenes into "cellular component", 79.93% into "molecular function", and 49.12% into "biological process", and 50,235 unigenes were categorized into at least one GO term. Furthermore, "membrane part" (15, [25,26]. The overall expression level of unigenes was the highest in flowers, followed by stems, leaves, and roots (Fig.  3b).

Identification of candidate genes involved in the triterpenoid biosynthesis pathway
Gene function annotation using the KEGG database assigned 73,648 unigenes to 136 pathways (20 subcategories), 14 of which were related to the biosynthesis of secondary metabolites (Additional file 6: Figure S4 and Additional file 7: Table S3). A total of 1406 unigenes were enriched in the phenylpropanoid biosynthesis pathway (Fig. 4a). Terpenoid metabolism    involves six pathways; the largest number of unigenes (360) mapped to "Terpenoid backbone biosynthesis," followed by "Carotenoid biosynthesis," "Ubiquinone and other terpenoid quinone biosynthesis," "Diterpenoid biosynthesis," and "Sesquiterpenoid and triterpenoid biosynthesis" (Fig. 4b).
Three unigenes (CL12163-1, CL12163-4, CL6527-11) encoding the SS enzyme were identified, with a sequence identity of 88.14%. We analyzed the structure of the CL12163-4 unigene, as it showed the highest expression in leaves. The CgSS (C. gracile squalene synthase) ORF with 1272 bp is expected to encode a protein of 423 amino acids. The amino acid sequence of CgSS showed high sequence similarity to SS proteins from other herbs, namely, Bacopa monniera (ADX01171.1; 83.02%), Eleutherococcus senticosus (AEA41712.1; 77.91%), Panax ginseng (ACV88718.1; 78.14%), Panax notoginseng (ABA29019.1; 77.91%), and Panax quinquefolius (CAJ58418.1; 77.91%). In the multiple sequence alignment of these SS amino acid sequences, six domains (I-VI) and two aspartate-rich regions, important for the catalytic activity of SS enzymes, are highlighted. Two domains (III and IV) of SS amino acid sequences were highly conserved among plant species, three domains (I, II, and V) were fairly well conserved, whereas domain VI was the least conserved. The secondary structure of CgSS contained 18 alpha helixes, which were the main component of the SS enzyme (Fig. 6). The 3D model of CgSS was constructed on the basis of the crystal structure of SS from Trypanosoma cruzi (PDB ID 3wca.3.A), which shares 46.02% sequence identity with CgSS [27]. Five domains (I-V) of CgSS are highlighted in different colors in Fig. 7a, and amino acids D77, D81, Y168, D213, and D217 comprising the active site of CgSS are indicated in Fig. 7b.

Validation of RNA-Seq data by qRT-PCR
To independently verify the transcriptome data and differential gene expression among different tissues, six unigenes involved in triterpenoid saponin biosynthesis were selected for qRT-PCR analysis. The expression levels of Un 41,982 (DXR), CL10352-1 (HMGR), and Un 5223 (HMGR) were the highest in roots. Similarly, Fig. 5 Proposed pathways for triterpenoid saponin biosynthesis in C. gracile. Expression levels of unigenes encoding enzymes that catalyze each step of the triterpenoid saponin biosynthesis pathway are shown. "CL" and "Un" indicates a cluster of transcripts and unigenes, respectively. F, flowers; L, leaves; S, stems; R, roots. The higher and lower expression level of unigenes are indicated in red and green, respectively the expression levels of CL12163-4 (SS), and CL1648-1 (HDR) were the highest in leaves, while that of Un 17, 275 (HMGR) was the highest in flowers. These data were consistent with the FPKM values of these genes determined from RNA-Seq analysis (Fig. 8).

Discussion
Triterpenoid saponins are the major bioactive compounds in C. gracile, with extensive pharmacological effects. However, genomic information on saponin biosynthesis is limited. In this study, we aimed to identify putative genes involved in triterpenoid saponin biosynthesis in C. gracile. Therefore, we conducted deep transcriptome sequencing on different tissues of C. gracile plants using RNA-Seq. Analysis of RNA-Seq data revealed 128,856 unigenes, with a mean length of 1354 bp and an N50 value of 2268 bp. Of the 128,856 unigenes, 99,020 were annotated, whereas 23.15% of the unigenes Fig. 9 Expression of unigenes in all four tissues of C. gracile. a The number of unigenes expressed in four tissues by veen diagram. b Number of DEGs identified in the leaves compared with roots, flowers, and stems of C. gracile. Genes highly expressed in leaves compared with roots, flowers, and stems were defined as "up-regulated," whereas genes with lower expression levels in leaves were defined as "down-regulated"  [30]. These results demonstrate the reliability of the C. gracile transcriptome.
We performed transcriptome analysis of the flowers, leaves, roots, and stems of C. gracile to facilitate the dissection of genes involved in the tissue-specific biosynthesis of triterpenoid saponins. This approach is widely used for the identification of novel genes involved in the biosynthesis of secondary metabolites and analysis of the molecular mechanism of triterpenoid saponin biosynthesis [31].
In the transcriptome data sets of C. gracile, the upregulated unigenes in leaves were mainly enriched in "Terpenoid backbone biosynthesis" and "Sesquiterpenoid and triterpenoid biosynthesis" in the KEGG database and were mainly annotated under "metabolic process" and "catalytic activity" in the GO database. These upregulated genes may explain the molecular basis of the medicinal value of C. gracile leaves.
The sponin content of different C. gracile tissues was determined by UV-spectrophotometry, with buddlejasaponin IV as a standard, which was identified and confirmed by UPLC/Q-TOF-MS. Saponins include steroid and triterpenoid saponins in most of plant species. Triterpenoid saponins are generally predominant in dicotyledons, while steroidal saponins are mainly identified in monocotyledons [32]. Triterpenoid saponins are some of the most important components of C. gracile, although the relationship between the expression of triterpenoid biosynthesis genes and accumulation of triterpenoid saponins in C. gracile is unclear. In this study, UPLC/Q-TOF-MS analysis was performed to identify the different constituents of triterpenoid saponins. UV-spectrophotometry analysis confirmed that the content of total saponin in C. gracile was the highest in leaves compared with other tissues. Higher expression levels of some of the unigenes encoding SS, HDR, HDS, DXS, and DXR enzymes in leaves compared with stems, flowers, and roots were consistent with the higher saponin accumulation in C. gracile leaves ( Fig. 1  and Fig. 5). This suggests that overexpression of genes encoding these key enzymes could increase the production of triterpenoid saponins in C. gracile plants. A similar approach, i.e., overexpression of the SS gene has been previously used to enhance the production of total saponins in Medicago truncatula [33].
Roots showed the highest expression of unigenes encoding most of the key enzymes involved in the biosynthesis of triterpenoid saponins (Fig. 5); however, this result contradicts the high levels of triterpenoid saponins in leaves. Expression levels of genes do not always correlate with enzyme activity [34]. Post-transcriptional modifications influence features of mRNAs, thus affecting protein synthesis. For example, in a previous study, the mismatch between ginsenoside content and HMGR gene transcript levels was reportedly affected by post-transcriptional modifications and feedback regulation [35]. Post-translational modifications such as phosphorylation and ubiquitination can also determine the rate of protein degradation [36].
Multiple sequence alignment of SS amino acid sequences suggested the presence of six domains (I-VI) required for the functional activity of the enzyme. Domains I, II, and III are involved in the first step of SS biosynthesis; domain IV participates in the second step; domain VI is a hydrophobic region that binds to the endoplasmic reticulum. The Mg 2+ -binding aspartate-rich regions and amino acids in the active site (Asp77, Asp81, Tyr168, Asp213, and Asp217) were highly conserved in SS amino acid sequences among different plant species [37][38][39]. Previously, it has been shown that overexpression of the SS (AB115496) gene significantly increases the ginsenoside content [40] in Panax ginseng. Therefore, our findings will be beneficial for further investigation into the role of SS in triterpenoid saponin biosynthesis.
TFs are sequence-specific DNA-binding proteins that modulate target gene expression. In the C. gracile transcriptome, only one unigene (Un 15,683) encoding an MYB TF was annotated to play a role in the metabolism of terpenoids. Previous studies have shown that the overexpression of MYB9b enhances tanshinone concentration in Salvia miltiorrhiza hairy roots [41], and MYB1 increases artemisinin content and trichome proliferation in Artemisia annua [42]. Additionally, eight unigenes encoding MYB TFs were mapped to "metabolism of biosynthesis of other secondary metabolites" in C. gracile. Overexpression of the gene encoding MYB1 and MYB2 TFs enhances the synthesis and accumulation of anthocyanins in Fagopyrum tataricum [43]. Moreover, MYB5, MYB26, and MYB31 participate in flavonoid biosynthesis in Ginkgo biloba under adverse conditions [44]. These TFs may also play a significant role in the regulation of triterpenoid saponins and other secondary metabolites in C. gracile.

Conclusion
Overall, this is the first report of a comprehensive RNA-Seq analysis of C. gracile tissues to identify genes involved in the biosynthesis of triterpenoid saponins. The transcriptome data will facilitate further examination of the molecular mechanism of triterpenoid saponin biosynthesis in C. gracile and promote the study of C. gracile functional genomics.  Fig. 12 The number of unigenes encoding TFs involved in the metabolic pathways in C. gracile. a TF families and unigene numbers related to the metabolism of terpenoids. b TF families and unigene numbers related to "metabolism of biosynthesis of other secondary metabolites"

Plant material and RNA isolation
Ten plants of C. gracile (height =~18 cm) were harvested from the herbal garden of the Anhui University of Chinese Medicine under the permission of managers and professionals on April 18, 2018 (Additional file 9: Figure S5), which were identified by Prof. Qingshan Yang (Anhui University of Chinese Medicine). Flower, leaf, root, and stem tissues were separated from three C. gracile plants, cleaned with ultrapure water, dried on filter paper, and then frozen in liquid nitrogen. Total RNA was isolated from four tissues (flowers, leaves, roots, and stems) of C. gracile plants in three replicates using the ethanol precipitation protocol and RNA Plant Kit (Aidlab Biotech, Beijing, China), according to the manufacturer's instructions. The quality and quantity of total RNA were verified using NanoDrop 8000 and Agilent 2100 Bioanalyzer (Thermo Fisher Scientific, MA, USA), respectively (Additional file 10: Table S5).

Determination of total saponin content
Saponins were extracted from dried C. gracile tissues (flowers, leaves, roots, and stems) and detected, as described previously [45], with slight modifications. Briefly, dried powder (0.1 g) of each tissue was extracted in 2 ml of 50% methanol by ultrasonication for 30 min. The filtrate obtained by vacuum filtration was concentrated at 60°C to near dryness and then resuspended in 25 ml of 50% methanol. The absorbance of the solution was determined by UV-spectrophotometry (JASCO Company, Japan). A curve of concentration versus absorbance was established using buddlejasaponin IV, as a standard, to determine the saponin content of C. gracile tissue (Additional file 11: Figure S6a and b). Total saponin yield (%) was calculated as follows: where A and m represent the solution absorbance of sample and weight of sample dried powder, respectively.

UPLC/Q-TOF-MS analysis
Triterpenoid saponin metabolite profiling was performed on a Waters ACQUITY UPLC /Xevo G2 Q-TOF (Waters Corporation, Milford, MA, USA). Samples were separated on Agilent ZORBAX Eclipse Plus C18 (2.1 mm × 50 mm, 1.8 μm). The column temperature was maintained at 35°C, and the flow rate was 0.20 mL/min. The mobile phase consisted of acetonitrile and 0.1% formic acid. The gradient was initiated with 10% acetonitrile for 5 min, 10% acetonitrile linearly increased to 20% acetonitrile within 1 min and held at 20% for 4 min, then 20% acetonitrile increased to 30% acetonitrile within 1 min and held at 30% for 4 min, and 30% acetonitrile increased to 45% acetonitrile within 1 min and held at 45% for 4 min. The Xevo G2-S Q-TOF mass spectrometer was run in the negative mode. Mass spectra were obtained with a scanning mass range of 50 to 1500 Da. High-purity nitrogen (N 2 ) was used as nebulizing gas, and ultra-high pure helium (He) was used as the collision gas. Other parameters were as follows: Capillary voltage, 2.00 kV; sampling cone voltage, 40.0 V; ion source temperature, 120°C; desolvation temperature, 350°C; cone gas flow, 50 L/h; desolvation gas flow rate, 600 L/h; collision energy (CE), 40-60 V; and leucine enkephalin was used as lock mass (m/z 554.2615).
cDNA library preparation and RNA-Seq analysis Total RNA isolated from plant tissues was treated with RNase-free DNase I (TaKaRa, China) to eliminate any traces of contaminating DNA, which was then mixed with oligo (dT) magnetic beads to purify mRNA. The purified mRNA was fragmented at an appropriate temperature. First-strand cDNA was generated by reverse transcription PCR (RT-PCR) using random hexamer primers. This was followed by second-strand cDNA synthesis. Subsequently, A-Tailing Mix (Enzymatics, USA) and RNA Index Adapters were added to perform end-repair, and the resulting cDNA was amplified by PCR. The PCR products were purified by Ampure XP Beads and eluted in EB solution. The quantity and quality of cDNA libraries were evaluated using Agilent 2100 Bioanalyzer (ABI, New York, NY, USA). The double-stranded PCR products were heated denatured and circularized by the splint oligo sequence to obtain the final library. The single-stranded circular DNA was amplified using phi29 (Thermo Scientific, USA) to generate DNA nanoball (DNB). The DNBs were loaded into the patterned nanoarray and single-end 50 bp reads were generated on the BGISEQ-500 platform (Beijing Genomics Institute, Wuhan, China) [46].

De novo transcriptome assembly
To obtain high-quality reads, low quality reads (i.e., reads with > 20% nucleotides with base quality < 10) and reads with adaptor sequences and/or unknown nucleotides (> 5%) were filtered using SOAPnuke [47] and Trimmomatic. De novo assembly of the high-quality reads was performed using Trinity [48]; PCR duplicates were removed prior to assembly to improve assembly efficiency. The assembled transcripts were clustered using the TGI clustering tool (TGICL) [49] to remove redundant sequences and obtain non-redundant sequences, termed unigenes.

Differentially expressed genes analysis
High-quality reads of each sample were queried against the genome sequence of C. gracile using Bowtie2 (version 2.2.5) [53], and gene expression level in each sample was calculated using RSEM (version 1.2.8) [54]. Unigenes showing significant differences in expression levels (fold-change [FC] ≥ 2.00; false discovery rate [FDR] ≤ 0.001) between leaves and other tissues (root, flowers, and stems) were identified by the PoissonDis method [55][56][57]. These DEGs were then used for GO functional analysis and KEGG pathways enrichment.

Analysis of the key genes expression level in triterpenoid saponin biosynthesis by qRT-PCR
To validate the C. gracile transcriptome data sets, qRT-PCR was performed in triplicate using GoTaq qPCR Master Mix (Promega) on a Real-time Thermal Cycler 5100 System (Thermo Scientific, Waltham, MA, USA).