Transcriptome analysis of Panax vietnamensis var. fuscidicus discovers putative ocotillol-type ginsenosides biosynthesis genes and genetic markers

P. vietnamensis var. fuscidiscus, called “Yesanqi” in Chinese, is a new variety of P. vietnamensis, which was first found in Jinping County, the southern part of Yunnan Province, China. Compared with other Panax plants, this species contains higher content of ocotillol-type saponin, majonoside R2. Despite the pharmacological importance of ocotillol-type saponins, little is known about their biosynthesis in plants. Hence, P. vietnamensis var. fuscidiscus is a suitable medicinal herbal plant species to study biosynthesis of ocotillol-type saponins. In addition, the available genomic information of this important herbal plant is lacking. To investigate the P. vietnamensis var. fuscidiscus transcriptome, Illumina HiSeq™ 2000 sequencing platform was employed. We produced 114,703,210 clean reads, assembled into 126,758 unigenes, with an average length of 1,304 bp and N50 of 2,108 bp. Among these 126,758 unigenes, 85,214 unigenes (67.23%) were annotated based on the information available from the public databases. The transcripts encoding the known enzymes involved in triterpenoid saponins biosynthesis were identified in our Illumina dataset. A full-length cDNA of three Squalene epoxidase (SE) genes were obtained using reverse transcription PCR (RT-PCR) and the expression patterns of ten unigenes were analyzed by reverse transcription quantitative real-time PCR (RT-qPCR). Furthermore, 15 candidate cytochrome P450 genes and 17 candidate UDP-glycosyltransferase genes most likely to involve in triterpenoid saponins biosynthesis pathway were discovered from transcriptome sequencing of P. vietnamensis var. fuscidiscus. We further analyzed the data and found 21,320 simple sequence repeats (SSRs), 30 primer pairs for SSRs were randomly selected for validation of the amplification and polymorphism in 13 P. vietnamensis var. fuscidiscus accessions. Meanwhile, five major triterpene saponins in roots of P. vietnamensis var. fuscidicus were determined using high performance liquid chromatography (HPLC) and evaporative light scattering detector (ELSD). The genomic resources generated from P. vietnamensis var. fuscidiscus provide new insights into the identification of putative genes involved in triterpenoid saponins biosynthesis pathway. This will facilitate our understanding of the biosynthesis of triterpenoid saponins at molecular level. The SSR markers identified and developed in this study show genetic diversity for this important crop and will contribute to marker-assisted breeding for P. vietnamensis var. fuscidiscus.


Background
Ginsenosides are triterpenoid saponins found exclusively in Panax species belong to Araliaceae family. The Panax genus comprises approximately 14 species, more than 150 naturally occurring ginsenosides have been isolated from different parts of plants [1] and most of the saponins possess four types of aglycone moieties, i.e. protopanaxadiol, protopanaxatriol, ocotillol, and oleanolic acid types. The most widely used Panax species, such as P. ginseng, P. quinquefolium, and P. notoginseng mainly contain protopanaxadiol-type and protopanaxatriol-type saponins, the other species like P. japonicus and P. zingiberensis, contain a large amounts of oleanolic acid saponins [2,3], all of them do not have or only a small amount of ocotillol-type saponins. Up to now, only one species, P. vietnamensis have been found particularly accumulates surprisingly high content of ocotillol-type saponins, mainly majonoside R 2 , which is as high as 5.3% of the dried rhizome and exhibited anti-tumor and hepatocytoprotective activities [4][5][6].
There are two kinds of pathways that form ocotillol. In pathway A, ocotillol might be biosynthesized via epoxidation of the double bond at C-24-C-25 of protopanaxatriol [1]. The enzyme catalyzes this reaction could be the ortholog of squalene epoxidase (SE) gene, because they epoxidized the similar double bonds of squalene or protopanaxatriol ( Figure 1). In pathway B, OS is further epoxidized to 2, 3; 22, 23-dioxidosqualene (DOS) by SE [23], followed by cyclization and hydroxylation to produce ocotillol, catalyzed by OSC and CYP450, respectively ( Figure 1). In Arabidopsis, OSC, lupeol synthase (LUP1), directly converts DOS to oxacyclic triterpenoid epoxydammarane [23], so there might be similar OSC in P. vietnamensis, which catalyzed the cyclization of DOS ( Figure 1). P. vietnamensis var. fuscidiscus, called "Yesanqi" in Chinese, is a new variety of P. vietnamensis, which was first found in Jinping County, the southern part of Yunnan Province, China [24]. P. vietnamensis var. fuscidiscus contains a higher content of majonoside R 2 than other genotypes of P. vietnamensis [2]. Therefore, P. vietnamensis var. fuscidiscus is a perfect plant species for studying the biosynthesis mechanism of ocotillol-type saponins. Interestingly, P. vietnamensis var. fuscidiscus was also found in Yuanyang and Lvchun County, Honghe prefecture of Yunnan Province and some of them are found for more than 15 years and exhibited remarkable disease resistance under the high temperature and rainy conditions in this district, suggested that this specie could be used to improve disease resistance of P. notoginseng, an important cultivated Panax species in Yunnan Province of China.
Our goal of this study is to characterize the transcriptome of P. vietnamensis var. fuscidiscus using Illumina HiSeq™ 2000 sequencing platform, to discover the candidate genes that encode enzymes in the triterpene saponin biosynthetic pathway, especially in ocotillol-type saponins biosynthesis, and produce information on SSR markers to facilitate the marker-assisted breeding of this species.

Results and discussion
Illumina sequencing and de novo assembly P. vietnamensis var. fuscidiscus root tissue was used for transcriptome sequencing and analysis because root organs have been used for medicinal purpose. A cDNA library was constructed from total RNA of P. vietnamensis var. fuscidiscus roots, and sequenced using Illumina paired-end sequencing technology. After removal of adaptor sequences, ambiguous reads and low-quality reads (Q20 < 20), a total of 114,703,210 clean reads were obtained. The Q20 percentage (sequencing error rate < 1%) and GC percentage were 97.23% and 43.25%, respectively. An overview of the sequencing and assembly statistics are shown in Table 1. The high quality reads obtained in this study have been deposited in the NCBI SRA database (accession number: SRA146484).
All the clean reads (114,703,210) were de novo assembled using the Trinity program into 161,443 contigs consisting of 218,944,221 bp. The size of the contigs ranged from 201 to 15,880 bp, with a mean length of 1,356 bp and N50 length of 2,087 bp. Among these contigs, 82,699 (51.23%) were longer than 1000 bp, and 46,915 (29.06%) contigs were shorter than 500 bp. Using paired-end joining and gap-filling methods, these contigs were further assembled into126,758 unigenes with an average length of 1,304 bp and an N50 length of 2,108 bp. There were 60,741 unigenes (47.92%) longer than 1,000 bp, and 28,676 unigenes (22.62%) longer than 2,000 bp ( Figure 2). In this study, the coding sequences (CDS) from all P. vietnamensis var. fuscidiscus unigene sequences were also detected and a total of 84,004 CDSs were obtained, among them, 24,580 CDSs (29.26%) were longer than 1,000 bp ( Figure 2).
To evaluate the quality of the assembled unigenes, all the usable sequencing reads were realigned to the unigenes using SOAPaligner [25], with up to 2 mismatches allowed. The sequencing depth ranged from 0.0173 to 50,534 fold, with an average of 50.67 fold. About 86.73% of the unigenes were realigned by more than 10 reads, 31.60% were supported by more than 100 reads, and 7.73% were supported by more than 1,000 reads (Additional file 1). In order to assess the extent of transcript coverage provided by unigenes and to evaluate how coverage depth affected the assembly of unigenes, we plotted the ratio of assembled unigene length to P. notoginseng orthologs length against coverage depth (Additional file 2A). Although many of the deeply sequenced P. vietnamensis var. fuscidiscus unigenes failed to cover the complete coding regions of their P. notoginseng orthologs, most of P. notoginseng orthologs coding region can be covered by corresponding unigenes. To certain extent, increased coverage depth can result in higher coverage of the coding regions. The percentage of P. notoginseng orthologs coding sequence covered by all P. vietnamensis var. fuscidiscus unigenes was also performed. We found that 14,892 of the orthologs were covered by with a percentage of more than 80% and 3,252 of the orthologs were covered by unigenes with a percentage from 40% to 80%. Furthermore, 326 orthologs were covered with only 20% or lower (Additional file 2B).
Due to the lack of P. vietnamensis var. fuscidiscus reference geneome availability, the reads produced by Illumina HiSeq™2000 were assembled using the de novo assembler Trinity. In this study, the assembly results indicated that the length distribution pattern and mean length of contigs and unigenes was similar to those in the previous Illumina-transcriptome studies [26][27][28], suggesting that the transcriptome sequencing data from P. vietnamensis var. fuscidiscus are assembled well. Compared to previous transcriptomic studies in Panax species [18][19][20][21][22], we produced more numbers of unigenes, indicating that P. vietnamensis var. fuscidiscus genome is gene rich in comparison to Panax species.  (See figure on previous page.) Figure 1 Putative pathway for triterpene saponin biosynthesis. Putative pathway for triterpene saponin biosynthesis in P. vietnamensis var. fuscidicus. Two proposed pathways (A and B) for the biosynthesis of ocotillol-type saponins, mainlymajonoside R 2 in the horizontally grown rhizome (C) of P. vietnamensis var. fuscidicus (D . Furthermore, about 32.77% of unigenes (41,544) did not show any matches to known genes, these remaining unaligned unigenes may be considered as novel transcripts and specific genes from P. vietnamensis var.

fuscidiscus.
Our results showed that approximately 95% of unigenes over 1,000 bp in length had BLAST matches against the Nr database, whereas only 41% of unigenes with lengths shorter than 1,000 bp generated BLAST matches (Additional file 4A). The same tendency was also observed in BLAST results against the SwissProt database (Additional file 4B). The e-value distribution of the top hits in the Nr database revealed that 62.34% of the mapped unigenes showed significant homology (evalue < 10 −50 ), and 21.35% unigenes had high similarity (greater than 80%) (Additional file 5A and C). The evalue and similarity distributions of the top hits in the Swiss-Prot database had a comparable pattern with 47.51% and 13.48% of the sequences possessing significant homology and similarity, respectively (Additional file 5B and D). Our results also showed that 39.04% of the unigenes showed significant homology with gene sequences from Vitis vinifera (9,011, 19.07%), followed by Arabidopsis thaliana (11.60%), Glycine max (10.97%), and Medicago truncatula (9.14%) (Additional file 6).

Gene ontology classification
Based on the Nr annotation, Gene Ontology (GO) classification was used to classify the functions of all unigenes.

Conserved domain annotation and COG classification
The conserved domains/families of the assembled unigenes encoding proteins were searched against the Pfam database (version 26.0) using Pfam_Scan program. A total of 3,602 conserved domains/families were identified from 46,649 unigenes (36,80% of all unigenes) (Additional file 8). Among these protein domains/families, pentatricopeptide repeat domain (PPR) is the most abundant domain type, found in 2,822 unigenes. The PPR containing proteins are commonly found in the plants and although its function is still unclear, the PPR domain has been found in proteins involved in RNA editing in a number of recent studies [29][30][31][32]. Other highly represented domains/families were, WD repeat (2,520 unigenes), Protein kinase domain (2,468 unigenes), and Leucine Rich Repeat (2,449 unigenes). The WD repeat and Leucine Rich Repeat are involved in protein-protein interactions [33,34]. The role of protein kinase domain is found in signal transduction pathways, development, cell division, and metabolism in higher organisms [35,36]. Other domains identified abundantly included PPR repeat family (2,354 unigenes), RNA recognition motif (1,188 unigenes), Protein tyrosine kinase (1,035 unigenes), ABC transporter (488 unigenes), Mitochondrial carrier protein (462 unigenes), and Myblike DNA-binding domain (459 unigenes). For perspective, we have listed the top 20 most abundant protein domains/families in (Additional file 9).
All unigenes were subjected to a search against the COG database for functional prediction and classification. In total, 34,918 unigenes were annotated and grouped into 25 COG classifications. However, some of these unigenes were assigned to multiple COG classifications, altogether 63,521 COG functional annotations were obtained. Among the 25 COG categories, the cluster for general function prediction was the largest group (11,382,17.92%), followed by replication, recombination and repair (6,561, 10.33%), transcription (6,223, 9.80%), and signal transduction mechanisms (5,338, 8.40%) ( Figure 4).

SSR marker discovery
The potential SSRs were detected in all of the 126,758 assembled unigenes using MISA software. A total of 21,320 SSRs were identified in 17,780 unigenes (Table 3). Of all the SSR containing unigenes, 2,918 sequences contained more than one SSR, and 1,207 SSRs were present in compound form. The information of SSRs derived from all unigenes is shown in Additional file 11. Among these SSRs, the most frequent repeat motifs were di-nucleotides (11,197,52.53%), followed by trinucleotides ( The dendrogram constructed based on UPGMA (unweighted pair group method with arithmetic average) clustering method was used to perform genetic correlation analysis among the 13 P. vietnamensis var. fuscidiscus accessions ( Figure 6). The coefficients of genetic similarity among the 13 P. vietnamensis var. fuscidiscus accessions ranged from 0.82 to 0.94, indicating a high genetic similarity among them. UPGMA cluster analysis grouped these individuals into two groups at the similarity level of 0.836. According to the dendrogram, all the 3 accessions from Laos were clustered into cluster I. In cluster II, all the 10 accession of P. vietnamensis var. fuscidiscus from China were clustered into one group. The results of the cluster analysis showed that the individuals from the same area tend to clustered together. Therefore, the UPGMA cluster analysis based on SSR data was closely related to the geographical origins. Meanwhile, these results demonstrate that SSRs primer pairs derived from P. vietnamensis var. fuscidiscus unigenes can distinguish varieties without morphological diversities, and will be a powerful tool for genetic applications in this herb crop.
Majonoside R 2 is the main ginsenoside in P. vietnamensis var. fuscidiscus, so we focused on the discovery of the putative genes that might be involved in ocotilloltype ginsenoside biosynthesis. As mentioned above, the formation of ocotillol needs SE and OSC with "new" functions. Generally, SE catalyzes the epoxidation of squalene to OS in terpenoid biosynthesis, but in P. vietnamensis var. fuscidicus, SE might catalyze the epoxidation of terminal olefin of protpanaxatriol or OS ( Figure 1). Moreover, 15 unigenes matched to SE of other plants were discovered in our Illumina dataset. Using the primers designed based on the sequences of these SE unigenes, a full-length cDNA of three SE genes were obtained using reverse transcription PCR (RT-PCR), named PvfSE1, PvfSE2, and PvfSE3, respectively (GenBank: KJ946467, KJ946468 and KJ946469). The three cloned SE genes may play different roles in sterol or ginsenoside biosynthesis in this new variant (data not shown), and a series of relevant studies are currently underway to determine the function of them. Many genes encoding OSCs have been isolated in plants; including those encode β-AS, DS, lupeol synthase (LUS), and cycloartenol synthase (CAS) [16,[37][38][39]. Except for β-AS and DS, no unigene matched to LUS and CAS was found (Table 5). Characterizing the functions of these unigenes will help us for understanding the molecular mechanism of biosynthesis of ocotillol-type ginsenoside.

RT-qPCR analysis of the ginsenoside synthesis related genes
The RT-qPCR analysis was used to investigate the tissue-specific expression patterns of 10 unigenes related to ginsenoside biosynthesis in this species. The expression pattern of these genes is shown in Figure 9. The unigenes encoding HMGS, MVK, MVD, and IPPI were expressed at much higher level in young stems than in other tissues (lateral roots, root and leaves). The gene encoding SS was highly expressed in the leaves and stems. The HMGR gene showed very high expression in the leaf tissue. All genes mentioned above play a role in upstream biochemical reactions of the ginsenoside pathway, and showed high expression in leaves and young stems, which indicates that leaves and young stems are the main factories for synthesizing the precursors of ginsenosides. SE gene was involved in the formation of 2,3-oxidosqualene, a precursor of various ginsenosides.
To further identify the potential candidates from SE homologs involved in ginsenoside biosynthesis, the expression levels of three putative SE genes (SE1, SE2, and SE3) in different organs were analyzed. SE1 and SE3 genes were expressed much higher in young stems and  leaves than in other tissues, respectively (Figure 9). Whereas the expression level of SE2 was higher in the roots as compared to that of SE1 and SE3. These three putative SE genes have different expression patterns in different tissues, similar to what was found in previous studies [59]. Thus, we supposed that these three putative SE genes play different roles in ginsenosides biosynthesis. The gene encoding P6H was highly expressed in leaves and young stems than in roots and hairy roots. The P6H was predicted to catalyze protopanaxadiol to protopanaxatriol. A higher expression of P6H observed in leaves and young stems but protopanaxatriol-type ginsenosides accumulated mainly in roots and hairy roots, again indicating that leaves and young stems were the main synthesis site of the triterpene skeletons. The results demonstrate that several genes involved in ginsenoside biosynthesis showed diverse expression patterns in different tissues. The analysis of the expression patterns of these genes in different tissues will be helpful to further understand the mechanism of ginsenoside biosynthesis.
Quantitative analysis of five major triterpene saponins in roots of P. vietnamensis var. fuscidicus The content of main component is the most widely used indicator to measure the quality of herb, so quantitative analysis of main component has important practical significance. According to previous research [2], majonoside R2, ginsenoside Rg1, Rb1, Rd and notoginsenoside R1 are considered as the five main components of P. vietnamensis var. fuscidicus. Herein, the content of five major triterpene saponins in roots of P. vietnamensis var. fuscidicus was determined. High performance liquid chromatography with evaporative light scattering detector (HPLC-ELSD) was employed for quantitative analysis of majonoside R2, due to its low UV absorptivity. As shown in Figure 10A and B, the peak of majonoside R2 was identified by direct comparing the retention times of the peaks with those of the standard majonoside R2 eluted under the same conditions. The content of majonoside R2 in in roots of P. vietnamensis var. fuscidicus is about 68 mg/g, indicated that majonoside R2 were rich in the roots of this species. Quantitative analysis of other four triterpene saponins in the roots of this herb were performed using high performance liquid chromatography (HPLC). As shown in Figure 10C and D, the investigated saponins were well separated within 55 min. The content of ginsenoside Rg1, Rb1, notoginsenoside R1, and ginsenoside Rd in the roots of this herb were approximately 52.7, 17.9, 17.8 and 3.2 mg/g, respectively. The above results were approximately in accordance with previous studies [2], indicated that our quantitative results are reliable. We believe that these data will be useful for pharmacological evaluation and quality control of this new variety.

Conclusions
P. vietnamensis var. fuscidiscus exhibited remarkable disease resistance, and contains higher levels of ocotilloltype saponins. Thus, P. vietnamensis var. fuscidiscus is a suitable material for the study of ocotillol-type saponins biosynthesis and improvements of Panax plants.
Because of the fact that P. vietnamensis var. fuscidiscus is a newly discovered variety of P. vietnamensis, no genomic information was available for this species. This is the first study performed on transcriptome sequencing of P. vietnamensis var. fuscidiscus using Illumina nextgeneration sequencing. In total, 126,758 unigenes were obtained. The large number of transcripts provided in this study not only facilitates the study of ocotillol-type saponins biosynthesis but also could provide opportunities to engineer microorganisms for the de novo production of active ingredients. Furthermore, numerous SSRs were identified and will be very useful for markerassisted selection breeding of this herb.

Ethics statement
No specific permits were required for the described field studies. No specific permissions were required for these locations and activities. The location is not privatelyowned or protected in any way and the field studies did not involve endangered or protected species.

Plant material
Four-year-old P. vietnamensis var. fuscidiscus plants were collected from Jinping County, Yunnan province, southwest of China (Latitude: 22°47′ 38″N, Longitude: 103°2′ 22″E, Altitude: 1690 m), in May 2013. After morphological and molecular identification according the reference [24], the root tissues samples were collected separately from four randomly selected plant individuals. All samples were separately cut into small pieces, and parts of each sample were mixed with equivalent fresh weight (2 g) for RNA isolation. The remaining materials were used for SE gene cloning and RT-qPCR analysis. All samples were frozen immediately in liquid nitrogen and stored at −80°C until use.

RNA library construction and sequencing
The total RNA was extracted from the mixed sample by using Trizol reagent (Invitrogen, Camarillo, CA, USA), following RNA purification by RNeasy MiniElute Cleanup Kit (Qiagen, Hilden, Germany), according to the manufacture's protocol. The RIN (RNA integrity number) values of the isolated RNA were determined by using Agilent 2100 Bioanalyzer (Santa Clara, CA, USA). Samples with RIN of more than 8 were used for further analysis. The construction of the libraries and the RNA-Seq were performed by CapitalBio Corporation (Beijing, China). Firstly, poly (A) mRNA was purified from 20 μg of total RNA using Oligo(dT) magnetic beads. Then, mRNA was fragmented into smaller pieces (200-700 bp), which were used for first-strand cDNA synthesis with reverse transcriptase and random hexamerprimer. The second-strand cDNA was synthesized using buffer, dNTPs, RNaseH and DNA polymerase. The short double-stranded cDNA fragments are purified with QiaQuick PCR extraction kit (Qiagen, Hilden, Germany) and resolved with EB buffer. These cDNA fragments underwent an end-repair process and poly(A) was added and then ligated with the Illumina paired-end sequencing adaptors. Subsequently, Ligation products were purified with magnetic beads and separated by agarose gel electrophoresis. A range of cDNA fragments (200 ± 25 bp) were excised from the gel and selected for PCR amplification as templates. The cDNA library was constructed with a fragment-length range of 200 bp (±25 bp). Final, the cDNA library was sequenced on a paried-end flow cell using Illumina HiSeq™ 2000 platform.

Transcriptome data processing and assembly
Before assembly, raw reads with adaptors and unknown nucleotides above 5% or those that were of low quality (containing more than 50% bases with Q-value ≤ 20) were removed to obtain clean reads using a custom Perl script. Then the clean reads were de novo assembled using Trinity program [60] with default parameters. First, clean reads with a certain length of overlap were combined to form longer fragments without N, which were called contigs. These clean reads were then mapped back to corresponding contigs with paired-end reads to detect contigs from the same transcript as well as the distances between contigs, and their paired-end information was also used to fill gaps or to extend the sequences. Finally, these resultant sequences were clustered to remove redundant sequences using the TIGR gene Indices clustering tools (TGICL) [61] to form longer sequences without N and cannot be extended on either end. Such sequences are defined as unigenes.  [63] using BLASTX at an evalues of less than 1e −10 . A Perl script was used to retrieve KEGG Orthology (KO) information from blast result and then established pathway associations between unigenes and database. Based on the results of Nr database annotation, we use Blast2GO program [64] to perform GO annotation of unigenes. After acheiving GO annotation for every unigene, WEGO [65] software was used to perform GO classification and to draw GO tree. Moreover, the conserved domains/families of the assembled unigenes encoding proteins were searched against the Pfam database (version 26.0) [66] using Pfam_Scan script.

Functional annotation and prediction of CDS
The coding sequence (CDS) for unigene was predicted by BlastX and ESTscan. The unigene sequences were searched against the Nr, COG, KEGG and Swiss-Prot protein databases using BLASTX (e-value <10 − 5 ). Unigenes aligned to a higher priority database will not be aligned to lower priority database. The best alignment results were used to determine the sequence direction of unigenes. When a unigene could not be aligned to any database, ESTScan [67] program was used to predict coding regions and determine sequence direction.

SSR detection and primer design
Potential SSR markers were detected among the 126,758 unigenes using the MISA tool (http://pgrc.ipk-gatersleben.de/misa/). We searched for SSRs with motifs ranging from mono-to hexa-nucleotides in size. The minimum of repeat units were set as follows: ten repeat units for mono-nucleotide, six for di-nucleotides, and five for tri-, tetra-, penta-and hexa-nucelotides. Primer pairs were designed using Primer3 (http://bioinfo.ut.ee/ primer3-0.4.0/primer3/) with default parameters.

Survey of SSR polymorphism
A total of 30 primer pairs (Additional file 17) were randomly selected to evaluate their application and the polymorphism across 13 P. vietnamensis var. fuscidiscus accessions (Additional file 18). Total DNA was isolated from P. vietnamensis var. fuscidiscus leaves using the CTAB method. PCR amplifications were conducted in a final volume of 20 μL containing 1 μL 2.5 mM dNTPs, 1 μL EasyTaq DNA polymerase (Beijing TransGen Biotech Co., Ltd. China), 2 μL 10 × EasyTaq buffer, 1 μL of each primer (10 μM), 13 μL ddH 2 O, and 1 μL template DNA (approx. 10 ng/μL). PCR was performed as follows: initial denaturation at 94°C for 2 min, followed by 35 cycles of denaturation for 30 s at 94°C, annealing for 30 s at different Tm depending on the gene, extension for 30 s at 72°C, and a final step of elongation at 72°C for 5 min. The separation of alleles was performed on 8% polyacrylamide gel. PCR products were mixed with an equal volume of loading buffer. The mixture was denatured at 95°C for 5 min before loading onto the gel.

Data collection and analysis
The presence of each single band was coded as 1 and its absence as 0 in a data matrix. Base on the binary data matrix, popgene program version 1.32 [68] was use to calculate genetic variation parameters, including observed number of alleles (No), effective number of alleles (Ne), Shannon's information index (I), number of polymorphic loci (NP) and percentage of polymorphic loci (PPB). Allelic data were used to calculate the polymorphism Information Content (PIC) of each SSR marker by using the formula: PIC =1-∑pi 2 (pi is the frequency of i th allele for each locus) [69]. By NTSYS pc 2.1 program [70], Jaccard's genetic similarity coefficients were calculated and dendrogram of the 13 P. vietnamensis var. fuscidiscus accessions was constructed by the UPGMA (un-weighted pair group method with arithmetic mean) clustering method.

Full-length cDNA cloning of putative SE genes
Total RNA was reverse transcribed to synthesize first strand cDNA using oligo dT primer and a PrimeScript T M II 1st Strand cDNA Synthesis Kit (TaKaRa, Dalian, China) according to the manufacturer's instructions. The RT-PCR products were used as template for cloning of PvfSE1, PvfSE2 and PvfSE3. The full-length cDNA sequences of PvfSE1, PvfSE2 and PvfSE3 were obtained from our transcriptome data. The specific primers (Additional file 19) used for the amplification of these genes were designed using primer3 program based on based on the predicted cDNA sequences and were then synthesized. PCRs were conducted in a total reaction volume of 25 μL, containing 1 μL of cDNA, 0.5 μM of each of the forward and reverse primers, 200 μM of dNTPs, 5 μL of 5 × Q5 Reaction Buffer, and 0.25 μL of Q5 High-Fidelity DNA polymerase (NEB, Beijing, China). The PCR conditions are as follows: 94°C for 3 min, followed by 35 cycles of 94°C for 1 min, 59°C for 1 min, 72°C for 5 min, with a final 10 min extention at 72°C. The PCR products were electrophoretically separatedon a 1% agarose gel, ligation into the pMD19-T vector (TaKaRa, Dalian, China) and were then subjected to automated DNA sequencing using the ABI 3730XL sequencer(Applied Biosystems, Foster City, USA).

Phylogenetic analysis
Phylogenetic analysis was performed based on the deduced amino acid sequences of Cytochrome P450 (CYP450) and UDP-glycosyltransferase (UGT) from P. vietnamensis var. fuscidiscus and other plants. All of the deduced amino acid sequences were aligned with Clustal X using the default parameters: gap opening penalty, 10; gap extension penalty, 0.1; and delay divergent cutoff, 25%, and evolutionary distances were computed using MEGA5.10 with the Poisson correction method. For the phylogenetic analysis, a neighbor-joining tree was constructed using MEGA5.0. Bootstrap values obtained after 1000 replications are indicated on the branches. The scale repesents 0.1 amino acid substitutions per site.

RT-qPCR analysis
Ten unigenes with potential roles in ginsenoside biosynthesis were chosen for validation using RT-qPCR with gene specific primers designed with Primer3 software. All the primers sequences used for the RT-qPCR analysis are shown in Additional file 20. Total RNA from different organs (roots, hairy roots, stems and leaves) of P. vietnamensis var. fuscidiscus were extracted individually using Trizol Kit (Promega, USA) following the manufacturer's protocol. Subsequently, RNA was treated with 4 × gDNA wiperMix at 42°C for 2 min to remove DNA. The purified RNA (1ug) was reverse transcribed to cDNA using HiScript QRT SuperMix for qPCR (Vazyme, Nanjing, China). The qPCR reactions were performed in a 20 μl volume composed of 2 μl of cDNA, 0.4 μl of each primer, and 10 μl 2 × SYBR Green Master mix (TaKaRa) in Roche LightCycler 2.0 system (Roche Applied Science, Branford, CT). PCR amplification was performed under the following conditions: 30 s at 94°C, followed by 45 cycles of 94°C for 20 s, 55°C for 20 s, and 72°C for 30 s. Three technical replications were performed for all quantitative PCRs. The phosphomevalonate kinase (PMK) gene, which was found in our transcriptome database, was chosen as reference gene control for normalization after the expressions of three reference genes (actin, GAPDH, and PMK) were compared in different tissues. The relative changes in gene expression levels were calculated using the 2 -△△Ct method.

HPLC-ELSD analysis of majonoside R2
The dried powder of P. vietnamensis var. fuscidiscus roots (0.11 g) were extracted by sonication with 50 ml of methanol for 45 min, let cool, then weighed and the weight of methanol to complement the weight loss, shake, with 0.45 μm microporous membrane filtration, and 10 μL of filtrate was analyzed by HPLC-ELSD. For majonoside R2 determination, a Shimadzu LC 20A HPLC system (Shimadzu, Kyoto, Japan) with a Sedex 75 evaporative light scattering detector (Sedere, Alfortville, France) was used. Chromatographic separation was performed on an Waters symmetry shield™RP 10 (4.6 mm × 250 mm, 5.0 um, Milford, MA, USA) column maintained at 30°C. The mobile phase was acetonitrile-water (19.5:80.5, v/v), and the flow rate was 1 μL/min. The drift tube temperature of ELSD was set at 40°C and nebulizer nitrogen gas flow-rate was 1.5 l/min and gain of 9. Authentic majonoside R2 was provided by Yunnan Institute for Food and Drug Control (Kunming, Yunnan, China).

HPLC analysis of other four triterpene saponins
In brief, the dried powder of P. vietnamensis var. fuscidiscus roots (0.6 g) were extracted with 40 mL of 100% MeOH for 30 min and sonicated for 60 min and then diluted to 50 mL with MeOH. The methanol extract was filtered through a 0.45 μm membrane filter and 10 μL of filtrate was directly injected into the HPLC system.
Quantitative analysis of the remaining four triterpene saponins (ginsenoside Rg1, Rb1, notoginsenoside R1, and ginsenoside Rd) in the roots of this herb was performed on Agilent 1260 HPLC systems (Agilent Technologies, Santa Clara, CA, USA). The chromatographic column Agilent Zorbar SB-C 18 (250 mm × 4.6 mm, 5 μm, Agilent Technologies, Santa Clara, CA, USA) was used and the column temperature was maintained at 30°C. The flow rate was fixed at 1 mL/min, and the mobile phase consisted of acetonitrile (A) and water (B) and separation was achieved using the following gradient system: 85% B at 0 min, 80% B at 5 min, 77% B at 30 min, 60% B at 50 min, and 60% B at 55 min. Detection was performed at 203 nm for the remaining four triterpene saponins. Authentic ginsenoside Rg1, Rb1, notoginsenoside R1, and ginsenoside Rd were purchased from J&K Scientific Ltd (Beijing, PR China).