De novo transcriptome analysis and comparative expression profiling of genes associated with the taste-modifying protein neoculin in Curculigo latifolia and Curculigo capitulata fruits

Curculigo latifolia is a perennial plant native to Southeast Asia whose fruits contain the taste-modifying protein neoculin, which binds to sweet receptors and makes sour fruits taste sweet. Although similar to snowdrop (Galanthus nivalis) agglutinin (GNA), which contains mannose-binding sites in its sequence and 3D structure, neoculin lacks such sites and has no lectin activity. Whether the fruits of C. latifolia and other Curculigo plants contain neoculin and/or GNA family members was unclear. We performed de novo RNA-seq assembly of the fruits of C. latifolia and the related C. capitulata and analyzed in detail the expression patterns of neoculin and neoculin-like genes in both species. We assembled 85,697 transcripts from C. latifolia and 76,775 from C. capitulata using Trinity and annotated them using public databases. We identified 70,371 unigenes in C. latifolia and 63,704 in C. capitulata. In total, 38.6% of unigenes from C. latifolia and 42.6% from C. capitulata shared high similarity between the two species. We identified ten neoculin-related transcripts in C. latifolia and 15 in C. capitulata, encoding both the basic and acidic subunits of neoculin in both plants. We aligned these 25 transcripts and generated a phylogenetic tree. Many orthologs in the two species shared high similarity, despite the low number of common genes, suggesting that these genes likely existed before the two species diverged. The relative expression levels of these genes differed considerably between the two species: the transcripts per million (TPM) values of neoculin genes were 60 times higher in C. latifolia than in C. capitulata, whereas those of GNA family members were 15,000 times lower in C. latifolia than in C. capitulata. The genetic diversity of neoculin-related genes strongly suggests that neoculin genes underwent duplication during evolution. The marked differences in their expression profiles between C. latifolia and C. capitulata may be due to mutations in regions involved in transcriptional regulation. Comprehensive analysis of the genes expressed in the fruits of these two Curculigo species helped elucidate the origin of neoculin at the molecular level.


Keywords: NGS, RNA-seq, Neoculin, NBS, NAS, Curculigo capitulata, Curculigo latifolia, Expression profile, Gene duplication

Background
Curculigo latifolia (Hypoxidaceae family, formerly classified in the Liliaceae family) is a perennial plant found in Southeast Asia, especially the Malay peninsula [1,2]. According to the Royal Botanic Gardens, Kew, there are 27 species of Curculigo [3]. The genetic diversity and morphology of Curculigo have long been of interest [4][5][6][7]. C. latifolia and C. capitulata were previously reclassified as members of the Molineria genus, but recent discussions have suggested that they should be returned to the Curculigo genus. Here, we use the traditional name, Curculigo.
Neoculin itself has a sweet taste and is 550 times sweeter than sucrose on the percentage sucrose equivalent scale [19,20]. Furthermore, neoculin has a taste-modifying activity that converts sourness to sweetness: for example, the sour taste of lemons is changed to a sweet orange taste. Moreover, the presence of neoculin induces sweetness in drinking water, and some organic acids taste sweet when consumed after neoculin [21]. Neoculin is perceived by the human sweet taste receptor T1R2-T1R3, a member of the G-protein-coupled receptor family [22]. Neoculin consists of two subunits that form a heterodimer: the neoculin basic subunit (NBS), also called curculin [16], and the neoculin acidic subunit (NAS) [18,23]. NBS is an 11-kDa peptide consisting of 114 amino acid residues [16,24], while NAS has a molecular mass of 13 kDa and 113 residues. The two subunits share 77% identity at the protein level [18]. Several essential amino acids that are responsible for the taste-modifying properties of neoculin have been identified: His-11 in NBS is responsible for the pH-dependent taste-modifying activity of neoculin [25], and Arg-48, Tyr-65, Val-72, and Phe-94 function in the binding and activation of human sweet taste receptors [26]. Changes in the tertiary structure of the subunits at these residues are thought to contribute to the taste-modifying properties of neoculin [27,28].
Lectins are proteins that recognize and bind to specific carbohydrate structures [29,30]. Plant lectins are classified into 12 families. Neoculin NBS and NAS are similar in protein sequence and 3-dimensional (3D) structure to the GNA (Galanthus nivalis agglutinin) family of lectins, which are present in bulbs such as snowdrop (Galanthus nivalis) and daffodil (Narcissus pseudonarcissus) and are thought to function as defense or storage proteins [31][32][33]. However, NBS and NAS lack a mannose-binding site (MBS) and do not have lectin activity [34][35][36]. Furthermore, whereas GNA family members in plants such as snowdrop contain one disulfide bond, which functions in intra-subunit bonding, neoculin forms two intra-subunit bonds and two inter-subunit bonds between NBS and NAS [32].
The fruit of C. latifolia contains 1.3 mg neoculin per fruit [37] or 1.3 mg per gram of fresh pulp [38]. This is thought to be considerably higher than the levels of total proteins in typical edible fruits [39]. Although the taste-modifying activity of neoculin is well known, its biological role in C. latifolia is unknown. In addition, as neoculin is not a lectin, it was not clear which lectins are expressed in C. latifolia fruits, especially lectins of the GNA family. Finally, whether other Curculigo species also accumulate neoculin or neoculin-like proteins is unknown.
Here, we compared the gene expression profiles in the fruits of C. latifolia and C. capitulata by transcriptome deep sequencing (RNA-seq). The aim of this study was to comprehensively analyze the two species from the viewpoint of amino acid sequences and gene expression levels to shed light on the origins of neoculin.

Results
De novo RNA-seq assembly from C. latifolia and C. capitulata fruits
We sequenced cDNA libraries from C. latifolia and C. capitulata using the Illumina HiSeq 2500 platform. To analyze the data, we filtered out raw reads with average quality values < 20, reads with < 50 nucleotides, and reads with ambiguous 'N' bases. After trimming reads for adapter sequences and filtering, we obtained 44,396,896 reads from C. latifolia and 43,863,400 from C. capitulata. We then assembled high-quality reads from C. latifolia and C. capitulata into 85,697 and 76,775 contigs with a mean length of 775 bp and 744 bp, respectively, using Trinity 2.11. The distribution of transcript lengths and transcripts per million (TPM) values are shown in Additional files 1 and 2. The N50 values for C. latifolia and C. capitulata transcripts were 1324 and 1205, respectively (Table 1). Unigene clustering using CD-Hit revealed 70,371 unigenes in C. latifolia and 63,704 in C. capitulata (Table 1).
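For readers unfamiliar with the N50 statistic reported above, the following minimal sketch computes it under the standard definition: the contig length at which contigs of that length or longer cover at least half of all assembled bases. The function name and the toy lengths are hypothetical; this is not the script used to produce Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
toy_lengths = [200, 300, 500, 800, 1200, 2000]
print(n50(toy_lengths))
```

Because the N50 weights long contigs heavily, it is typically larger than the mean contig length, as is the case for both assemblies described here.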
The gene repertoires of the two Curculigo species fitting the monocots
Low annotation rate of the transcripts: To gather functional information about the transcripts identified from de novo assembly, we aligned all transcripts against sequences from various protein databases, including the nonredundant protein (NR) database at the National Center for Biotechnology Information (NCBI), RefSeq, UniProt/Swiss-Prot, Clusters of Orthologous Groups of proteins (COG), the rice (Oryza sativa) genome (Os-Nipponbare-Reference-IRGSP-1.0, Assembly: GCF_001433935.1), and the Arabidopsis (Arabidopsis thaliana) genome (Assembly: GCF_000001735.4) and selected the top hits from these queries. We obtained annotations for 38,433 out of 85,697 transcripts (44.8%) in C. latifolia and 40,554 out of 76,775 transcripts (52.8%) in C. capitulata with an E-value threshold of 1e-10 by performing a Basic Local Alignment Search Tool search with our in silico-translated transcripts against protein databases (BLASTx) using the NR, RefSeq, UniProt, and COG databases and the proteomes of rice and Arabidopsis. All annotations are listed in Additional file 3. The number of annotated transcripts for each database is listed in Table 2. The low annotation rate suggests that the two Curculigo species are significantly different from classical model plant systems that drive much of the information stored in public databases.
Conservation across monocots: After BLASTx searches with the C. latifolia and C. capitulata transcripts against the NR database, we determined the extent of gene conservation across plant species by running Blast2GO [40]. We estimated the similarity of the two Curculigo species to various plant species by counting the number of hits from each species obtained by BLAST searches (Fig. 2). The top six species displaying the highest homology with C. latifolia and C. capitulata transcripts were monocots, like Curculigo, supporting the view that the assembled Curculigo genes are highly similar to known genes from other monocots. The top six species sharing the highest similarity with C. latifolia and C. capitulata were identical in terms of both species and rank order.
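The top-hit selection and annotation rate described above can be sketched as follows, assuming BLAST results in the standard 12-column tabular format (`-outfmt 6`, E-value in the 11th column). The function name, the toy rows, and the choice to treat the 1e-10 threshold as inclusive are assumptions for illustration, not the authors' pipeline.

```python
import csv
import io


def top_hits(blast_tsv, max_evalue=1e-10):
    """Keep the lowest-E-value hit per query from BLAST tabular output.

    Hits with an E-value above max_evalue are discarded (the threshold is
    treated as inclusive here, which is an assumption).
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical rows: tx1 has two hits, tx2 only a weak hit, tx3 no hit at all.
example = "\n".join([
    "tx1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "tx1\tprotB\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-20\t120",
    "tx2\tprotC\t40.0\t60\t36\t0\t1\t180\t1\t60\t1e-4\t45",
])
hits = top_hits(example)
annotation_rate = 100 * len(hits) / 3  # 3 transcripts in this toy set
print(hits, round(annotation_rate, 1))
```

Applied to the reported counts, the same ratio gives 38,433 / 85,697 ≈ 44.8% for C. latifolia and 40,554 / 76,775 ≈ 52.8% for C. capitulata.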
Expression of functionally similar genes between the two species: Using the COG database, we classified 11,875 transcripts from C. latifolia and 12,448 from C. capitulata into functional categories (Fig. 3). We observed no significant differences between the two species, which supports the notion that these two species have functionally similar genes.
We also analyzed the functions of the assembled transcripts via Gene Ontology (GO) analysis using the rice genome annotation (Additional file 4). Again, no significant differences were observed between the two species. The results also suggested that the repertoires of genes from the two species are similar to those of better-known species.
Fig. 2 The de novo assembled C. latifolia and C. capitulata transcriptomes reveal high similarity to known monocot genes. The percentage of genes with matches in C. latifolia (outer circle) and C. capitulata (inner circle) was obtained from the results of BLAST search against the NR database. The top six most highly homologous species were monocots, like Curculigo.

Fig. 3 (category labels recovered from the figure; series: C. latifolia, C. capitulata) RNA processing and modification; Chromatin structure and dynamics; Energy production and conversion; Cell cycle control, cell division, chromosome partitioning; Amino acid transport and metabolism; Nucleotide transport and metabolism; Carbohydrate transport and metabolism; Coenzyme transport and metabolism; Lipid transport and metabolism; Translation, ribosomal structure and biogenesis; Transcription; Replication, recombination and repair; Cell wall/membrane/envelope biogenesis; Cell motility; Posttranslational modification, protein turnover, chaperones; Inorganic ion transport and metabolism; Secondary metabolites biosynthesis, transport and catabolism; General function prediction only; Function unknown; Signal transduction mechanisms; Intracellular trafficking, secretion, and vesicular transport; Defense mechanisms; Extracellular structures; Mobilome: prophages, transposons; Nuclear structure; Cytoskeleton.

The genes with high similarity between C. latifolia and C. capitulata fruits are less than half of the genes
Using the unigene sequences, we analyzed the similarity between C. latifolia and C. capitulata genes. We performed BLAST searches using each transcript from one species as the query sequence against all transcripts from the other species with a threshold E-value of 1e-5 or less and selected the reciprocal best hits. We defined unigenes with high similarity between the two species as common genes and unigenes with low similarity between the species, or present in only one species, as unique genes. In total, we deemed 38.6% (27,155 out of 70,371) of genes in C. latifolia and 42.6% (27,155 out of 63,704) of genes in C. capitulata to be common genes (Fig. 4). The relatively small number of common genes suggests that a long time has passed since the divergence of these species, which is consistent with results of lineage analysis based on plastid DNA from Hypoxidaceae family members. Indeed, although the Curculigo genus constitutes a single clade, C. latifolia and C. capitulata are not the most closely related species within this clade [5]. Next, we investigated the proportion of annotated genes in these species using the COG, RefSeq, UniProt, and NR databases and the genomes of rice and Arabidopsis (shown in Table 2). Among the common genes, 17,337 and 17,199 genes were annotated (63.8 and 63.3% of common genes) in C. latifolia and C. capitulata, respectively.

By contrast, there were 11,718 annotated unique genes (27.1% of unique genes) among genes found only in C. latifolia and 14,848 (40.6% of unique genes) among those found only in C. capitulata. Thus, the annotation rate was higher for common genes than for unique genes, despite the smaller number of common genes. One possible explanation for this observation is that many of the genes common to both species may also be common genes in other model plant species that are highly represented in the databases employed.
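The definition of common genes via reciprocal best hits can be illustrated with a minimal sketch. It assumes the best hit of each query has already been extracted from the BLAST output after E-value filtering; the function name and the toy identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return the set of pairs (a, b) where b is a's best hit and a is b's."""
    return {
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best-hit maps (query -> best subject) in each direction.
l_to_c = {"L1": "C1", "L2": "C2", "L3": "C2"}
c_to_l = {"C1": "L1", "C2": "L2", "C3": "L3"}

pairs = reciprocal_best_hits(l_to_c, c_to_l)
print(sorted(pairs))  # L3 and C3 are not reciprocal, so they count as unique

# Proportions of common genes recomputed from the counts given in the text.
pct_common_latifolia = round(100 * 27155 / 70371, 1)
pct_common_capitulata = round(100 * 27155 / 63704, 1)
print(pct_common_latifolia, pct_common_capitulata)
```

Because each reciprocal pair contributes exactly one gene per species, the number of common genes is the same (27,155) in both species, while the percentages differ because the unigene totals differ.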
We then compared the expression profiles of 27,155 common genes between C. latifolia and C. capitulata. Although the sequences of the corresponding genes in C. latifolia and C. capitulata were similar, their expression profiles were not necessarily equivalent. Nonetheless, only 111 out of the 27,155 common genes had TPM ratios ≥50 (Table 3). Of these 111 genes, five were neoculin-related genes, indicating that the expression profiles of at least some neoculin-related genes differ significantly between the two species.
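A selection like the one behind Table 3 could be sketched as below, assuming the criterion is a TPM ratio of at least 50 in either direction, with pairs skipped when both TPM values are below 100 (as the table note indicates). The function name, the handling of zero TPM values, and the toy data are assumptions.

```python
def strongly_divergent(pairs, min_ratio=50.0, min_tpm=100.0):
    """Select common genes whose TPM differs by >= min_ratio between species.

    pairs: iterable of (gene_id, tpm_latifolia, tpm_capitulata).
    Pairs in which both TPM values are below min_tpm are ignored.
    A zero TPM in one species is treated as an infinite ratio (assumption).
    """
    selected = []
    for gene_id, tpm_l, tpm_c in pairs:
        high, low = max(tpm_l, tpm_c), min(tpm_l, tpm_c)
        if high < min_tpm:
            continue
        ratio = float("inf") if low == 0 else high / low
        if ratio >= min_ratio:
            selected.append((gene_id, ratio))
    return selected


# Hypothetical gene pairs, for illustration only.
toy_pairs = [
    ("g1", 650.0, 10.0),    # ratio 65, kept
    ("g2", 40.0, 0.5),      # both below 100, skipped
    ("g3", 300.0, 200.0),   # ratio 1.5, not divergent
    ("g4", 2.0, 15000.0),   # ratio 7500, kept
]
print(strongly_divergent(toy_pairs))
```

Taking the ratio of the larger to the smaller value makes the filter symmetric, so genes that are higher in either species are reported.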
Lectin genes expressed in C. latifolia and C. capitulata fruits
We previously demonstrated that C. latifolia fruits contain a taste-modifying protein consisting of an NBS-NAS heterodimer that is similar to lectins in the GNA family. We therefore investigated the number of lectin genes expressed in the fruits of C. latifolia and C. capitulata that were categorized into each of the 12 lectin families to better understand the general outline of the GNA gene family in these species. To determine the number of lectin genes, we performed tBLASTN searches against all transcripts in each species using the sequences of 12 representative lectins as query [41] (Table 4). In both species, the largest lectin family was the GNA family, which includes the neoculin (NBS and NAS) genes. Ten of the 45 lectin genes in C. latifolia and 13 of the 49 lectin genes in C. capitulata belonged to the GNA family. Thus, we analyzed the many GNA family genes in these species, including the neoculin genes, in more detail.

Analysis of GNA family and neoculin-related transcripts
We constructed a phylogenetic tree using the deduced protein sequences from 17 transcripts of well-known GNA family members and 25 full-length neoculin-related transcripts from Curculigo (10 from C. latifolia and 15 from C. capitulata; Fig. 5); the method used for sequence selection is shown in Additional file 5. The TPM values (calculated by RSEM) are listed after the transcript IDs. An alignment of all sequences is shown in Additional file 6. The C. latifolia transcript L_16562_c0_g1_i1 was a good match for NBS, while L_16562_c0_g1_i2 was a good match for NAS, except for one amino acid substitution (Additional file 7); these transcripts will be referred to as NBS and NAS hereafter. The predicted proteins derived from neoculin-related transcripts formed a distinct group separate from known GNA family members. Neoculin-like sequences formed one group that included NBS and NAS (named the 'neoculin group'), as well as two other large groups (group 1 and group 2) (Fig. 5). In addition to NBS and NAS, the neoculin group also included proteins whose transcripts were highly expressed (C_9931_c0_g1_i1) and that presented the conserved amino acid residues critical for binding mannose (and thus have the potential for lectin activity). In addition, each transcript had an ortholog in both Curculigo species. Many highly expressed transcripts belonged to group 1 (L_22219_c0_g1_i1 [TPM: 7600]; C_18595_c_g1_i1 [TPM: 2300]; C_9454_c0_g1_i1 [TPM: 2000]). Although these highly expressed transcripts encode proteins that are very similar to mannose-binding lectins, they are not mannose-binding lectins, as they lack the conserved and essential amino acid residues that form the mannose-binding sites. At this time, we do not know their physiological functions or the reason for their high expression. Predicted proteins encoded by group 2 transcripts were also relatively close to the lectins Polygonatum multiflorum agglutinin (PMA) and Polygonatum roseum agglutinin (PRA) from the Polygonatum genus. Unlike in group 1, there were no highly expressed transcripts in this group.

Fig. 4 The majority of unigenes from C. latifolia and C. capitulata correspond to unique genes with low similarity. Number of unigenes based on sequence similarity between C. latifolia and C. capitulata fruits. The numbers of highly similar unigenes that are common (L-common: common genes of C. latifolia; C-common: common genes of C. capitulata) and of unigenes with low similarity, which are thus unique genes (L-unique: unique genes of C. latifolia; C-unique: unique genes of C. capitulata), are shown.

Table 3 Comparison of the expression profiles of C. latifolia and C. capitulata
Common genes with a TPM ratio ≥50 between the two species, except when the TPM values of both genes are < 100. The genes were sorted based on the TPM value of C. latifolia along with the corresponding genes of C. capitulata. Note that there were no cases of genes that were highly expressed in both species. This pattern strongly suggests changes in the gene expression regulatory system due to divergence of the two species. *Neoculin-related transcripts (cf. Fig. 5 and Additional file 6). a Pident and E-value are BLASTN results performed with C. latifolia as query against C. capitulata.
In each group, we detected neoculin-related orthologous transcripts with high similarity between C. latifolia and C. capitulata. The existence of many orthologs in each species, combined with the presence of relatively few common genes (comprising only approximately 40% of all transcripts in both species; Fig. 4), is noteworthy. We infer that these orthologs probably existed before the divergence of these two species, whereas their amino acid differences probably arose afterwards. Because plants, including Curculigo, are immobile, genetic diversity is beneficial for increasing population survival under multiple stresses. It would be interesting to determine whether Curculigo plants other than C. latifolia and C. capitulata contain neoculin-related genes, especially genes in the neoculin group.
Within the neoculin group, we identified transcripts encoding proteins with high similarity to NBS and NAS in both C. latifolia and C. capitulata. Notably, although the corresponding NBS and NAS genes were highly expressed in C. latifolia, their C. capitulata orthologs were only weakly expressed (C_16324_c0_g1_i1 and C_16324_c0_g1_i2). The TPM values for NBS and NAS genes in C. latifolia were approximately the same, at 650 and 620, respectively. This result is in agreement with the finding that their encoded proteins form a heterodimer [18]. Although C_9931_c0_g1_i1 was highly expressed in C. capitulata, with a TPM value of 15,000 (the fifth highest expression level among all C. capitulata transcripts), its C. latifolia orthologs (L_307_c0_g1_i1 and L_307_c0_g2_i1) were expressed at very low levels. To verify the RNA-seq results, we performed qRT-PCR analyses of the neoculin group genes in the two species (Additional files 8 and 9) and compared the expression levels using a ubiquitin gene of each species as the reference gene. In C. latifolia, the expression levels of NBS and NAS were almost the same, and those of L_307_c0_g1_i1 and L_307_c0_g2_i1 were considerably lower. In C. capitulata, the expression levels of C_16324_i1 and C_16324_c0_g1_i2 were very low, and that of C_9931_c0_g1_i1 was very high. These results support the TPM values estimated from RNA-seq analysis. In addition, the relative ordering of expression levels (high versus low) between the two species obtained by RNA-seq analysis was also supported by the qRT-PCR analyses. Curiously, in all three groups (neoculin group, groups 1 and 2) for which there were orthologs in both species, if a gene was highly expressed in one species, its ortholog was weakly expressed in the other species; we did not identify a single case where orthologs were highly expressed in both species. The data shown in Table 3 also support this pattern. These results strongly suggest changes in the gene expression regulatory system due to divergence of the two species.
Next, we aligned the deduced amino acid sequences for the proteins belonging to the neoculin group (Fig. 6a). We divided the sequences into nine regions, including the regions removed by cleavage of the secretion signal peptide and three mannose binding site (MBS)-like regions: N pro-sequence (N-Pro), N-terminal (N-term), MBS1, inter1, MBS2, inter2, MBS3, C-terminal (C-term), and C pro-sequence (C-Pro). The His-11 residue was present in the N-term region of NBS and in the predicted proteins encoded by transcripts L_16562_c0_g1_i1 in C. latifolia and C_16324_c0_g1_i1 in C. capitulata. This site is essential for the pH-dependent taste-modifying activity of neoculin. By contrast, transcripts C_9931_c0_g1_i1 in C. capitulata and L_307_c0_g1_i1 and L_307_c0_g2_i1 in C. latifolia (abbreviated 'C_9931 series') did not code for His-11, which was replaced by Tyr-11, as in NAS. In addition, Cys-77 and Cys-109, which form an intermolecular disulfide bond between NBS and NAS, were present within the inter2 and C-term regions in both species, but were absent in the C_9931 series. Thus, it is likely that proteins corresponding to the C_9931 series do not form dimers.
Four residues are responsible for the binding and activation of the human sweet receptor: Arg-48, Tyr-65, Val-72, and Phe-94 [26]. Although Tyr-65 and Val-72 were identified in the C_9931 series, Arg-48 and Phe-94 were missing, replaced by Leu-48 and Val-94. The lack of His-11 and of two of these four indispensable residues, as well as the lack of dimerization, indicates that the C_9931 series proteins may not possess the sweet taste or taste-modifying properties of classic neoculin. Indeed, a preliminary test indicated that C. capitulata fruits did not have a sweet taste or taste-modifying properties despite the high expression level of C_9931_c0_g1_i1 (data not shown). Three sites similar to the MBS were present in the MBS1, MBS2, and MBS3 regions of this protein. Moreover, whereas NBS and NAS lack the essential residues of the MBS, all of these residues were conserved in C_9931_c0_g1_i1, making C_9931_c0_g1_i1 a likely lectin candidate.
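The mannose-binding site consensus used in this study is QxDxNxVxY (see the Fig. 6 legend). A minimal sketch of scanning a protein sequence for intact copies of this motif is given below; the function name and the toy sequence are hypothetical, and the toy sequence is not a Curculigo protein.

```python
import re

# QxDxNxVxY, where x is any residue. The lookahead lets overlapping
# occurrences be reported as well.
MBS_MOTIF = re.compile(r"(?=(Q.D.N.V.Y))")


def find_mbs(protein):
    """Return (1-based start position, matched motif) for each intact MBS."""
    return [(m.start() + 1, m.group(1)) for m in MBS_MOTIF.finditer(protein)]


# Invented sequence containing two intact motifs.
toy_protein = "MAAQADGNLVLYGGQHDANAVAY"
print(find_mbs(toy_protein))

# A single substitution at a conserved position abolishes the match.
print(find_mbs("MAAQADGNLVLF"))
```

A strict pattern match of this kind only flags sites in which all five conserved residues are present, which mirrors the distinction drawn above between C_9931_c0_g1_i1 and NBS/NAS.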
Based on this protein alignment, we investigated all amino acid substitutions in each region in comparison to the two reference sequences, NBS and NAS (Additional file 10). The amino acid substitution rate with reference to NBS is shown in the heatmap in Fig. 6b. Between the NBS series and the NAS series, 18 to 27% of substitutions occurred in the overall regions from the N-term region to C-term region (23%, 26 of 114 residues in NBS). The highest substitution rate was 27% in the MBS2 region, followed by 24% in the inter2 and C-term regions. In the C_9931 series, the highest substitution rate was 53% in the C-term region, followed by the MBS3 region (44%) and inter2 region (43%). These results suggest that the region from inter2 to C-term is the main source of sequence diversity among neoculin group members.
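A per-region substitution rate of the kind shown in the heatmap can be sketched as follows, assuming two sequences taken from the same alignment and a region given as a slice of alignment columns. The function name, the decision to skip columns where the reference has a gap, and the toy sequences are assumptions.

```python
def substitution_rate(reference, other):
    """Percent of aligned positions where `other` differs from `reference`.

    Both strings must come from the same alignment (equal length).
    Columns where the reference has a gap ('-') are skipped (assumption).
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for ref_res, oth_res in zip(reference, other):
        if ref_res == "-":
            continue
        compared += 1
        if ref_res != oth_res:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * differing / compared


# Invented aligned fragments: 2 differences over 8 positions.
toy_reference = "DNVLYSGQ"
toy_other = "DNILYSAQ"
print(substitution_rate(toy_reference, toy_other))

# The overall figure quoted in the text: 26 of 114 residues.
print(round(100 * 26 / 114))
```

Computing the rate for each region separately amounts to calling the function on the corresponding column slice of both sequences.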

Biochemical analysis
We extracted proteins from C. latifolia and C. capitulata fruits and subjected them to SDS-PAGE, followed by Coomassie brilliant blue (CBB) staining and immunoblotting using a mixture of polyclonal anti-NAS and anti-NBS specific antibodies (Fig. 7 and Additional file 11). The CBB-stained gel is shown in Fig. 7a and the corresponding immunoblot in Fig. 7b. By CBB staining, we detected an 11-kDa band representing NBS and a 13-kDa band representing NAS in C. latifolia fruit samples (Fig. 7a). In C. capitulata fruits, some bands around 11 kDa may be the protein encoded by C_9931_c0_g1_i1, which had a high TPM value. Immunoblotting confirmed the identity of the bands corresponding to NBS and NAS in C. latifolia fruits. However, we detected no such bands in C. capitulata fruits (Fig. 7b), perhaps because NBS and NAS accumulate at very low levels in this species, as reflected by the low TPM values of their encoding transcripts (as described above). The amino acid sequence of the C-term region, which is recognized by the antibody, was also very different in C_9931_c0_g1_i1 compared to both NBS and NAS, which is consistent with the finding that the proteins detected by CBB staining were not detected by immunoblotting.

Discussion
The C. latifolia and C. capitulata transcriptomes contain many neoculin-related genes that are similar within and between species. This diversity is thought to result from gene duplication, which is known to contribute to plant evolution [41][42][43][44][45][46][47]. Such gene duplication might place some genes under the same transcriptional regulation. The neoculin genes NBS and NAS are likely paralogs that arose due to tandem duplication before the divergence of C. latifolia and C. capitulata. The characteristics of NBS and NAS genes in C. latifolia and C. capitulata are summarized in Table 5. Both C. latifolia and C. capitulata produce NBS and NAS transcripts, and the sequences of the C_9931 series transcripts matched those of active GNA family members. However, their expression levels in the two species were very different.

Fig. 6 The essential amino acid residues in neoculin group members have been conserved. a Amino acid sequence alignment of neoculin group members from C. latifolia and C. capitulata fruits. In each alignment, the residues that are shared with only NBS or only NAS are shown in blue and red, respectively. The residues that are not consistent with NBS or NAS are shown in pink, and those that are consistent with only C_9931_c0_g1_i1 (Ser17) are shown in light green. His-11 and Cys residues are highlighted in dark red and dark green, respectively. Arg-48, Tyr-65, Val-72, and Phe-94 are highlighted in pale green. Mannose-binding sites (MBS, QxDxNxVxY) are indicated by a dagger (†), and conserved residues are highlighted in yellow. MBS residues that are conserved in all sequences are indicated by a double dagger (‡). MBS residues in L_307_c0_g2_i1, L_307_c0_g1_i1, and C_9931_c0_g1_i1 are shown in boxes. The predicted proteins were divided into nine regions (N-Pro, N-term, MBS1, inter1, MBS2, inter2, MBS3, C-term, and C-Pro) based on the regions removed after signal-peptide cleavage, the N- or C-terminal regions, the regions of MBS 1 to 3, and the regions between the MBSs. b Amino acid residue substitutions in proteins from the neoculin group. The region from inter2 to C-term is the primary region of sequence diversity in the neoculin group. The values shown in the heatmap are amino acid substitution rates (%) of the neoculin group. The NBS sequence was used as the reference.
Table 5 legend: The number of amino acid residue differences from the reference sequences, the potential for heterodimerization, lectin activity, taste modification, and expression levels of the transcripts from C. latifolia or C. capitulata fruits are summarized. The amino acid sequences of NBS, NAS, and C_9931_c0_g1_i1 of C. capitulata were used as the reference sequences.

C. latifolia fruits have been reported to accumulate 1.3 mg neoculin g⁻¹ fresh pulp. Because neoculin is 550 times as sweet as sucrose [19,20], one gram of C. latifolia fruit pulp is thus estimated to be equivalent to 715 mg of sucrose in sweetness, explaining the sweet taste of these fruits. Given that the TPM values of the neoculin genes in C. capitulata were only 1/60 those detected in C. latifolia, C. capitulata fruits would be expected to contain only approximately 22 μg neoculin g⁻¹ fresh pulp and have the same sweetness as 12 mg of sucrose. Based on these values, it seemed likely that C. capitulata fruits would not taste sweet, which we confirmed in a preliminary test. Thus, neoculin levels, and therefore taste, differ greatly between these fruits, paralleling the difference in the expression of NBS and NAS genes in the two species. The taste of C. latifolia fruits may strongly influence its survival strategies. For example, the sweet taste conferred by neoculin may facilitate seed spread by animals.
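The sweetness estimates above follow from simple arithmetic, reproduced in the sketch below. It relies on the same simplifying assumption as the text, namely that neoculin protein content scales linearly with the TPM values of its genes; the variable names are hypothetical.

```python
# Values taken from the text.
NEOCULIN_MG_PER_G = 1.3       # neoculin in C. latifolia pulp (mg per g)
SWEETNESS_VS_SUCROSE = 550    # neoculin sweetness relative to sucrose
TPM_FOLD = 60                 # C. latifolia / C. capitulata neoculin TPM

# Sucrose equivalent of 1 g of C. latifolia pulp (mg).
latifolia_sucrose_mg = NEOCULIN_MG_PER_G * SWEETNESS_VS_SUCROSE

# Expected neoculin in 1 g of C. capitulata pulp (micrograms), assuming
# protein content is proportional to TPM.
capitulata_neoculin_ug = NEOCULIN_MG_PER_G / TPM_FOLD * 1000

# Sucrose equivalent of 1 g of C. capitulata pulp (mg).
capitulata_sucrose_mg = capitulata_neoculin_ug / 1000 * SWEETNESS_VS_SUCROSE

print(round(latifolia_sucrose_mg))     # mg sucrose equivalent
print(round(capitulata_neoculin_ug))   # micrograms neoculin
print(round(capitulata_sucrose_mg))    # mg sucrose equivalent
```

Transcript abundance does not necessarily predict protein abundance, so these figures are best read as order-of-magnitude estimates.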
The structure of the taste-modifying protein miraculin is similar to that of the soybean (Glycine max) Kunitz trypsin inhibitor, and thaumatin is a sweet protein with an α-amylase or trypsin-inhibitor-like structure. Similarly, neoculin has a structure similar to that of lectin, a common molecular structure in plants [48][49][50][51][52][53][54]. Trypsin inhibitors, amylase inhibitors, and lectins commonly accumulate in fruits and seeds. The diversity of these proteins arose from gene duplications and mutations during evolution. It appears that over the course of evolution, neoculin, miraculin, and thaumatin all acquired sweetness or taste-modifying activity in regard to human senses.
Lectins are thought to play important protective and storage roles in plants in general. Thus, the high expression levels of lectin genes in C. capitulata fruits are likely to reflect important roles of lectins in this plant. In contrast, the low expression levels of neoculin genes in C. capitulata suggest that the encoded protein may be less beneficial in this species. Similarly, and in contrast to C. capitulata, active GNA family members were barely expressed in C. latifolia fruits. Neoculin genes were highly expressed in C. latifolia but weakly expressed in C. capitulata despite the similar vegetative appearance of the two plants (Fig. 1). These physiological differences might be due to mutation(s) of the cis-regulatory elements in these genes. Cis-elements, including promoters, enhancers, and silencers, are very important for the regulation of gene expression [41,[55][56][57][58]. Likewise, the different expression levels of related genes in C. latifolia vs. C. capitulata might be caused by mutations in their cis-elements. For example, the cis-elements of the NBS and NAS genes may have mutated after the divergence of the two species, or the genes may have acquired mutations or lost cis-elements during the gene duplication events that led to their divergence, leading to different expression patterns. Deciphering the genomic information of these two species further might help verify this notion and distinguish among these possible mechanisms.

Conclusions
RNA-seq analysis and de novo transcriptome assembly of C. latifolia and C. capitulata fruits revealed the presence of numerous neoculin-like genes. Among the various neoculin-related genes that arose from gene duplication, several mutations accumulated, resulting in the genes encoding NBS and NAS. These proteins form the heterodimeric protein neoculin, which exhibits taste-modifying activity in humans. Our comprehensive investigation of the genes expressed in the fruits of these two Curculigo species will help uncover the origin of neoculin at the molecular level.

Plant materials
C. latifolia (voucher ID 26092) was obtained from the Research Center for Medicinal Plant Resources, National Institutes of Biomedical Innovation, Health, and Nutrition, Tsukuba, Japan (originated in Indonesia). C. capitulata (voucher ID 31481) was obtained from The Naito Museum of Pharmaceutical Science and Industry, Kakamigahara, Japan. The plants were cultivated in a greenhouse at the Yamashina Botanical Research Institute. Photographs of the fruits of these plants are shown in Fig. 1.

Fruit setting
C. latifolia flowers were pollinated by hand in the morning on the first day of flowering. C. capitulata flowers were placed in 50 ppm 1-naphthylacetic acid (NAA) in the morning on both the first and second days of flowering. This is the first report of a method to induce C. capitulata fruit set through plant hormone application. About 60 days after flowering, mature fruits were harvested and immediately soaked in RNAlater™ solution (Thermo Fisher Scientific, MA, USA). The fruits were stored at −80 °C until use. The samples were ground into a powder in liquid nitrogen prior to RNA extraction. Total RNA was extracted from the frozen samples using the phenol-SDS method, and poly(A)+ mRNA was purified using an mRNA Purification Kit (Amersham Biosciences, Buckinghamshire, UK).
Comparison of gene expression in C. latifolia vs. C. capitulata fruits
To compare the transcripts in C. latifolia vs. C. capitulata fruits, a BLASTN search was performed with an E-value cutoff of < 1e−5 using each transcript from one species as the query against all transcripts from the other species, and the best hits were then selected.
cDNA was synthesized from 1 μg of total RNA using SuperScript IV Reverse Transcriptase (Thermo Fisher Scientific, MA, USA) according to the manufacturer's instructions. PowerUp SYBR Green Master Mix (Thermo Fisher Scientific, MA, USA) was used with an ABI 7500 real-time PCR system (Thermo Fisher Scientific, MA, USA). The thermal cycling program used the following parameters: denaturation at 95 °C for 2 min, followed by 40 amplification cycles (95 °C for 15 s, 60 °C for 1 min). Melting curves were constructed after the 40 cycles to confirm the specificity of the reactions. The 2^−ΔΔCT method was used to calculate the relative expression of six genes following normalization to L_19431_c0_g1_i2 for C. latifolia and C_20039_c0_g6_i1 for C. capitulata, which are putative ubiquitin genes in the respective species. The primer sequences are shown in Additional file 8.
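The 2^−ΔΔCT calculation mentioned above can be illustrated with a minimal sketch. The function name, the choice of calibrator sample, and all Ct values below are hypothetical and are not taken from this study; only the arithmetic of the method is shown.

```python
def relative_expression(ct_target, ct_reference,
                        ct_target_calibrator, ct_reference_calibrator):
    """Return the fold change given by the 2^-ddCt method.

    ct_target / ct_reference: Ct values of the gene of interest and of the
    reference gene (e.g. a putative ubiquitin gene) in the sample.
    ct_target_calibrator / ct_reference_calibrator: the same two Ct values
    in the calibrator sample (an assumption; the calibrator used in the
    study is not specified here).
    """
    # Normalize the target gene to the reference gene in each sample.
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = ct_target_calibrator - ct_reference_calibrator
    # Compare the normalized sample against the normalized calibrator.
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values, for illustration only.
fold_change = relative_expression(22.0, 18.0, 26.0, 18.0)
print(fold_change)
```

With these illustrative values, ΔCT is 4 in the sample and 8 in the calibrator, so ΔΔCT = −4 and the target is estimated to be 2^4 = 16-fold more abundant in the sample. The method assumes near-100% amplification efficiency for both the target and reference primer pairs.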

Biochemical analysis
SDS-PAGE was carried out using fruit extracts from C. latifolia and C. capitulata. The proteins were visualized by Coomassie brilliant blue (CBB) staining. Immunoblot analysis was carried out using anti-NBS and anti-NAS specific polyclonal antibodies [38,67], which were raised against the C terminus of NBS or NAS, respectively. Preparation and purification of fruit extracts were performed as described previously [18,38]. Each 0.1 g pulp sample was treated with 0.5 mL of 0.5 M NaCl to obtain an extract, which was combined with the appropriate volume of buffer containing 2-mercaptoethanol for SDS-PAGE. After SDS-PAGE, proteins were transferred to a PVDF membrane with a pore size of 0.45 μm (Merck Millipore, MA, USA). The membrane was soaked in Tris-buffered saline/Tween-20 (TBST) containing 5% skim milk to block non-specific binding.
After blocking, the membrane was incubated with a mixture of the anti-NBS and anti-NAS specific polyclonal antibodies diluted 1:500 in TBST for 1 h at room temperature. The membrane was then washed three times with TBST for 5 min each. Next, the membrane was incubated with Rabbit IgG HRP Linked Whole Ab (Sigma-Aldrich, MO, USA) diluted 1:4000 in TBST for 1 h at room temperature, and again washed three times with TBST for 5 min each. Signals were visualized with the Clarity Western ECL Substrate kit (BIO-RAD, CA, USA) according to the manufacturer's protocol. The signals were detected at 428 nm with a 20 s exposure using a Luminescent Image Analyzer (Image Quant LAS 4000 mini, GE Healthcare, IL, USA).