Genome-wide transcriptome analysis of Chinese pollination-constant nonastringent persimmon fruit treated with ethanol

Background: The persimmon Diospyros kaki Thunb. is a commercially important deciduous fruit tree. The fruits have a proanthocyanidin (PA) content of >25% of the dry weight and are astringent. PAs cause astringency that is often undesirable for human consumption; thus, the removal of astringency is an important practice in the persimmon industry. Soluble PAs can be converted to insoluble PAs by enclosing the fruit in a polyethylene bag containing diluted ethanol. The development of genomic resources for persimmon has lagged because of its large and complex genome. Second-generation sequencing is an efficient technique for generating large volumes of sequence data that can represent a large number of genes and their expression levels.

Results: We used 454 sequencing for the de novo transcriptome assembly of persimmon fruit treated with 5% ethanol (Tr library) and without treatment as the control (Co library) to investigate the genes and pathways that control PA biosynthesis and other secondary metabolites. We obtained 374.6 Mb in clean nucleotides comprising 624,690 and 626,203 clean sequencing reads from the Tr and Co libraries, respectively. We also identified 83,898 unigenes; 54,719 (~65.2%) unigenes were annotated based on similarity searches with known proteins. Up to 14,954 of the unigenes were assigned to the protein database Clusters of Orthologous Groups (COG), 24,337 were assigned to the term annotation database of Gene Ontology (GO), and 45,506 were assigned to 200 pathways in the database of Kyoto Encyclopedia of Genes and Genomes (KEGG). The two libraries were compared to identify the differentially expressed unigenes. The expression levels of genes involved in PA biosynthesis and tannin coagulation were analysed, and some of them were verified using quantitative real time PCR (qRT-PCR).

Conclusions: This study provides abundant genomic data for persimmon and offers comprehensive sequence resources for persimmon research. The transcriptome dataset will improve our understanding of the molecular mechanisms of tannin coagulation and other biochemical processes in persimmons.


Background
The persimmon Diospyros kaki Thunb. (2n = 6X = 90) originated in China and is cultivated principally in China, Korea and Japan [1]. Persimmon cultivars are classified into four types: pollination-constant nonastringent (PCNA), pollination-constant astringent (PCA), pollination-variant nonastringent (PVNA), and pollination-variant astringent (PVA). This classification is based on the effect of pollination on flesh colour and on whether astringency is lost naturally on the tree by harvest time [1]. The PCNA type includes Japanese PCNA (JPCNA) and Chinese PCNA (CPCNA), which differ in the genetic basis of the PCNA trait [2]. The natural loss of astringency is a qualitatively inherited trait that is recessive in JPCNA cultivars [3,4] but dominant in CPCNA cultivars. When the CPCNA cultivar 'Luotian-tianshi' is crossed as the maternal parent to a JPCNA or non-PCNA type, the F1 offspring segregate in a 1:1 ratio of PCNA to non-PCNA types [5,6]. CPCNA cultivars have attracted attention in the breeding industry because their natural ability to lose astringency is a dominant trait. In addition, CPCNA has the potential to be an important parent in PCNA persimmon breeding in the future.
High-throughput sequencing technologies developed in recent years provide a convenient way of establishing a rapid and efficient molecular research platform. Next-generation sequencing (NGS) is so named by contrast with the Sanger method, which represents first-generation sequencing technology. Currently, the three mainstream NGS technologies are Roche/454 pyrosequencing (developed in 2005, http://www.454.com), Illumina/Solexa sequencing (developed in 2006, http://www.illumina.com), and ABI/SOLiD sequencing (developed in 2007, http://www.appliedbiosystems.com). These NGS technologies vary in their input requirements and sequence output with regard to the total bases sequenced, length of each sequence read, and price per megabase of sequence information [24]. Among these technologies, 454 sequencing generates the fewest sequence reads but produces the longest ones (i.e. from 100 bp to ~800-1000 bp). Long reads are optimal for initial genome and transcriptome characterisation because longer pieces are assembled more efficiently than shorter pieces [25]. Given their rapid processing, high throughput, and cost effectiveness, NGS technologies have been successfully used to study genomes and transcriptomes of species with and without sequenced genomes. Many novel and functional genes can be obtained from massive amounts of data.
Abundant genetic resources for persimmons are currently available. However, genomic information and EST sequences for this fruit tree are lacking. In addition, molecular data on persimmons are insufficient when compared to those of other fruit trees, such as apple, pear, peach, citrus, and grape. Accordingly, we performed large-scale transcriptome sequencing of CPCNA persimmon fruit using Roche/454 technology to create a transcript sequence database of the persimmon and identify candidate genes involved in PA biosynthesis and tannin coagulation. We used IDEG6 to filter the differentially expressed genes in the treatment (Tr) and control (Co) libraries. We also verified the differentially expressed unigenes by quantitative real time PCR (qRT-PCR). The present study provides a platform for studying the genes involved in persimmon tannin coagulation and tannin biosynthesis to analyse the relationship between differentially expressed genes and persimmon fruit deastringency and clarify the mechanism of astringency loss for CPCNA.

Sequencing and assembly
The soluble PA concentration of the CPCNA persimmon fruit was <0.2% in the group treated with 5% ethanol after 3 d but was still very high (1.4%) in the control group (Figure 1A), suggesting that astringency was successfully removed in the treated fruit. In the printing test, the colour change observed after the reaction between FeCl2 and soluble tannin was dark in the control group but light in the treatment group, suggesting that the amount of soluble tannin decreased in the treated fruit (Figure 1B). A half-plate run using the 454 GS FLX Titanium platform was carried out on the cDNA that was generated by SMART technology with the total RNA from the Tr (5% ethanol-treated) and Co libraries. A total of 624,690 and 626,203 high-quality reads were generated from the Tr and Co libraries, with average sequence lengths of 319 and 309 bp, respectively [National Center for Biotechnology Information (NCBI) Short Read Archive, accession SRA091427]. After trimming the adaptor sequences and removing reads shorter than 100 bp, the clean reads of the two libraries were assembled into 83,898 unique sequences using Mimicking Intelligent Read Assembly (MIRA), with an average size of 579 bp. A summary of the 454 sequencing and assembly is presented in Table 1, and the size distributions for these reads and unigenes are presented in Figure 2.

Sequence annotation
Estimating the number of genes and the level of transcript coverage is difficult because of the lack of genetic or genomic information on persimmons. We performed BLASTX alignments (E-value < 10^-5 or 10^-10) against the databases of NCBI-nonredundant (Nr), SwissProt, Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) to identify the putative functions of the unigene sequences. A total of 54,719 unigenes, accounting for 65.2% of the total unigenes, were annotated to the Nr database (Table 2).
We used GO to obtain a functional annotation of the persimmon unigenes [26]. A total of 24,337 unigenes (29.0%) were assigned to at least one GO term (Figure 3). Of the unigenes assigned to GO terms, 27,406 were assigned to cellular location, 30,629 to molecular function, and 39,037 to biological processes (Additional file 1). These unigenes were further classified into 47 functional subcategories. Within each of the three main categories of the GO classification scheme (i.e. cellular location, molecular function, and biological process), the dominant subcategories were 'cell' (9106), 'cell part' (9106), 'binding' (13,212), 'catalytic activity' (11,774), 'cellular process' (11,929), and 'metabolic process' (12,712).
The protein database Clusters of Orthologous Groups (COG) has been designed to classify proteins from completely sequenced genomes based on orthology [27]. We searched annotated sequences for genes involved in COG classification to predict and classify the possible functions of all oriental persimmon unigenes. Of the 54,719 sequences that returned a hit with the Nr database, 14,954 could be assigned to 24 COG categories (Figure 4). Among these categories, the cluster for 'general function prediction only' was the largest group (2456, 18.83%), followed by 'post-translational modification, protein turnover, chaperones' (1900, 14.57%) and 'translation, ribosomal structure and biogenesis' (1811, 13.89%). The clusters for 'nuclear structure' (6, 0.05%), 'cell motility' (10, 0.08%), and 'cell cycle control, cell division, chromosome partitioning' (46, 0.35%) were the smallest groups.
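The annotation rates quoted above can be checked with simple arithmetic. The sketch below recomputes each database's share of the 83,898 assembled unigenes; the counts come from the text, while the function name and the rounding to one decimal place are illustrative assumptions. The per-category COG percentages are not recomputed here.

```python
# Counts taken from the text; names and rounding are illustrative choices.
TOTAL_UNIGENES = 83_898
ANNOTATED = {"Nr": 54_719, "GO": 24_337, "COG": 14_954}


def percent_of_total(count, total=TOTAL_UNIGENES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    for database, count in ANNOTATED.items():
        print(f"{database}: {count} unigenes, {percent_of_total(count)}% of all unigenes")
```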

Detection of differentially expressed unigenes in the Tr and Co libraries
A previous study proposed that comparing the number of reads for a gene between different libraries, or for different genes in the same library, could be a reliable indicator of relative gene expression [29]. Thus, IDEG6 was used to identify unigenes that show a statistically significant difference in relative abundance (as reflected by the total count of individual sequence reads) between the two libraries (Table 3).

Quantitative real-time PCR confirmation
Fifty differentially expressed unigenes were selected for qRT-PCR analysis (primers are shown in Additional file 3) to confirm the expression of the unigenes from the sequencing and computational analyses. cDNA from control flesh and from flesh treated for 3 d (15 July 2011; 73 d after bloom) was used as the template. The results showed that the qRT-PCR assessments (relative expression level) of 34 unigenes (68%) were consistent with those of the 454 sequencing analysis (Figure 6). These results suggest that our transcriptome data have high coverage.

Discussion
Currently, the most common application of NGS in nonmodel species is transcriptome characterisation [30][31][32][33][34]. Among the currently available NGS technologies, 454 pyrosequencing produces the longest reads; thus, it has emerged as a powerful tool for transcriptome sequencing.
In addition, many studies have used de novo assembly of such data to produce genome-level resources for nonmodel organisms [35][36][37][38].
A total of 31,211 persimmon unigenes were mapped into 200 KEGG pathways (Additional file 2). The genetic information processing pathway (12,838) and cellular processes pathway (2606) were highly represented categories in the metabolic pathways. Most persimmon fruits accumulate PAs in their flesh during development; PAs cause astringency due to coagulation of oral proteins and are synthesised from metabolites via the shikimate and flavonoid pathways [10][11][12]. Therefore, we focused on the pathways pertaining to phenylpropanoid biosynthesis (ko00940, 123 unigenes) and flavonoid biosynthesis (ko00941, 62 unigenes). Most of the genes related to PA biosynthesis in these two pathways were found in our transcriptome sequencing data. PAs cause astringency that is often undesirable for human consumption; thus, the removal of astringency is an important practice in the persimmon food industry. Soluble PAs can be converted to insoluble PAs by enclosing the fruit in a polyethylene bag containing diluted ethanol [56]. Acetaldehyde formed in situ from ethanol is involved in the direct insolubilisation of soluble PAs, causing a loss of astringency [57]. Pyruvate decarboxylase (PDC) and alcohol dehydrogenase (ADH) are two important enzymes in this process, which is involved in the glycolysis/gluconeogenesis pathway (ko00010, 496 unigenes). Moreover, six and 31 unigenes were classified into the PDC and ADH families, respectively.
The expression levels of the unigenes were reflected by the number of reads used to characterise the differences in gene expression between the Tr and Co transcriptome libraries. A total of 3639 unigenes were found to be differentially expressed (Table 3). The expression levels of the unigenes involved in PA biosynthesis and tannin coagulation are shown in Table 4. The expression levels of ADH1, 4CL, ANS, and F3′5′H were significantly downregulated, whereas those of ADH3, PDC, CHS, F3H, and LAR were significantly upregulated after the removal of astringency. These results suggest that the expression of genes involved in PA biosynthesis might be affected by ethanol treatment, which is consistent with the findings of Ikegami et al. [14].
Acetaldehyde produced from ethanol is involved in the direct insolubilisation of soluble PAs [57]. The synthesis of acetaldehyde is generally catalysed by PDC, which converts pyruvate to acetaldehyde; meanwhile, ADH is involved in the potentially reversible interconversion of ethanol and acetaldehyde [58]. Furthermore, ADH1 and PDC are suggested as the key genes involved in persimmon astringency removal [59]. In the present study, the PDC gene was upregulated, which would result in the production of more acetaldehyde. However, the ADH1 gene was downregulated, which may reduce the conversion of acetaldehyde into ethanol. This result suggests that acetaldehyde accumulated in the ethanol-treated fruit, which resulted in the loss of astringency. However, this result was not consistent with that reported in Min's study [59], where ADH1 was upregulated by ethylene. This inconsistency can be attributed to the different cultivars used (i.e. the non-astringent type 'Luotian-tianshi' was used in our study, while the astringent type 'Mopanshi' was used in Min's research). The use of different materials might cause different gene expression patterns; however, this hypothesis needs to be validated by further experimentation.
We found that the gene aldehyde dehydrogenase family 2 (ALDH2) was highly expressed (1409 reads) in the Co library and downregulated in the Tr library (545 reads), with a total of 11 unigenes in both libraries (Table 4). ALDH2 has a broad expression pattern and is most notably involved in the second step of ethanol metabolism (i.e. acetaldehyde oxidation). The decrease in ALDH2 in the Tr library might have inhibited the conversion of acetaldehyde to acetic acid, which, consequently, led to acetaldehyde accumulation. Large amounts of acetaldehyde triggered the coagulation of tannins (insolubilisation of soluble PAs), causing the loss of astringency in the treated persimmon fruits. This result suggests that the ALDH2 gene, together with the ADH and PDC genes, might have important functions in tannin coagulation.
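Because the two libraries differ slightly in size, raw read counts such as those for ALDH2 are easier to compare after scaling. The following is a minimal sketch, assuming a reads-per-million normalisation and a log2 ratio; the text does not state that this normalisation was used, and the function names are hypothetical. The read counts and library sizes are those reported above.

```python
import math


def reads_per_million(read_count, library_size):
    """Scale a raw read count to reads per million reads in the library."""
    return read_count * 1_000_000 / library_size


def log2_fold_change(treated_rpm, control_rpm):
    """Log2 ratio of treated to control abundance (negative means downregulated)."""
    return math.log2(treated_rpm / control_rpm)


if __name__ == "__main__":
    # ALDH2 read counts and clean-read totals as reported in the text.
    co_rpm = reads_per_million(1409, 626_203)  # control (Co) library
    tr_rpm = reads_per_million(545, 624_690)   # ethanol-treated (Tr) library
    print(f"Co: {co_rpm:.1f} RPM, Tr: {tr_rpm:.1f} RPM")
    print(f"log2(Tr/Co) = {log2_fold_change(tr_rpm, co_rpm):.2f}")
```

With library sizes this similar, the normalised ratio stays close to the raw ratio of 545 to 1409, so the direction of change does not depend on the scaling.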
The present study on the persimmon transcriptome has several biological implications. First, persimmon, which accumulates PAs (condensed tannins) in its flesh during development, can be considered a model plant for tannin research. Second, the loss of astringency in CPCNA fruits treated with ethanol imitates the natural loss of astringency, especially with respect to tannin coagulation. This imitation helped us to understand the mechanism of astringency loss in CPCNA. Third, the current study, based on the present transcriptome data (even in the absence of complete genome sequences for persimmons), will facilitate the advancement of future genetic studies.

Conclusions
This work presents the first de novo transcriptome sequencing analysis of the CPCNA persimmon fruit using the 454 GS FLX Titanium platform. A total of 374.6 Mb of data were generated and assembled into 83,898 unigenes. Persimmon unigenes related to PA biosynthesis were characterised, and differentially expressed unigenes in the two libraries were verified using qRT-PCR. ADH, PDC, and the newly discovered persimmon gene ALDH2 were found to have important functions in tannin coagulation. To the best of our knowledge, this study is the first to employ the 454 sequencing technology to investigate the whole transcriptome of the persimmon fruit. The assembly of the reads was also conducted without a reference genome. The transcriptome characterisation described in the present study will provide the initial information needed for the functional study of persimmons to elucidate the molecular mechanisms of tannin coagulation and other biochemical processes in this fruit tree.

Sample preparation
In previous analyses, the tannin concentration per fruit differed considerably between JPCNA and CPCNA. Both types accumulate PA in their fruits at an early stage. PA accumulation is halted in JPCNA at 7-9 weeks after bloom (WAB), and a low concentration of PA is observed at 10 WAB. On the other hand, CPCNA continuously accumulates PA until the late stages and maintains a very high PA concentration [22,23]. Thus, it appears that at 9-10 WAB, JPCNA and CPCNA exhibit different PA accumulation patterns. In the present study, 30 young fruits on a CPCNA-type persimmon tree (D. kaki 'Luotian-tianshi', 2n = 6X = 90) grown in the Persimmon Repository of Huazhong Agricultural University, China, were enclosed with polyethylene bags containing 10 mL of 5% ethanol on 12 July 2011 (10 WAB). Control (untreated) fruits were enclosed with polyethylene bags containing 10 mL of water. Three days later, all treated and control fruits were sampled and peeled. The flesh of the fruits was diced into small pieces, frozen in liquid nitrogen, and stored at −80°C until use for RNA isolation. The concentrations of soluble and insoluble tannins in the control and treated fruit flesh were measured by the Folin-Ciocalteu method after 3 d of treatment [60]. Soluble tannins of the fruit flesh were also examined after 3 d of treatment by the printing method [61], which is a convenient way of identifying persimmon astringency loss. FeCl2 reacts with the soluble PAs; thus, the darker the resulting product, the more astringent the fruits.
RNA extraction, cDNA library construction, and 454 sequencing
For each sample (5% ethanol treated and control), approximately 10 g of mixed flesh (10 individuals) was used for RNA preparation and tannin concentration measurements. Total RNA was extracted using TRIzol Reagent (Invitrogen, USA) following the manufacturer's protocol. The quality and quantity of the total RNA were analysed using the NanoDrop 2000 spectrophotometer (Thermo Scientific, USA) and gel electrophoresis.
Approximately 1 μg of RNA was used to generate double-stranded cDNA using the SMART™ cDNA Library Construction Kit (Clontech, USA). Finally, ~5 mg of cDNA was used to construct a 454 library. Roche GS-FLX 454 pyrosequencing was conducted by the Oebiotech Company in Shanghai, China.

de novo transcriptome assembly and analysis
A Perl program was written to remove vector sequences and the poly(A)/poly(T) tail from sequences; reads with lengths <100 bp were removed before assembly. Then, high-quality reads were assembled with MIRA [62] to construct unique consensus sequences. MIRA was run with the 454 settings (--job=denovo,est,normal,454; -SK:mnr=yes; -SK:rt=2; 454_SETTINGS -LR:mxti=no; -CL:qc=no).
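The read-cleaning step described above can be illustrated with a short sketch. The authors' Perl program is not shown in the text, so this is a Python stand-in under stated assumptions: a tail is taken to be a run of at least 10 A's at the 3' end or T's at the 5' end, vector removal is omitted, and all names are hypothetical. Only the 100 bp length cutoff comes from the text.

```python
import re

MIN_LENGTH = 100  # reads shorter than this were discarded before assembly

# Assumed tail definition: 10 or more A's at the 3' end, or T's at the 5' end.
POLY_A_3P = re.compile(r"A{10,}$")
POLY_T_5P = re.compile(r"^T{10,}")


def trim_poly_tail(seq):
    """Strip a 3' poly(A) run and a 5' poly(T) run from a read."""
    seq = seq.upper()
    seq = POLY_A_3P.sub("", seq)
    seq = POLY_T_5P.sub("", seq)
    return seq


def clean_reads(reads, min_length=MIN_LENGTH):
    """Trim tails, then keep reads that are still at least min_length long."""
    cleaned = {}
    for name, seq in reads.items():
        trimmed = trim_poly_tail(seq)
        if len(trimmed) >= min_length:
            cleaned[name] = trimmed
    return cleaned


if __name__ == "__main__":
    example = {
        "read_long": "ACGT" * 30 + "A" * 15,   # 120 bp insert plus a poly(A) tail
        "read_short": "ACGT" * 20 + "A" * 15,  # only 80 bp once the tail is removed
    }
    kept = clean_reads(example)
    print({name: len(seq) for name, seq in kept.items()})
```

Applying the length filter after trimming, rather than before, ensures that a read which only passes 100 bp because of its tail is still discarded.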
The functions of the unigenes were annotated by BLASTX with an E-value threshold of 10^-5 against the protein databases, including NCBI-NR, Swiss-Prot, KEGG [28], and COG [63]. InterPro domains [64] were annotated by InterProScan [65] Release 16.0, and functional assignments were mapped onto GO [66]. WEGO [67] was used for GO classification and GO tree construction.
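The E-value filtering described above might look like the following sketch. It assumes the BLASTX results were saved in the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12) and that the highest-scoring hit is kept per unigene; neither choice is stated in the text, and the function name and identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # threshold quoted in the text


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)} for the best hit per query.

    Assumes 12-column BLAST tabular output; hits with an E-value at or above
    the cutoff are ignored, and ties are resolved in favour of the first hit seen.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= cutoff:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical hits: two for unigene_1, one weak hit for unigene_2.
    demo = "\n".join([
        "unigene_1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200",
        "unigene_1\tprotB\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t250",
        "unigene_2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30",
    ])
    for query, hit in best_hits(demo).items():
        print(query, hit)
```

Unigenes without any hit below the cutoff, such as `unigene_2` in the demo, are simply absent from the result and would count as unannotated.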

Differentially expressed unigene detection
A freely available web tool IDEG6 [68] was used to identify unigenes showing statistically significant differences in relative abundance (as reflected by the total count of individual sequence reads) between the Tr and Co libraries. The general Chi-squared method was used because it was the most efficient analytical method [68]. Finally, unigenes with P ≤ 0.01 were deemed significantly different between the two libraries.
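For two libraries, a chi-squared comparison of read counts reduces to a test on a 2x2 table (reads from the unigene versus all other reads, in each library). The sketch below is not IDEG6's own implementation; it is a plain Pearson test without continuity correction, written with the standard library only, and the function name is hypothetical. The example uses the ALDH2 read counts and clean-read totals reported in this paper.

```python
import math


def chi_squared_2x2(count_a, total_a, count_b, total_b):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction).

    count_a/count_b are the reads assigned to one unigene in each library;
    total_a/total_b are the library sizes. At least one count must be non-zero.
    Returns (statistic, p_value).
    """
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, grand_total - count_a - count_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # ALDH2: 1409 reads in Co (626,203 reads), 545 reads in Tr (624,690 reads).
    statistic, p_value = chi_squared_2x2(1409, 626_203, 545, 624_690)
    print(f"chi-squared = {statistic:.1f}, significant at P <= 0.01: {p_value <= 0.01}")
```

A real analysis of 83,898 unigenes would also need to consider multiple testing, which this sketch does not address.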
RNA extraction, first-strand cDNA synthesis, and qRT-PCR analysis
A total of 50 unigenes generated by 454 sequencing were selected for experimental validation. The total RNA used for qRT-PCR analysis was extracted from the flesh of the Tr and Co fruits. After RNA extraction, first-strand cDNA was synthesised from 1.0 μg of RNA using the PrimeScript® RT Reagent Kit with gDNA Eraser (TaKaRa, Dalian, China) according to the manufacturer's protocol. The cDNA was diluted threefold and used as the template for qRT-PCR. qRT-PCR was performed on a LightCycler® 480 II System (Roche Diagnostics) using SYBR® Premix Ex Taq™ II (TaKaRa). The reaction mixture was prepared as described in the manual, and each reaction was performed in quadruplicate. A negative control (no template) was included in each run. The standard amplification protocol consisted of an initial denaturing step of 95°C for 30 s, followed by 45 cycles of 95°C for 5 s, 60°C for 10 s, 72°C for 15 s, and a melting temperature cycle with constant fluorescence data acquisition from 65°C to 95°C. The gene quantification method was based on the relative expression of the target gene versus the reference gene (DkActin), and the ratio was calculated with the LightCycler® 480 software. All primers are listed in Additional file 3.
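The text does not describe how the LightCycler® 480 software computes the target-to-reference ratio. As an illustration only, the sketch below uses the widely applied 2^-ΔΔCt approach, which assumes 100% amplification efficiency for both the target and the reference gene; the function name and all Ct values are hypothetical and are not data from this study.

```python
from statistics import mean


def relative_expression(target_ct_treated, reference_ct_treated,
                        target_ct_control, reference_ct_control):
    """Relative expression of a target gene (treated vs control) by 2^-ddCt.

    Each argument is a list of replicate Ct values. The reference gene
    (DkActin in the text) normalises for template amount in each sample.
    """
    delta_treated = mean(target_ct_treated) - mean(reference_ct_treated)
    delta_control = mean(target_ct_control) - mean(reference_ct_control)
    delta_delta = delta_treated - delta_control
    return 2.0 ** (-delta_delta)


if __name__ == "__main__":
    # Hypothetical quadruplicate Ct values, for illustration only.
    ratio = relative_expression(
        target_ct_treated=[24.0, 24.1, 23.9, 24.0],
        reference_ct_treated=[18.0, 18.1, 17.9, 18.0],
        target_ct_control=[22.0, 22.1, 21.9, 22.0],
        reference_ct_control=[18.0, 18.0, 18.1, 17.9],
    )
    print(f"relative expression (treated/control) = {ratio:.2f}")
```

A ratio below 1 indicates lower expression in the treated sample; comparing its direction with the read-count change is one way to judge whether qRT-PCR and sequencing agree for a given unigene.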

Additional files
Additional file 1: GO classifications of CPCNA persimmon unigenes.
Additional file 3: Information for primers used in qRT-PCR analysis. The primer sequences of the 34 unigenes successfully verified by qRT-PCR are listed in this table.