De novo analysis of transcriptome reveals genes associated with leaf abscission in sugarcane (Saccharum officinarum L.)

Background Sugarcane (Saccharum officinarum L.) is an important sugar crop which belongs to the grass family and can be used for fuel ethanol production. The growing demands for sugar and biofuel is asking for breeding a sugarcane variety that can shed their leaves during the maturity time due to the increasing cost on sugarcane harvest. Results To determine leaf abscission related genes in sugarcane, we generated 524,328,950 paired reads with RNA-Seq and profiled the transcriptome of new born leaves of leaf abscission sugarcane varieties (Q1 and T) and leaf packaging sugarcane varieties (Q2 and B). Initially, 275,018 transcripts were assembled with N50 of 1,177 bp. Next, the transcriptome was annotated by mapping them to NR, UniProtKB/Swiss-Prot, Gene Ontology and KEGG pathway databases. Further, we used TransDecoder and Trinotate to obtain the likely proteins and annotate them in terms of known proteins, protein domains, signal peptides, transmembrane regions and rRNA transcripts. Different expression analysis showed 1,202 transcripts were up regulated in leaf abscission sugarcane varieties, relatively to the leaf packaging sugarcane varieties. Functional analysis told us 62, 38 and 10 upregulated transcripts were involved in plant-pathogen interaction, response to stress and abscisic acid associated pathways, respectively. The upregulation of transcripts encoding 4 disease resistance proteins (RPM1, RPP13, RGA2, and RGA4), 6 ABC transporter G family members and 16 transcription factors including WRK33 and heat stress transcription factors indicate they may be used as candidate genes for sugarcane breeding. The expression levels of transcripts were validated by qRT-PCR. In addition, we characterized 3,722 SNPs between leaf abscission and leaf packaging sugarcane plants. Conclusion Our results showed leaf abscission associated genes in sugarcane during the maturity period. The output of this study provides a valuable resource for future genetic and genomic studies in sugarcane. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2552-2) contains supplementary material, which is available to authorized users.


Background
Sugarcane (Saccharum officinarum L.) is an important sugar crop, which is widely grown in the tropical and subtropical areas [1]. The total sugarcane acreage all over the world is more than 200 million hectares, yielding over 13 billion tons of sugar per year [2]. Apart of this, sugarcane is an ideal non-food biomass crop which can produce two kinds of biofuel: ethanol and high biomass raw sugar [3]. In China, sugarcane is mainly distributed in the provinces of Guangxi, Yunnan, Guangdong and Hainan, of them Guangxi province produces more than 60 % of the total sugar per year. The sugarcane harvest mainly relies on hand-cut in several countries like China, large-scale mechanized harvest is less than 15 % of the total amount, and the cost of sugarcane is increasing. Hence, it is important to study the mechanism of leaf abscission in sugarcane and breed sugarcane varieties which can shed their leaves during the maturity time.
Abscission is the programmed developmental process by which some of the organs such as leaves, flowers, or fruits are shed during the life of a plant [4]. It occurs within a specific tissue, called abscission zone (AZ), which is formed at the base of the petiole [5]. The abscission process can be divided into four major steps [6]: development of the AZ tissue (step 1), acquisition of competence to respond to abscission-promoting signaling (step 2), activation of abscission (step 3), and post abscission trans-differentiation (step 4). Steps 2 and 3 have been extensively investigated, by transcriptome analyses, in the gene expression changes during the shedding of AZs tissues, such as leaves, flowers and fruits [7][8][9][10][11][12]. It is interesting that different plant species and different organs share a majority of genes involved in steps 2 and 3, including genes involved in ethylene and auxin biosynthesis and signal transduction, cell wall modification and various stress responses [12].
Although some genes have been found to regulate the development of AZs tissues (step 1), they are different across species and organs. In Arabidopsis, the formation of seed AZs is regulated by the MADS-box transcription factor (TF) gene SEEDSTICK (STK) and the bHLH transcription factor gene HECATE3 (HEC3) [13,14], but the formation of Arabidopsis floral organ AZs is regulated by BLADE-ON-PETIOLE1 (BOP1) and BOP2 which encode BTB/POZ domain and Ankyrin repeat containing NPR1-like proteins [5]. In tomato, MACROCALYX and JOINTLESS containing a MADS-box controls the fruit and flower AZ development [12,15]. In rice, several genes and their products have been reported to regulate the pedicel AZ formation for seed shattering, including qSH1 (a major chromosome 1 quantitative trait locus for seed shattering, encoding a BELL-type homeobox TF) [16], SH4 (a major chromosome 4 seed shattering quantitative trait locus, encoding a MYB3 DNA-binding domain containing protein) [17], HATTERING ABORTION1 (SHAT1, encoding an AP2 family TF) [18], the rice SHAT-TERING1 homologue (OsSH1, encoding a YAB family TF) [19], and CTD phosphatase-like protein1 (OsCPL1) [20]. Overall, AZ development associated genes vary from one to another plant species. In sugarcane, genes controlling leaf AZ development and leaf abscission are still unknown.
RNA-Seq, a next generation sequencing technology, is used for profiling gene expression and plant breeding programs in many plants including rice [21], maize [22] and millet [23]. The genome sequence of sugarcane is not available currently, so transcriptome studies in sugarcane have been proposed, in progress or accomplished in different countries like South Africa [24,25], Australia2 [26,27] and USA [28]. Here, we performed an RNA-Seq study to profile the gene expression of new born leaf tissues using the HiSeq 2000 platform. Differential expression analysis revealed 1,202 transcripts upregulated in leaf abscission sugarcane plants (LASP) which can shed their leaves during the maturity time, compared to the leaf packaging sugarcane plants (LPSP) which are packed by the leaves during the maturity time. Functional analysis showed the upregulated transcripts in LASP were enriched in "plant-pathogen interaction", "response to stress" and abscisic acid (ABA) associated pathways. Due to their up regulation, we assumed these transcripts may involve in the processes of AZ development and leaf abscission in sugarcane. This is the first time to study the genes associated with AZ development and leaf abscission in sugarcane. Our results will provide a valuable resource for understanding the mechanism of leaf abscission in sugarcane and will contribute to the researchers in the field of sugarcane breeding.

Results and discussion
Deep sequencing and de novo assembly In order to understand the mechanism of leaf abscission in sugarcane, six transcriptome libraries for two parent sugarcane varieties (Q1 and Q2), two F1 generation sugarcane varieties (T1 and T2), which can shed their leaves during the maturity time, and another two F1 generation sugarcanes (B1 and B2), which are packed by the leaves during the maturity, were constructed and sequenced. As shown in Table 1, by using the HiSeq 2000 platform we generated a total of 524,328,950 paired raw reads for six libraries. The Q20 values of six libraries were from 97.58 % to 97.96 %. After removing low quality reads and reads with adaptors, we obtained~76.9 M,~83.2 M,~81.4 M, 77.2 M,~78.6 M and~80.9 M clean reads for B1, B2, Q1, Q2, T1 and T2, respectively. As the estimated genome sizes of S. officinarum accessions ranged from 7.50 to 8.55 Gb with an average size of 7.88 Gb [29], our data equivalent to~10-fold coverage of the sugarcane genome sequences. The clean reads (478.1 M) were then used for de novo assembly analysis by Trinity software [30], resulting a total of 275,018 unique transcripts corresponding to 164,803 genes. The GC percentage of the assembled transcripts was 48.23 %, the N50 statistic was 1,177 which represented at least 50 % of the sum of the lengths of all contigs include contigs that were at least 1,177 bp long, and the distribution of contig length can be seen in Fig. 1. Approximately 40 % of the assembled contigs were 2004 00 bp in length, however, we obtained 67,238 contigs longer than 1,000 bp. The longest contig was 15,553 bp in length and a total of 6,234 contigs were longer than 3,000 bp. The number of total assembled bases was 215,019,578, which meant about 215 M size of mRNA sequences were characterized in this study.

Functional annotation of assembled transcripts
To understand the features and functions of the assembled transcripts, we annotated the assembled transcripts by mapping them to several public databases, like NCBI non-redundant (NR), UniProtKB/Swiss-Prot, GO and KEGG pathway. The numbers of transcripts aligned to each database can be seen in Fig. 2a Fig. 2b, Additional file 1). Due to the unavailability of sugarcane genome and gene sequences in public databases, we found only 550 transcripts were mapping to Saccharum species. GO analysis is an international standardized gene function classification system that provides a controlled vocabulary to facilitate high-quality functional gene annotation [31,32]. GO term distribution (Fig. 2c) of the assembled transcripts showed  Table 1 Overview of transcriptome sequencing and de novo assembl results   B1  B2  Q1  Q2  T1  T2   Raw reads  83,879,424  92,222,556  89,370,426  89,370,426  85,447, the "cellular process" and "metabolic process" were two most abundantly represented with 24,907 (53.32 %) and 27,955 (59.85 %) transcripts, respectively. In the "cellular components" ontology transcripts were mainly distributed in "cell" (36,954, 79.12 %), "cell part" (36,954, 79.12 %) and "organelle" (30,693, 65.71 %). GO analysis also showed 33,849 (72.47 %) transcripts had "binding" function and 30,060 (64.36 %) with "catalytic activity" function in "molecular function" ontology. Detailed information can be found in Additional file 2. In addition, some transcripts (59.83 %) showed no similarity to any known protein database, there were probabilities that they were putative long noncoding RNAs or novel genes in sugarcane [33]. More experiments are required to characterize them [34,35]. Then, we extracted likely coding sequences from the assembled transcripts using TransDecoder. In total, 100,813 likely protein sequences, 44,041 5'-prime-UTRs and 57,521 3'-prime-UTRs (Fig. 2a) were obtained. By using the Trinotate pipeline likely protein sequences were annotated in terms of known proteins, protein domains, signal peptides, transmembrane regions and rRNA transcripts (Fig. 2a) Fig. 2d. Likely protein sequences of sugarcane leaves were annotated with "threonine protein kinase" (4,133 transcripts) which had more than three times of the second "leucine rich repeat" (1264 transcripts). The threonine protein kinase has been reported to play a key role in the regulation of cell proliferation, cell differentiation and cell death [39,40]. These annotations for the assembled transcripts including gene and protein description, putative conserved domains and potential biological pathways provided a valuable

Transcriptome profile
We next aligned the clean reads of all six samples to the assembled transcripts using Bowtie2 [41] and estimated the abundance of each transcript using RSEM (RNA-Seq by Expectation-Maximization) [42]. According to the phenotypes of F1 generation sugarcane varieties, B1 and B2 were considered as B, T1 and T2 samples were considered as T in this study. As shown in Fig. 3a, we detected a total of 198,816, 189,635, 229,117 and 227,900 transcripts in Q1, Q2, T and B, respectively. There were 143,021 transcripts common to all the samples. It is notable that 11,988, 3955, 10,035 and 8400 transcripts were exclusively detected in Q1, Q2, T and B, respectively. KEGG pathway analysis showed specifically expressed transcripts were enriched in different biological pathways. For example, transcripts unique to B were mainly enriched in "ribosome" (ko03011, 34 transcripts), "polycyclic aromatic hydrocarbon degradation" (ko00624, 12 transcripts), and "stilbenoid, diarylheptanoid and gingerol biosynthesis" (ko00945, 15 transcripts). Transcripts exclusively detected in T might involve in "plant-pathogen interaction" (ko04626, 94 transcripts), "NOD-like receptor signaling pathway" (ko04621, 12 transcripts), and "antigen processing and presentation" (ko04612, 16 transcripts). Different KEGG pathways for transcripts detected only in B and T indicated they may have differences on their genetic information and phenotypes.

Differentially expressed transcripts
We next used edgeR [43] to identify differentially expressed transcripts in LASP (Q1 and T), compared to LPSP (Q2 and B). As described in Material and Methods, we filtered the transcripts by their fold changes (>2) and p-values (<0.05). By using this critical, differentially expressed transcripts were shown in red in the volcano plots of Fig. 3b and c. In total, we detected 32,917 transcripts upregulated and 25,764 transcripts down regulated in Q1 sugarcane in comparison of Q2 sugarcane (Fig. 3b), and 6,309 upregulated and 5,757 down regulated transcripts in T sugarcane relatively to B sugarcane (Fig. 3c). The numbers of up and down regulated transcripts in different ranges can be found in Fig. 3d. Notably, 1,054 and 44 transcripts were upregulated very significantly (log2FC >10) in Q1 and T in comparison of Q1 and B, respectively. In addition, we found Q1 and T shared a total of 1,202 upregulated and 953 down regulated transcripts (Additional file 3). GO analysis of the commonly upregulated transcripts (Table 2) showed 38, 35 and 15 transcripts were enriched in "response to stress", "transition metal ion binding" and "cellular protein modification process", respectively. KEGG pathway analysis (Table 3) showed commonly upregulated transcripts were enriched in the pathways of "plant-pathogen interaction", "one carbon pool by folate" and "diterpenoid biosynthesis".

Transcripts associated with leaf abscission
The leaf abscission mechanism in sugarcane is still unknown, however, genes associated with hormonal regulation, stress and diseases has been studied to regulate the process of leaf abscission in other plants [46][47][48][49]. In this study, we analyzed the commonly upregulated transcripts in LASP in comparison of LPSP and found genes related to leaf abscission in sugarcane, which involved in plant-pathogen interaction, responses to stress and ABA-associated pathways.

Plant-pathogen interactions
First, we analyzed the transcripts of "plant-pathogen interaction" because it was the most significant KEGG pathway of commonly upregulated transcripts. Unlike in animals, pathogen interactions often trigger programmed cell death in plants [50][51][52]. Among the commonly upregulated transcripts, 62 were annotated to regulate the pathway of plant-pathogen interaction. Table 4 showed their log 2 values of fold changes, ranging from 1.45 to 11.45 in Q1/Q2 and 1.35 to 11.12 in T/B. Of these 62 transcripts, 36 (58.06 %) transcripts were annotated to encode disease resistance proteins, including 7 RPM1 (resistance to Pseudomonas syringae pv. Maculicola 1), transcripts, 7 RPP13     [53,54]. In Arabidopsis, RPM1 acts through its interaction with RIN4 (RPM1interacting protein 4), an essential regulator of plant defense, and triggers plant disease resistance when RIN4 is phosphorylated by AvrRpm1 [55,56]. After infection of tomato yellow leaf curl virus, RPP13 is upregulated in a resistant tomato line [57]. In addition, a WRKY transcription factor called WRK33 was identified to regulate the plant-pathogen interaction. WRK33 can interact with the W box (5'-TTGAC[CT]-3' , an elicitor-responsive cisacting element) [58] and function in ABA signaling [59].
In Arabidopsis, WRK33 is involved in regulation of the antagonistic relationship between defense pathways mediating responses to the bacterial pathogen P. syringae and the necrotrophic pathogen B. cinerea promoters [60]. The up regulation of transcripts encoding plant disease resistance proteins suggested that the defense system of sugarcane was activated by some reason and might contribute on shedding sugarcane leaves during the maturity time.

Stress responses
Several stresses can lead cell death and leaf abscission in plants, like drought [61,62] and salt [63] stresses. Hence, we analyzed the transcripts involved in the response to stresses. Compared to LPSP, 38 transcripts were upregulated in LASP and annotated to respond the stresses (Table 5). Notably, 18 (47.37 %) transcripts of them had the ability of encoding disease resistance proteins, including 4 RGA2 (resistance gene analog 2) transcripts, 3 At1g50180 homolog transcripts and 2 RGA4 (resistance gene analog 4) transcripts. In addition, we identified 6 upregulated transcripts encoding the hypothetical protein (SORBI-DRAFT_08g002290) in response to stresses. Although the function of SORBIDRAFT_08g002290 is not clear, several important motifs like leucine-rich repeats, AAA ATPase and NB-ARC domains are found in SORBI-DRAFT_08g002290 [64]. It is said that proteins containing the short motif of leucine-rich repeats can regulate the interactions between proteins [65] and NB-ARC domain has been found in many resistance proteins [66]. These evidence told us that transcripts encoding proteins in response of stresses can act in the pathway of plantpathogen interactions, and their over-expression indicated some relationship with leaf abscission in sugarcane.

ABA-associated transcripts
In the abscission process, ABA plays a critical role and involves in different pathways [67]. In citrus, leaf  abscission induced by rehydration after a period of water stress requires ABA accumulation [62]. In our study, we found different upregulated transcripts involved in ABA signaling pathways in the two sugarcane generations. Compared to Q2 and B, a total of 341 and 26 upregulated ABA-associated transcripts were identified Q1 (Additional file 6) and T (Table 6), respectively. Q1 and T shared 10 upregulated transcripts involved in ABA signaling pathways. GO annotation showed commonly upregulated transcripts were involved in ABA Dglucopyranosyl ester transmembrane transport, ABA catabolic process, ABA-activated signaling pathway, response to ABA and ABA binding. Compared to T sugarcane, Q1 had more upregulated ABA-associated transcripts, like ABC (ATP-binding cassette) transporter G family members (5, 11, 15, 25, 36, and 40), ABA receptor (PYL2 and PYL8) and zeaxanthin epoxidase (ZEP). In Arabidopsis, ABA transport can be mediated by PDR-type ABC transporter, ABG25 and AB40G [68,69]. ZEP also plays an important role in the xanthophyll cycle and ABA biosynthesis, can convert zeaxanthin into antheraxanthin and subsequently violaxanthin, and is required for resistance to osmotic and drought stresses, seed development and dormancy [70].

Transcription factors
Further, 2,031 of the assembled transcripts were annotated to encode TFs in this study. There were 603 and 58 transcripts were upregulated in Q1 and T relatively to Q2 and B, respectively. Of the upregulated transcripts encoding TFs, 16 were in common, including two transcripts encoding MADS-box transcription factors, six transcripts encoding heat stress transcription factors, two transcripts encoding transcription factor DIVARI-CATA, one WRKY33 transcript, two transcripts encoding NAC transcription factor 29, and three transcripts encoding ethylene-responsive transcription factor. Although the function of these TFs in sugarcane is not clear, they have been reported to regulate the leaf senescence and abscission in Arabidopsis [71][72][73].
Overall, pathways of plant-pathogen interaction, response to stresses and ABA signaling are important pathways in plants and involve in several bio-activities including cell development and death [50,74]. Although it is hard to tell which pathways are involved in the regulation of leaf abscission in sugarcane, the up regulation of transcripts involved in these pathways strongly supports that they may play an important role in the AZ development and leaf abscission in sugarcane. Q1 and T upregulated transcripts were enriched in different pathways, we can infer that they may promote the shedding of their leaves in different ways during the maturity time. Transcripts identified here showed their potential availability which can be used in sugarcane breeding programs.

qRT-PCR validation
Differentially expressed transcripts between LASP and LPSP identified by RNA-Seq were confirmed by quantitative real-time PCR (qRT-PCR) experiment. A total of 20 transcripts were randomly selected as candidates. We used RT 2 First Strand Kit (QIAGEN) and PCR mix (QIAGEN) to perform the cDNA synthesis and realtime PCR experiment. Forward and reverse primers for these 20 candidate transcripts were designed by Primer Express Software (v2.0, ABI) and actin gene was used as the reference gene. The information of primers used in this study can be found in Additional file 7. As shown in Fig. 4, we used relatively normalized expression (RNE) and log 2 fold change (log2FC) to show the expression  Fig. 4 qRT-PCR validation for candidate transcripts changes of candidate transcripts in Q1 vs. Q2 and T vs. B identified by qRT-PCR and RNA-Seq, respectively. The dysregulation of 19 transcripts (90.48 %) were consistent between qRT-PCR and RNA-Seq. Especially for those transcripts absent in several samples, they were not detected by both qRT-PCR and RNA-Seq. The Pearson correlations showed high relevance between RNA-Seq and qRT-PCR (0.911 for Q1/Q2 and 0.971 for T/B).

SNP discovery
Specificity and high abundance of genes support we can call single nucleotide polymorphisms (SNPs) using RNA-Seq [75]. RNA-Seq has been used for SNP discovery in non-model plants such as Eucalyptus grandis [76], Brassica napus [77], and Medicago sativa [78]. In this study, we used samtools [79] and bcftools to call SNPs, and found a total of 1,544,787 SNPs in the transcriptome alignment files for six libraries. The numbers of different SNP types were shown in Fig. 5. Because multiple SNP types could happen at one site, the total number of SNP types is a little greater than the total SNP site number (1,544,787). A total of 285,080 SNPs (18.45 %) were annotated to occur in the potential protein coding regions. Four SNP types (A-> G, C-> T, G-> A and T-> C) were significant because they took 61.50 % of the total SNP types in this study. Finally, we identified 3,722 SNPs, which were different between LASP and LPSP but same in LASP or LPSP. As shown in Additional file 8, these 3,722 SNPs were divided into 1,134 nonsynonymous SNPs and 2,588 synonymous SNPs. KEGG pathway analysis showed the transcripts containing nonsynonymous SNPs were enriched in the pathways of cytosolic DNA-sensing pathway, ribosome biogenesis in eukaryotes and inositol phosphate metabolism. Due to limit sugarcane gene annotation, SNP results seemed not relevant to our aim, which is simply to identify leaf abscission associated genes in sugarcane, more experiments and data are required to investigate the functions of these SNPs in leaf abscission.

Conclusions
In current study we employed next generation sequencing method RNA-Seq to analyze the transcriptome of sugarcane leaves and characterize candidate genes related to the leaf abscission in sugarcane during the maturity time. A total of 215,019,578 transcripts were initially assembled with N50 of 1,177 bp in length. We annotated them by mapping them to several known databases such as NR, UniProtKB/Swiss-Prot, GO and KEGG pathway databases. Further annotations including signal protein, rRNAs, protein domains and membrane proteins were performed by Transdecoder and Trinotate pipelines. Based on the assembled transcriptome, we identified several transcripts differentially expressed between LASP and LPSP, which were annotated to involve in the plant-pathogen interaction, stress response and ABA-associated pathways. qRT-PCR was used to validate the expression levels of 20 randomly selected transcripts. The results showed high consistency between qRT-PCR and RNA-Seq methods. Furthermore, a total of 1,544,787 SNPs were identified successfully, of them 1,134 were nonsynonymous SNPs and 2,588 were synonymous SNPs. The transcriptome produced by this study will provide a valuable resource of molecular information for future investigations and process understanding the roles of genes in the leaf abscission of sugarcane. Our study may also help develop sugarcane varieties which may shed their leaves during the maturity time and benefit the sugarcane farmers.

Plant material
The sugarcane plants were grown and maintained in the experimental filed of Sugarcane Research Institute in Nanning, Guangxi Province of China. ROC-26 (from Taiwan) is a precocious and productive sugarcane variety, but the sugar cane is wrapped very tight at the physiological maturity [80]. GT96-167 is a late-maturing and high-yield sugarcane variety which is bred by Guangxi Sugarcane Research Institute [81][82][83].

RNA extraction
Total RNA was isolated from sugarcane leaves by using TRIzol® reagent (Invitrogen) according to the manufacture's protocol. Briefly, 1 ml of TRIzol® reagent was added into 100 mg of leaf sample, the sample was homogenized using power homogenizer and centrifuged at 12,000 × g for 10 min at 4°C. After the fatty layer was removed and discarded, the cleared supernatant was transferred into a new tube and mixed with 0.2 ml of chloroform. The tube was shaken vigorously for 15 s, followed by an incubation for 3 min at room temperature. Next, the sample was centrifuged at 12,000 × g for 15 min at 4°C and the aqueous phase was moved into a new tube for RNA precipitation. For precipitating RNA from each sample, 10 μg of RNasefree glycogen was added into the aqueous phase as a carrier, followed by 0.5 ml of 100 % isopropanol, then, samples were incubated at room temperature for 10 min and centrifuged at 12,000 × g for 10 min at 4°C. To wash the RNA pellet, we added 1 ml of 75 % ethanol into the tube, vortexed the tube gently, centrifuged the tube 7,500 × g for 5 min at 4°C and discarded the wash. The RNA pellet was air-dried, suspended in RNase-free water, water bathed at 60°C for 10 min, quality controlled by Agilent 2100 Bioanalyzer and used for cDNA library construction and deep sequencing.
cDNA library construction and transcriptome sequencing De novo assembly of the transcriptome Raw reads of six libraries were cleaned by removing adapter sequences and low-quality sequences (reads with ambiguous bases 'N' and reads with more than 10 % Q < 20 bases). The resulting high quality reads of each library were quality controlled by using the program FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) before assembly. De novo assembly of the transcriptome was performed by Trinity software (release 2014-07-17) [30], according to the protocol [84]. Initially, highly quality RNA-Seq reads were used to generate overlapping k-mers (25). Based on the (k-1)-mer overlaps, Inchworm was used to assemble sorted k-mers into transcript contigs. Next, Chrysalis was used to cluster related Inchworm contigs into components by using grouped raw reads and paired read links. Then, a de Bruijn graph for each cluster was built by Chrysalis and reads were partitioned among the clusters. Finally, Butterfly was used to process the individual graphs and ultimately report the full length transcripts. To ensure a uniform transcriptome reference across samples, all clean reads were pooled together for assembly, then clean reads of six samples were individually aligned back to the assembled transcriptome reference.

Extract likely coding sequences from Trinity transcripts
We used TransDecoder, which is included in the Trinity software distribution, to extract potential coding sequences from assembled transcripts. Briefly, the longest open reading frame (ORF) was first identified within the assembled transcripts. Using a Markov model based on hexamers, a subset corresponding to the very longest transcripts were identified from all the longest ORFs. Then, all the longest ORFs identified were scored according to the Markov Model (log likelihood ratio based on coding/noncoding potential) in each of the six possible reading frames. For a particular transcript, the ORF was reported when its proper coding frame score calculated by GeneID software [85] was positive and highest of the other presumed wrong reading frames. A high scored putative ORF was excluded when it was fully encapsulated by the coordinates of another candidate ORF. The identified ORFs were set to encode a protein at least 100 amino acids.

Functional annotation of assembled transcripts using Trinotate
After the ORFs were extracted from the assembled transcripts, the deduced proteins were annotated using  [86] using HMMER [87]. Potential signal peptides, transmembrane domains and rRNA transcripts were predicted by SignalP [36], TMHMM Sever v2.0 [37] and RNAMMER [88], respectively. Then, the likely protein sequences were used to search against EggNOG database (v4.1, http://eggnogdb.embl.de) [38] which enables to identify the proteins distributed in EuKaryotic Orthologous Groups (KOG), Clusters of Orthologous Groups (COGs), and non-supervised orthologous groups (NOGs). Finally, above annotations were loaded into a Trinotate SQLite database and a final annotation report was generated. The maximum e-values for reporting best hit and associated annotation were no more than 1e-5.

GO and KEGG pathway annotation
To identify the assembled transcripts related with Gene Ontology (GO) and biology pathways, they were annotated by comparing to previously annotated genes in three public databases, NCBI non-redundant (NR) database, Swiss-Prot database and Kyoto Encyclopedia of Genes and Genomes (KEGG) database. We used BLAST software [89] to map assembled transcripts against NR database and filtered the mapping hits using a cut-off of e-value (1 × 10 −5 ). The resulting transcripts were then processed to retrieve associated Gene Ontology (GO) items describing biological processes (BPs), cellular components (CCs) and molecular functions (MFs) by BLAS-T2GO software [90]. By using unique gene accession numbers, BLAST2GO also produces corresponding enzyme commission (EC) numbers for sequences with an e-value ≤1e-5. Then, the transcripts with corresponding ECs were obtained and mapped to KEGG metabolic pathway database.

Transcriptome profile and different expression
The abundance of each transcript was evaluated by Bowtie2 and RSEM (RNA-Seq by Expectation-Maximization) tools in every sample. First, high quality reads of each library were mapped to Trinity assembled transcripts by Bowtie2 [41]. Then, an R package RSEM was used to evaluate the expression levels of each transcript in every library by estimating the abundance of reads that aligned to the transcript. Differential expression of transcripts across samples was identified by using an R package called edgeR [43]. edgeR can proceed differential expression of a transcript in two groups as we performed biological replicates for T and B sugarcane varieties in this study. The significance of differential expression was evaluated by the fold change (≥2) and p-value (<0.05).

SNP discovery
As the phenotypes of sugarcanes used in this study were different, we used samtools (v1.2, https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2) [79] and bcftools (v1.2, https://github.com/ samtools/bcftools/releases/download/1.2/bcftools-1.2.tar. bz2) to find possible single nucleic polymorphisms (SNPs). In brief, clean reads of each sample were aligned against the assembled transcriptome reference by Bowtie2, generating BAM formatted files. The BAM files were then indexed and processed by the mplieup function of samtools to produce a BCF file that contains all the locations in the genome. The BCF file was then used to call genotypes and reduce the list of sites to those found to be variant by passing this file into bcftools call. Finally, after filtering low quality SNPs, reliable SNPs were left and stated by a self-developed Perl script.

GO and KEGG pathway enrichment analysis
Functional analysis was performed using the Gene Ontology and KEGG pathway annotations for the transcripts. To find enriched GO items and KEGG pathways, we used p-value (Fisher's exact test) and q-value [91] to show the significance of enrichment and control the false discovery rate. Significant GO items and KEGG pathways should satisfy the critical of p-value < 0.05 and q-value < 0.9. Detected pathways only related to animal or human GO items and KEGG pathways were filtered.

Quantitative real-time PCR
Quantitative real-time PCR (qRT-PCR) experiment was used to validate the expression patterns of RNA transcripts in different sugarcane varieties. Total RNA (4 μg) which was used for RNA-Seq previously described was used for cDNA synthesis by using RT 2 First Strand Kit (QIAGEN) and qRT-PCR was performed by using PCR mix (QIAGEN), according to the manufactures' protocols. Briefly, genomic DNA elimination mix (10 μl) was first made by using total RNA (4 μg), Buffer GE and RNase-free water. After incubated at 42°C for 5 min and on ice for 5 min, the genomic DNA elimination mix was mixed with 5x Buffer BC3 (4 μl), control P2 (1 μl), RE3 Reverse Transcriptase Mix (2 μl) and RNase-free water (3 μl).
The final reverse transcription mix (20 μl) was incubated at 42°C for exactly 15 min and at 95°C for 5 min to finish cDNA synthesis. Primers for qRT-PCR were designed by Primer Express Software (v2.0, ABI) and synthesized by BGI (Additional file 7). A total of 16 μl reaction mix was made by 2x PCR mix (8 μl), forward primer (0.2 μl, 50pM/ul), reverse primer (0.2 μl, 50pM/ul), cDNA template (1 μl) and RNase-free water (6.6 μl). The final cDNA concentration of each reaction was 12.5 ng/μl. Actin was used as control and three reactions were performed for each transcript in every sample. The PCR reaction was performed and analyzed by using ABI ViiA 7 Real Time PCR System. After the threshold cycle (C T ) numbers of each transcript in every samples were evaluated, mean C T values were calculated for subsequent analysis. Base on the mean C T value, ΔC T was used to present and normalize the expression of a candidate transcript. Relatively normalized expression (RNE, −ΔΔC T method) was used to show the expression change of a transcript in two samples. C T values greater than 35 were set to 35.