Illumina paired-end sequencing and de novoassembly
In order to characterize the transcriptome of ramie and generate a broad survey of genes associated with ramie vegetative growth and development, total RNA was extracted from several vegetative tissues including leaf, root, stem bast, stem xylem and stem shoot at vegetative stages. Using Illumina paired-end sequencing technology, each sequencing feature can yield 2 × 90 bp independent reads from the either end of a DNA fragment. In this study, a total of 57,471,036 raw sequencing reads with the length of 90 bp were generated from a 200 bp insert library.
An assembler, Trinity developed specifically for use with next-generation short-read sequences [14], was employed for de novo assembly. After stringent quality checking and data cleaning, approximately 53 million high quality reads were obtained with 97.46% Q20 bases (base quality more than 20). Based on the high quality reads, 84,876 contigs were assembled with an average length of 393 bp ranging from 100 to 17,496 bp. Of these, 44.8% and 11.6% contigs had a length more than 200 bp and 1000 bp, respectively (Figure 1).
To obtain the unigenes, the paired-end reads were realigned to contigs and these contigs in one transcript were assembled by the Trinity and gained the sequence not being extended on the either end were defined as unigenes. Then the TGICL [15] was used to get rid of redundant unigene and to further assemble all the unigenes to form a single set of non-redundant unigenes. Finally the de novo assembly yielded 43,990 unigenes with an average length of 824 bp and a total length of 36.26 Mb. The length of assembled unigenes ranged from 200 to 17,496 bp. There were 11,349 unigenes (25.80%) with length no more than 300 bp, 9,966 unigenes (22.66%) with length varying from 301 to 500 bp, 10,081 unigenes (22.92%) in the length range of 501 to 1000 bp, and 12,564 unigenes (28.57%) with length more than 1000 bp (Figure 1). To evaluate the quality of the assembled unigenes, all the usable sequencing reads were realigned to the unigenes. The sequencing depth ranged from 0.1 to 132,686 folds, with an average of 62.98 folds. About 91.0% of the unigenes were realigned by more than 10 reads and 45.4% were remapped by more than 100 reads (Figure 2).
Sequence orientations of all unigenes were pridicted via ESTScan or BLAST (Basic Local Alignment Search Tool) with an E-value threshold of 10-5 in the NCBI database of non-redundant protein (Nr), along with the Swiss-Prot protein database, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and the Clusters of Orthologous Groups (COG) database. Finally, sequence orientation of 34251 (77.8%) unigenes was predicted and 9739 (22.2%) unigenes sequence orientation is still unknown.
Functional annotation by searching against public databases
For validation and annotation of the assembled unigenes, sequence similarity search was conducted in the Nr database, the COG database, the Swiss-Prot protein database, and the KEGG database [16, 17] with an E-value threshold of 10-5. The results indicated that out of 43,990 unigenes, 32,790 (74.5%), 21,206 (48.2%), 19,846 (45.1%) and 13,042 (29.6%) unigenes showed significant similarity to known proteins in Nr, SwissProt, KEGG and COG databases, respectively (Figure 3). Together, 34,192 (77.7%) unigenes showed similarity to known proteins in four databases mentioned above. The E-value distribution of the top hits in the Nr database revealed that 57.39% of the mapped sequences showed significant homology (less than 1.0E-45) (Figure 4A), and 71.56% and 27.63% of the sequences with similarities greater than 60% and 80%, respectively, were found (Figure 4B). Interestingly, 35.79% of the unigenes showed significant homology with sequence of Vitis vinifera and 19.61%, 15.86% and 12.42% of the mapped sequences have a significant similarity with the sequence of Ricinus communis, Populus trichocarpa and Glycine max, respectively (Figure 4C).
Functional classification by GO and COG
Gene Ontology (GO) is an international standardized gene functional classification system which offers a dynamic-updated controlled vocabulary and a strictly defined concept to comprehensively describe the properties of genes and their products in any organism. GO has three ontologies: Molecular function, Cellular component and Biological process. On the basis of Nr annotation, the Blast2GO program [18] was used to obtain GO annotation for the unigenes annotated by the Nr database. Then the WEGO software [19] was used to perform GO functional classification for these unigenes. In total, 16,050 unigenes (36.5%) with BLAST matches to known proteins were assigned to GO classes with 111,333 functional terms. Of these, assignments to the biological process made up the majority (45,848, 44.18%), followed by molecular function (44,282, 39.77%) and cellular component (21,203, 19.05%, Figure 5).
The Clusters of Orthologous Groups (COG) is a database where the orthologous gene products were classified. Every protein in the COG database is assumed to be evolved from an ancestor protein, and the whole database is built on coding proteins with complete genome as well as system evolution relationships of bacteria, algae and eukaryotes. All unigenes were aligned to the COG database to predict and classify potential functions. Totally, 13,042 genes (29.6%) were assigned to the 25 COG classifications (Figure 6). Some unigenes were assigned to several COG categories, which lead to a total of 30140 sequences assigned in 25 COG categories. Among the 25 COG categories, the cluster of General function prediction (4,606, 15.28%) represented the largest group, followed by Transcription (2,999, 9.95%), Replication, recombination and repair (2,481, 8.23%), Function unknown(2,061, 6.84%), Signal transduction mechanisms (2,038, 6.76%), Translation, ribosomal structure and biogenesis (2,023, 6.71%), Posttranslational modification, protein turnover, chaperones (1,913, 6.35%), Carbohydrate transport and metabolism (1,810, 6.01%) and Cell cycle control, cell division, chromosome partitioning (1,507, 5.00%), Amino acid transport and metabolism (1,158, 3.84%), whereas only a few unigenes were assigned to Nulcear structure and Extracellular structure (Figure 6).
Metabolic pathway analysis by KEGG
The Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway database records the networks of molecular interactions in the cells, and variants of them specific to particular organisms. Pathway-based analysis helps us to further understand the biological functions and interactions of genes. First, based on a comparison against the KEGG database using BLASTx with an E-value threshold of 10-5, 19,846 (45.1%) sequences of the 43,990 unigenes were found to have significant matches in the database and were assigned to 126 KEGG pathways. For some genes being assigned to several KEGG pathways, there were 21,603 sequence hit in 126 KEGG pathways together. Of the 126 KEGG pathways, there were 25 pathways with over 200 unigenes assigned. The pathway of RNA transport was assigned the most unigene (1,287, 5.96%), followed by plant-pathogen interaction (1,147, 5.31%), endocytosis (979, 4.53%), glycerophospholipid metabolism (947, 4.38%), mRNA surveillance pathway (942, 4.36%), plant hormone signal transduction (928, 4.30%), ether lipid metabolism (799, 3.70%), spliceosome(745, 3.45%), starch and sucrose metabolism(565, 2.62%), protein processing in endoplasmic reticulum (493, 2.28%), whereas only no more than 10 unigenes were assigned to C5-branched dibasic acid metabolism, biotin metabolism, caffeine metabolism and betalain biosynthesis (Additional file 1: Table S1).
Identification of the genes encoding cellulose synthase (CesA genes)
As an important natural fiber, ramie fiber development is a major research area with a focus on improving the fiber yield and quality. Cellulose is a major component of the ramie fiber, and its content in the fiber significantly influences the yield and quality of ramie fiber. Out of 43,990 genes de novo assembled, 34,192 unigenes were annotated their function based on the sequence similarity search against the public databases. By searching the annotation with the keyword “cellulose synthase” from these 34,192 genes, 51 CesA genes were identified. Among these 51 genes, 30 unigenes have a significant homology (less than 1.0E-50) with the CesA gene of other species; 22 genes encoding proteins have their orthologous protein with more than 80% sequence similarity; 21 ramie CesA genes are homologous with the CesA gene of Vitis vinifera (Additional file 2: Table S2). Additionally, the conserved domain of ramie 51 CesA genes encoding protein was searched from the conserved domain database (CCD) and the result showed that 23 CesA gene-encoded proteins possessed the conserved domain of cellulose synthase. Interestingly, there are 12 genes assigned into the KEGG pathway of starch and sucrose metabolism involved in cellulose biosynthesis, and 39 genes were not assigned to any pathway (Additional file 2: Table S2).
In order to identify the potential candidates of the CesA genes involved in bast fiber biosynthesis, the expression levels of all 51 CesA genes in the stem bark, stem xylem, stem shoot and leaves were analyzed by RT-qPCR assay. The result showed that there were 36 CesA genes with relatively high expression levels in the stem bark from which the fiber was extracted (Additional file 2: Table S2). Among the 36 genes expressed in the bark, 33 genes showed higher expression levels in the bark than in other tissues (Figure 7); the Unigene14589 displayed a relatively high expression level in the stem bark and xylem and CL1547.contig1 showed similarly high expression levels in the stem bark, shoot and leaves, respectively (Figure 7); while the Unigene 21178 showed a constitutive expression in stem bark, shoot, xylem and leaf (Additional file 2: Table S2).