Ultra-high-throughput mRNA sequencing is a fast, efficient, and cost-effective way to characterize the poly(A)+ transcriptome. It is especially suitable for gene expression profiling in non-model organisms that lack genomic sequences. To date, most sequencing efforts in C. sinensis have been based on EST sequencing, with a limited number of tags reported in public databases. In this study, we applied RNA-seq to C. sinensis transcriptome profiling, sequencing the poly(A)+ transcriptome on the Illumina GA IIx platform. We obtained 2.59 Gbp of coverage from 34.5 million high-quality reads. De novo assembly generated a total of 127,094 unigenes (≥100 bp), of which 55,088 were annotated. Our coverage is approximately 10-fold greater than that of all C. sinensis sequences deposited in GenBank combined (as of August 2010).
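For orientation, the read and base counts quoted above imply a mean read length of roughly 75 bp. The short Python sketch below works this out; it assumes that the 2.59 Gbp total refers to the same 34.5 million high-quality reads, and the variable names are illustrative only.

```python
# Figures quoted in the text.
TOTAL_BASES = 2.59e9          # 2.59 Gbp of sequence
HIGH_QUALITY_READS = 34.5e6   # 34.5 million high-quality reads

# Assumption: every counted base belongs to one of the high-quality reads.
mean_read_length = TOTAL_BASES / HIGH_QUALITY_READS

print(f"Implied mean read length: {mean_read_length:.1f} bp")
```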
Since C. sinensis is self-incompatible and recalcitrant to genetic manipulation, little genetic or genomic information is currently available. Therefore, rather than a comprehensive in-depth investigation of the tea transcriptome, our experiment was designed to generate a quick landscape view. We adopted several strategies to obtain sufficient coverage of expressed transcripts, to improve the accuracy of de novo assembly, and to increase the effectiveness of gene annotation. First, RNA was prepared from seven organs of the tea plant, selected to make coverage as comprehensive as possible. Second, the Illumina library was constructed using the RNA fragmentation method, which has been shown to reduce RNA secondary structure and 5' bias and to give better overall uniformity. Third, a paired-end sequencing strategy was applied, not only to increase sequencing depth but also to improve the efficiency of de novo assembly. Finally, six public databases were used for gene annotation comparisons in order to capture functional information as completely as possible.
As a result, 55,088 unigenes (43.3% of all assembled unigenes) returned significant hits in BLAST comparisons with the six public databases. These unigenes were assigned not only gene or protein name descriptions, but also putative conserved domains, gene ontology terms, and metabolic pathways. Such detailed functional information is important for understanding the overall expression profile of C. sinensis. In particular, 9,139 unigenes had hits in all six public databases. Because these genes had relatively unambiguous annotations, they were selected for tea-specific pathway analyses. The remaining 72,006 unigenes (56.7% of all assembled unigenes) showed no significant homology to existing genes. The absence of homology could be caused by several factors. Notably, a large proportion (82.1%) of the unigenes were shorter than 500 bp, and some of these were too short to allow statistically meaningful matches. For other unigenes, however, the absence of homologous sequences in the public databases may indicate roles specific to C. sinensis. We are currently cataloging the longer unigenes (≥500 bp; 22,757 unigenes) in tea plants.
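As a consistency check on the proportions quoted above, the arithmetic can be sketched in a few lines of Python. The counts are taken directly from the text; the helper function and variable names are illustrative only, and the sketch assumes that the 82.1% figure is relative to all 127,094 assembled unigenes.

```python
# Counts quoted in the text.
TOTAL_UNIGENES = 127_094   # all assembled unigenes (>= 100 bp)
ANNOTATED = 55_088         # unigenes with a significant BLAST hit
LONG_UNIGENES = 22_757     # unigenes of length >= 500 bp


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


unannotated = TOTAL_UNIGENES - ANNOTATED
short_unigenes = TOTAL_UNIGENES - LONG_UNIGENES  # assumed: shorter than 500 bp

annotated_pct = percent(ANNOTATED, TOTAL_UNIGENES)
unannotated_pct = percent(unannotated, TOTAL_UNIGENES)
short_pct = percent(short_unigenes, TOTAL_UNIGENES)

print(f"Annotated:    {ANNOTATED:>7,} ({annotated_pct}%)")
print(f"Unannotated:  {unannotated:>7,} ({unannotated_pct}%)")
print(f"Under 500 bp: {short_unigenes:>7,} ({short_pct}%)")
```

Under that assumption, the three percentages reproduce the values reported in the text, which suggests that the length statistic describes the whole assembly rather than only the unannotated subset.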
The annotated unigenes were used to study primary and secondary metabolic pathways. For the four primary metabolic pathways investigated, all essential structural genes were found (Additional File 3). The putative pathway genes from tea were highly similar to those from the model dicot plant Arabidopsis or other dicot plants. We also analyzed six families of house-keeping genes to evaluate the completeness of our transcriptome coverage. More than 70.5% of these high-copy-number genes had full-length ORFs. We believe that future large-scale sequencing efforts on the tea genome and transcriptome will increase the coverage of our dataset even further.
The quality of tea depends in large part on its metabolic profile. We focused on flavonoid, theanine, and caffeine biosynthesis for additional analyses. We were able to find almost all metabolic genes from these pathways (Figure 7 and Additional File 4). Many of these genes appeared to form multi-gene families. This implies that the tea genome, like those of many other higher plants, went through one or more rounds of genome duplication during evolution. C. sinensis has a diploid genome; thus, extensive genome rearrangement might have occurred. We are interested in using SNP analysis to better understand the genome structure once more RNA-seq data have been obtained.
A few genes are currently missing from these pathways, which might be due to low expression, insufficient sampling, or ineffective annotation. Some of these genes, such as guanosine deaminase and N-methylnucleosidase, have not been reported in plants before. On the other hand, we found some genes that might play important roles in the above-mentioned pathways. For example, many glycosylation enzymes and cytochrome P450 genes were discovered in the transcriptome, which might contribute to the extensive modifications of various secondary compounds found in tea leaf extracts.
By comparing our transcriptome data with four previously prepared cDNA libraries analyzed by EST sequencing, we showed that the number of unigenes from RNA-seq was approximately 20 times greater than that from the existing cDNA libraries. Yet a small number of genes discovered in the cDNA libraries did not generate BLAST hits in the Illumina transcriptome, a gap that could be closed by increasing the sequencing depth, improving the accuracy of the assembly, and refining the gene annotation strategies. We selected two sets of structural enzymes to validate our gene annotations. Each gene generated a band of the expected size by RT-PCR, and qRT-PCR analysis showed consistent expression patterns. We are confident that our transcriptome dataset is a valuable addition to the publicly available tea genomic information.