- Research article
- Open Access
De novo assembled expressed gene catalog of a fast-growing Eucalyptus tree produced by Illumina mRNA-Seq
- Eshchar Mizrachi†1,
- Charles A Hefer†2,
- Martin Ranik1,
- Fourie Joubert2 and
- Alexander A Myburg1Email author
© Mizrachi et al; licensee BioMed Central Ltd. 2010
- Received: 30 May 2010
- Accepted: 1 December 2010
- Published: 1 December 2010
De novo assembly of transcript sequences produced by short-read DNA sequencing technologies offers a rapid approach to obtain expressed gene catalogs for non-model organisms. A draft genome sequence will be produced in 2010 for a Eucalyptus tree species (E. grandis) representing the most important hardwood fibre crop in the world. Genome annotation of this valuable woody plant and genetic dissection of its superior growth and productivity will be greatly facilitated by the availability of a comprehensive collection of expressed gene sequences from multiple tissues and organs.
We present an extensive expressed gene catalog for a commercially grown E. grandis × E. urophylla hybrid clone constructed using only Illumina mRNA-Seq technology and de novo assembly. A total of 18,894 transcript-derived contigs, a large proportion of which represent full-length protein coding genes were assembled and annotated. Analysis of assembly quality, length and diversity show that this dataset represent the most comprehensive expressed gene catalog for any Eucalyptus tree. mRNA-Seq analysis furthermore allowed digital expression profiling of all of the assembled transcripts across diverse xylogenic and non-xylogenic tissues, which is invaluable for ascribing putative gene functions.
De novo assembly of Illumina mRNA-Seq reads is an efficient approach for transcriptome sequencing and profiling in Eucalyptus and other non-model organisms. The transcriptome resource (Eucspresso, http://eucspresso.bi.up.ac.za/) generated by this study will be of value for genomic analysis of woody biomass production in Eucalyptus and for comparative genomic analysis of growth and development in woody and herbaceous plants.
- Gene Catalog
- Immature Xylem
- Express Gene Sequence
- Digital Expression Profile
- Genome Annotation Effort
Ultra-high-throughput second-generation DNA sequencing technologies from companies such as Roche (454 pyrosequencing), Illumina (sequencing by synthesis, Solexa GA) and Applied Biosystems (sequencing by ligation, SOLiD), are increasingly being used for novel exploratory genomics in small to medium-sized laboratories. "Short-read" (36 - 72 nt) technologies such as those of Illumina and Applied Biosystems have proven to be exceptionally successful in a wide variety of whole-transcriptome investigations [1–5], but most of these studies have relied on prior sequence knowledge such as an annotated genome for qualitative and quantitative transcriptome analyses.
Genome assembly of short sequences without any auxiliary knowledge has primarily utilized 454 sequencing data, due to the longer individual read lengths of 150-400 base pairs (bp). However, short-read sequencing (Illumina GA and SOLiD) has been successfully used for de novo assembly of small bacterial genomes (2-5 Mbp), where 36 bp reads have been assembled [6–8] and hybrid approaches, where genomes are de novo assembled using a combination of reads from multiple sequencing platforms to overcome the inherent limitations of each technology, have been used to successfully assemble genomes of up to 40 Mbp [9, 10]. More recently, the sequencing of the giant panda genome was demonstrated  using de novo assembly of sequence derived from a single platform (Illumina), but utilizing a combination of different insert sizes, allowing assembly of an estimated 94% of the genome (2.25 Gbp). De novo assembly of large, highly repetitive and highly heterozygous eukaryotic genomes from short-read data remains a challenge.
In transcriptome studies, 454 pyrosequencing has proven very useful for generating ESTs representing the majority of expressed genes. This has enabled gene discovery in a variety of previously uncharacterized eukaryotic organisms with no or little a priori DNA sequence information [12–16]. However, relatively few published studies have attempted de novo assembly of whole-transcriptome sequences from short-read data such as that generated by Illumina GA or SOLiD technologies. Assembly of short (36-72 bp) read data into accurate, contiguous transcript sequences has only recently been reported [17–19] demonstrating that assembly of long, potentially full-length, transcript assemblies is indeed possible.
Eucalyptus tree species and hybrids presently constitute the most widely planted (≈ 20 Mha) and commercially important hardwood fibre crop in the world. They are mainly utilized for timber, pulp and paper production . Their fast growth rates and wide adaptability may in future allow sustainable and cost efficient production of woody biomass for bioenergy generation [21, 22]. Eucalyptus will soon be only the second forest plantation genus (after Populus) for which a reference genome sequence will be completed by end 2010 . To support the genome annotation effort, there is much value in having a dataset of genes with strong transcriptional evidence across a range of tissues and developmental stages. Until recently, limited amounts of Eucalyptus EST/unigene data were available in public databases, mainly due to the fact that commercial interests have necessitated private EST collections . As of March 2010, aside from a mixed-species collection of ≈56,000 nucleotide sequences on NCBI (≈ 37,000 of which are Sanger EST sequences) and which contain extensive redundancy, the largest effort to date to generate a comprehensive catalogue of expressed genes in a single Eucalyptus species was based on 454 sequencing of cDNA fragments from E. grandis trees . While this study provided an excellent representation of expressed genes and gene ontology classes in E. grandis, the relatively short lengths of the assembled contigs (mean length of 389 bp for all contigs longer than 200 bp) meant that very few complete gene models were represented. There remains therefore a fundamental need for a high-quality expressed gene catalog for Eucalyptus, to support genome annotation efforts and discern authentically expressed genes from predicted gene models, as well as for future genomics research, which will include transcriptome, proteome and metabolome profiling.
In the process of producing such a high-quality expressed gene catalog for Eucalyptus, we addressed three main questions: First, is it feasible to de novo assemble Illumina mRNA-Seq data into contiguous, near full-length gene model sequences for Eucalyptus? Second, what genes make up the expressed gene catalog for a fast-growing Eucalyptus plantation tree? Finally, can we re-use the mRNA-Seq data to create a tissue and organ-specific digital expression profile for each assembled contig? We addressed these questions by generating a comprehensive set of expressed gene sequences from a commercially grown Eucalyptus hybrid (E. grandis × E. urophylla) clone using Illumina mRNA-Seq technology and de novo short-read assembly. We report herein the complete annotation of the expressed gene catalog based on comparative analysis with the published Arabidopsis thaliana , Populus trichocarpa  and Vitis vinifera  protein-coding datasets. We describe an interactive database of annotated transcript sequences, coding sequences (CDSs) and derived protein sequences (Eucspresso, http://eucspresso.bi.up.ac.za/, CA Hefer, E Mizrachi, AA Myburg, F Joubert, unpublished), which will be continuously updated and curated in association with the Eucalyptus Genome Network (EUCAGEN, http://www.eucagen.org) as part of an effort to initiate a publicly accessible database for Eucalyptus transcriptomics research similar to that produced for Populus .
De novo assembly, validation and annotation of contigs
To compare the completeness of our expressed gene catalogue to that of all publicly available gene sequence data for Eucalyptus, we generated a separate dataset, termed EucALL, containing all publicly available Eucalyptus gene sequence data to date (March 2010). This included all NCBI unigenes and ESTs, assembled 454 EST data from E. grandis leaf tissue (DOE-JGI, http://eucalyptusdb.bi.up.ac.za/), assembled 454 EST data produced by Novaes and colleagues , and the EucaWood contig dataset . We compared the representation of Arabidopsis genes in the EucALL dataset and in our assembled E. grandis × E. urophylla (EGU) transcript dataset by BLASTX at significance levels of < 1e-05, < 1e-10 and < 1e-20 (Additional file 1 Table S2). While the overall numbers of hits were higher in the EucALL dataset, these were mostly in the lower size ranges. For our de novo assembled contigs, a much higher number of significant hits in contigs larger than 2000 bp in size (6,602 compared to 1,940 at significance < 1e-10) suggested that a greater proportion of our contigs represent full-length gene models than the publicly available Eucalyptus gene sequence set (EucALL).
Functional annotation of the expressed gene catalog
Digital expression profiling
An accepted method of identifying large scale differences in gene expression is to use EST abundance as an indicator of transcript abundance. This method has been implemented and validated in numerous studies using Sanger-derived ESTs [34, 35], as well as 454-pyrosequencing methods [13, 36–39]. Quantitative transcriptome analysis using ultra-high-throughput sequencing technologies such as Illumina and SOLiD has been shown to be accurate and highly correlated with other quantitative methods such as RT-qPCR and microarray analysis [1, 5]. To quantify tissue-specific transcript abundance reflected in our short-read dataset, we combined data (multiple lanes in most cases) generated from the same tissues and mapped six tissue-specific datasets (Additional file 1 Table S1) to the assembled gene catalog using Bowtie . Following this, we used the Cufflinks  program (http://cufflinks.cbcb.umd.edu), which provides relative abundance values by calculating Fragments Per Kilobase of exon per Million fragments mapped (FPKM) as validated previously . This enabled the allocation of a tentative digital expression profile for each transcript-derived contig (Additional file 4).
We also interrogated our transcriptome data using the "core xylem gene set" identified in Arabidopsis by Ko and colleagues . Of the 52 genes identified by the authors as markers of secondary xylem formation in Arabidopsis, 33 had putative homologues in the Eucalyptus transcriptome (BLASTX, < 1e-10) and in total 43 contigs were identified. Of these, 40 (93%) showed greater than two-fold "Xylem" to "Leaf" digital expression profile ratios and six were only detected in xylem tissues (Additional file 1 Table S4). Most of the expression profiles were also highly correlated with that of secondary cell wall-specific Eucalyptus cellulose synthase genes, similar to the patterns previously observed in Arabidopsis. These results are comparable to the 80% (51 out of 63 genes) reported recently for the same set of Arabidopsis homologs in Populus , which provided further support for the biological validity of the short-read-based digital expression profiles associated with the Eucalyptus expressed gene catalog.
Public data resource
We constructed a public data resource, Eucspresso (http://eucspresso.bi.up.ac.za), which provides a searchable interface to the assembled contigs. The database can be queried based on closest homologous entry in the Arabidopsis thaliana (TAIR9), Populus trichocarpa (Version 2.0) and Vitis vinifera (Sept 2009 build) sequence data sets. Simple and compound keyword searches can be performed based on all of the functional annotation terms and the predicted coding and protein sequences can be obtained for all contigs. Finally, the tissue-specific (FPKM) digital expression profile and the location of each contig in the draft 8X E. grandis genome assembly (http://eucalyptusdb.bi.up.ac.za/) can be viewed from within Eucspresso.
We have assembled nearly 19,000 expressed gene sequences from xylogenic and non-xylogenic tissues of an actively growing Eucalyptus plantation tree using only Illumina mRNA-Seq technology and de novo short-read assembly. Quality control comparisons to full-length, cloned, Sanger-derived transcript sequences from Eucalyptus, as well as multiple lines of evidence such as CDS prediction and Pfam prediction showed that the transcript assemblies are robust and that thousands of full-length coding sequences and their respective 5' and/or 3' UTR regions were successfully assembled. Comparison of assembled gene models to gene catalogs of other angiosperm species by BLAST analysis and functional annotation (GO, InterProScan and KEGG category numbers and proportions, Figure 4 and Additional file 1 - Figure S6, Figure S7 and Figure S8) indicate that we have sampled an expansive and diverse expressed gene catalog representing a large proportion of the genes expressed in mature Eucalyptus trees across a variety of woody and non-woody tissues. Comparison to all publicly available Eucalyptus DNA sequence suggests that we have sampled a more comprehensive set of genes, which is also more complete in length (Additional file 1 - Table S2) from a single eucalypt tree genotype than has been available to date for the entire genus. Additionally, using a validated approach to quantify mRNA-Seq data we have produced an informative database of transcript abundance across six Eucalyptus tree tissues, which, due to the depth of sequencing, results in higher sensitivity and wider dynamic range than Sanger or 454-derived EST counts usually associated with this type of analysis.
A concern associated with de novo assembly of transcript sequences, be it Sanger derived  or 454 sequence derived  assemblies, is the contiguity of assembled sequences. This concern intuitively increases as the read length decreases, and may be one of the main reasons why most transcriptome de novo assembly approaches have utilized technologies with longer read lengths to date. We provide several lines of evidence which jointly support the contiguity of transcript sequences assembled in our study using short-read data. First, a high proportion of the contigs exhibited high-confidence BLASTX similarity to protein sequences from annotated gene catalogs of three angiosperm species Arabidopsis, Populus and Vitis (Figure 3). Second, a large proportion of the contigs contained long, near full-length, predicted CDSs (Figure 3). Third, InterproScan analysis predicted 45,687 protein domains, which is indicative of contiguous, in-frame predicted protein sequences (Additional File 1). Finally, a random subset of the contigs, which represented a variety of length and read coverage, were validated by direct alignment to previously published, Sanger sequenced, full-length Eucalyptus genes that were directly cloned from cDNA (Additional File 3).
Assigning biological significance to de novo assembled contigs should be approached with caution. In our study, 13,806 assembled gene models (73.07% of the total assembled contigs, Figure 3A) were considered high confidence annotations due to the presence of a significant high stringency BLAST hit in other angiosperm species, as well as a predicted CDS. These contigs had relatively high coverage per base (CPB) values (median 47X) as compared to contigs lacking a predicted CDS (median CPB of 20× or lower, Figure 3B and 3D and Supplemental Table S3). Thus, a lack of CDS prediction was generally associated with low gene expression level and low CPB, which resulted in 'N's inserted by Velvet in the contig sequences (Figure 3B and 3D and Supplemental Table S3). The assembly quality and annotation of these sequences could be improved in future by even deeper sequencing and the addition of data from new tissue types. Another possible source of error is the spurious prediction of CDSs in long, non-coding RNAs, which has been previously shown to occur [44, 45]. It is notable that of the 1,813 Eucalyptus-derived contigs with no significant BLAST hit to other angiosperms, but containing a predicted CDS (Figure 3C), only 81 contigs had predicted InterProScan domains. Additionally, the median CDS to contig length ratio was 0.33, as compared to 0.62 in the 13,806 high confidence contigs in Figure 3A, which suggests that many of these CDS predictions may be false positives. De novo assembled transcriptome datasets lack the ability to distinguish and classify the lower confidence annotations, an exercise that is beyond the scope of this study, albeit one that can be resolved once a genome-based predicted set of gene models is available.
Validation of the digital expression (FPKM) profiles using the "core xylem gene set" identified in Arabidopsis  has precedence in similar investigations in conifers , cotton  and poplar . This analysis, combined with the results shown in Figure 5A and Figure 5B, lend support to the biological significance of digital expression profiles derived from short-read sequencing technology, which will assist in the discovery and annotation of novel Eucalyptus genes - and using the genome sequence, promoters - playing key roles in growth and development, and particularly in woody biomass production. The Eucspresso online resource produced from this study, as well as future comparative analysis with other woody species such as Vitis and Populus, will be valuable for studying the unique biology of woody perennials.
Taking into consideration the number, length, coverage and quality of assembled gene models, as well as their digital expression profiles, this dataset surpasses several previous de novo transcriptome assemblies using Illumina [17, 18] or 454 technology [13–16]. This can primarily be attributed to the amount of data generated (3.93 Gbp of non-rRNA derived reads), the diversity of tissues sampled and strategy of paired-end sequencing, as well as read-length (mostly 50-60 bp, compared to only 36 bp in earlier studies). Our dataset was generated using several generations of Illumina GA technology, but considering the current throughput of Illumina sequencing (up to 100 Gbp per flowcell), a gene catalog of this scale can now be produced using a single lane of Illumina mRNA-Seq. Finally, non-normalized short-read data will be extremely useful for downstream applications such as digital gene expression profiling and detection of alternative transcript structure, once reference models are available from the genome.
Plant tissue collection
Tissues from a six-year-old ramet of a commercially grown E. grandis × E. urophylla hybrid clone (GUSAP1, Sappi Forestry, Kwambonambi, South Africa) were collected in a clonal field trial and immediately frozen in liquid nitrogen, as previously described by Ranik and Myburg . The following tissues were sampled from approximately breast height (1.35 m) on the main stem following bark removal: immature xylem (outer glutinous 1-2 mm layer comprising early developing xylem tissue) and xylem (after removal of the immature xylem layer, 2-mm-deep planing including xylem cells in advanced stages of maturity). Early developing phloem tissue including small amounts of cambial cells was collected by scraping the first 1-2 mm layer from the inner surface of the bark. Additionally, we sampled shoot tips (soft green termini of young crown tip branches containing shoot primordia and apical meristems), young leaves (rapidly-growing leaves in the process of unfolding) and mature leaves (older, fully expanded leaves of the current growth season).
Paired-end mRNA-Seq library preparation and sequence generation
Total RNA was extracted from the six tissues using the protocol described previously . Total RNA quality and concentration were determined using the Agilent RNA 6000 Pico kit (Agilent, Santa Clara, CA) on a 2100 Bioanalyzer (Agilent). Enrichment of polyA+ RNA was performed using the Oligotex midi kit (Qiagen, Valencia, CA). Two hundred nanograms of polyA+ RNA were fragmented in 1× RNA fragmentation solution (Ambion, Austin, TX) at 70°C for 5 minutes. The fragmented RNA was precipitated with three volumes of ethanol and re-dissolved in water. Double-stranded cDNA was synthesized using the cDNA Synthesis System (Roche, Indianapolis, IN) according to manufacturer's instructions using random hexamers (Invitrogen, Carlsbad, CA) to prime the first strand cDNA synthesis. Paired-end libraries with approximate average insert lengths of 200 base pairs were synthesized using the Genomic Sample Prep kit (Illumina, San Diego, CA) according to manufacturer's instructions. Prior to cluster generation, library concentration and size were assayed using the Agilent DNA1000 kit (Agilent) on a 2100 Bioanalyzer (Agilent). Libraries were sequenced on a Genome Analyzer equipped with a paired-end module (versions I, II and IIx, Illumina).
De novo assembly of mRNA-Seq data
After removing sequences containing low quality bases ('N's) or single base repeats and ribosomal RNA sequences, the 3.93 Gbp dataset was used for assembly and subsequent coverage per base (CPB) estimation for each assembled contig. We assembled the filtered Illumina paired-end (PE) reads using Velvet version 0.7.30 . Previous studies [1–3, 50] have demonstrated that mRNA-Seq technology produces uneven coverage over a transcript, which prompted us to follow a coverage-assisted reference assembly strategy. Using Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik) to align the filtered Illumina PE sequences to the assembled contigs, the average coverage per contig was calculated. A custom script was then developed to extract the pairs of sequences that mapped to each contig, and using that contig as a template, each contig was re-assembled using Velvet with the associated expected coverage parameter set to the Mosaik average coverage value for that contig.
The degree to which the assembled contigs represented long, contiguous RNA transcript sequences, was evaluated by aligning 35 Velvet contigs and their respective predicted CDSs to full-length, cloned, Sanger-derived Eucalyptus reference sequences present in NCBI. CPB was calculated for the sequences using BWA  and a global pairwise alignment of the sequences was performed using the Needle package from EMBOSS . Plots were constructed from the alignments with the CPB on the y-axis of the plot. Zero coverage values were assigned to gaps in the alignments. This revealed where gaps and/or potentially misassembled regions were present in the assembled contigs, and to what depth these contigs were sequenced.
Coding sequence prediction
Coding sequence predictions were performed using GENSCAN  and AUGUSTUS , predicting 15,713 and 15,904 proteins respectively. The difference in coding sequences predicted could be attributed to the different training data sets used and inherent difficulty of predicting coding sequences from incomplete genomic sequences. The GENSCAN results (15,713 predicted proteins) were used in downstream analyses.
Annotation of assembled contigs
Homology searches were performed against public sequence databases. The newest versions as of February 2010 of the protein sequences of Arabidopsis (TAIR 9), Vitis (Sept 2009 build) and Populus (version 2.0, Phytozome) were used to construct the individual BLAST datasets. The Eucalyptus public dataset (EucAll) consisted of 45,442 entries in Genbank (downloaded March 2010), 13,930 entries from the Eucalyptus Wood unigenes and ESTs , E. grandis leaf tissue ESTs (120,661 entries from DOE-JGI-produced 454 sequences, http://eucalyptusdb.bi.up.ac.za/) and 190,106 Unigenes and singlets from E. grandis 454 data . The BLAST e-value threshold was set at 1e-10, with a minimum alignment length of 100 nucleotides (33 amino acids). Functional annotation (GO and KEGG) was performed using BLAST2GO , using the default annotation parameters (BLAST e-value threshold of 1e-06, Gene Ontology annotation threshold of 55). InterPro annotations were performed using InterProScan (http://www.ebi.ac.uk/Tools/InterProScan/).
Coverage and FPKM determination
Sequence depth and base coverage were calculated using BWA (Lin et al. 2009) and the FPKM values estimated by aligning the Illumina reads to the assembled transcriptome using Bowtie  and estimating the expression level of each predicted transcript (FPKM value) using Cufflinks (http://cufflinks.cbcb.umd.edu) .
The authors would like to acknowledge J. Rees and J.-M. Celton of the University of the Western Cape (Cape Town, South Africa) for assistance with Illumina GA sequencing. Plant materials were kindly provided by Sappi Forestry (Kwambonambi, South Africa). This work was supported through a strategic research grant from the South African Department of Science and Technology (DST) and by research funding from Sappi and Mondi, through the Wood and Fibre Molecular Genetics (WFMG) Programme, the Technology and Human Resources for Industry Programme (THRIP) and the National Research Foundation (NRF) of South Africa.
- Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, et al: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5 (7): 613-619. 10.1038/nmeth.1223.PubMedView ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320 (5881): 1344-1349. 10.1126/science.1158441.PubMed CentralPubMedView ArticleGoogle Scholar
- Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, et al: mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009, 6 (5): 377-382. 10.1038/nmeth.1315.PubMedView ArticleGoogle Scholar
- Wilhelm BT, Landry JR: RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods. 2009, 48 (3): 249-257. 10.1016/j.ymeth.2009.03.016.PubMedView ArticleGoogle Scholar
- Farrer RA, Kemen E, Jones JDG, Studholme DJ: De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads: RESEARCH LETTER. FEMS Microbiol Lett. 2009, 291 (1): 103-111. 10.1111/j.1574-6968.2008.01441.x.PubMedView ArticleGoogle Scholar
- Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J: De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res. 2008, 18 (5): 802-809. 10.1101/gr.072033.107.PubMed CentralPubMedView ArticleGoogle Scholar
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods. 2009, 6 (4): 291-295. 10.1038/nmeth.1311.PubMed CentralPubMedView ArticleGoogle Scholar
- DiGuistini S, Liao N, Platt D, Robertson G, Seidel M, Chan S, Docking TR, Birol I, Holt R, Hirst M: De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biology. 2009, 10 (9): 10.1186/gb-2009-10-9-r94.Google Scholar
- Nowrousian M, Stajich JE, Chu M, Engh I, Espagne E, Halliday K, Kamerewerd J, Kempken F, Knab B, Kuo HC: De novo Assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis. PLoS Genet. 2010, e1000891-10.1371/journal.pgen.1000891. 6Google Scholar
- Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, et al: The sequence and de novo assembly of the giant panda genome. Nature. 2010, 463 (7279): 311-317. 10.1038/nature08696.PubMed CentralPubMedView ArticleGoogle Scholar
- Dassanayake M, Haas JS, Bohnert HJ, Cheeseman JM: Shedding light on an extremophile lifestyle through transcriptomics. New Phytol. 2009, 183 (3): 764-775. 10.1111/j.1469-8137.2009.02913.x.PubMedView ArticleGoogle Scholar
- Hahn DA, Ragland GJ, Shoemaker DD, Denlinger DL: Gene discovery using massively parallel pyrosequencing to develop ESTs for the flesh fly Sarcophaga crassipalpis. BMC Genomics. 2009, 10 (234):Google Scholar
- Meyer E, Aglyamova GV, Wang S, Buchanan-Carter J, Abrego D, Colbourne JK, Willis BL, Matz MV: Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics. 2009, 10 (219):Google Scholar
- Novaes E, Drost DR, Farmerie WG, Pappas GJ, Grattapaglia D, Sederoff RR, Kirst M: High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics. 2008, 9 (312):Google Scholar
- Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol. 2008, 17 (7): 1636-1647. 10.1111/j.1365-294X.2008.03666.x.PubMedView ArticleGoogle Scholar
- Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, et al: De novo transcriptome assembly with ABySS. Bioinformatics. 2009, 25 (21): 2872-2877. 10.1093/bioinformatics/btp367.PubMedView ArticleGoogle Scholar
- Gibbons JG, Janson EM, Hittinger CT, Johnston M, Abbot P, Rokas A: Benchmarking next-generation transcriptome sequencing for functional and evolutionary genomics. Mol Biol Evol. 2009, 26 (12): 2731-2744. 10.1093/molbev/msp188.PubMedView ArticleGoogle Scholar
- Wu T, Qin Z, Zhou X, Feng Z, Du Y: Transcriptome profile analysis of floral sex determination in cucumber. J Plant Physiol. 2010, 167 (11): 905-913. 10.1016/j.jplph.2010.02.004.PubMedView ArticleGoogle Scholar
- Eldridge K, Davidson J, Harwood C, van Wyk G: Eucalypt domestication and breeding. 1993, Oxford: Clarendon PressGoogle Scholar
- FAO: Forests and Energy. FAO Forestry Paper No. 2008, 154: (Rome):(ISBN 978-992-975-105985-105982)Google Scholar
- Hinchee M, Rottmann W, Mullinax L, Zhang C, Chang S, Cunningham M, Pearson L, Nehra N: Short-rotation woody crops for bioenergy and biofuels applications. In Vitro Cell Dev Biol - Plant. 2009, 45 (6): 619-629. 10.1007/s11627-009-9235-5.PubMed CentralPubMedView ArticleGoogle Scholar
- Myburg AA, Grattapaglia D, Tuskan GA, Schmutz J, Barry K, Bristow J, The Eucalyptus Genome Network: Sequencing the Eucalyptus genome: Genomic resources for renewable energy and fiber production. Plant & Animal Genome XVI Conference: January 12-16, 2008; San Diego, CA. 2008Google Scholar
- Hibino T: "Post-genomics" research in Eucalyptus in the near future. Plant Biotechnol. 2009, 26 (1): 109-113.View ArticleGoogle Scholar
- Kaul S, Koo HL, Jenkins J, Rizzo M, Rooney T, Tallon LJ, Feldblyum T, Nierman W, Benito MI, Lin X, et al: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408 (6814): 796-815. 10.1038/35048692.View ArticleGoogle Scholar
- Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam M, Ralph S, Rombauts S, Salamov A, et al: The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006, 313 (5793): 1596-1604. 10.1126/science.1128691.PubMedView ArticleGoogle Scholar
- Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007, 449 (7161): 463-467. 10.1038/nature06148.PubMedView ArticleGoogle Scholar
- Sjödin A, Street NR, Sandberg G, Gustafsson P, Jansson S: The Populus Genome Integrative Explorer (PopGenIE): A new resource for exploring the Populus genome. New Phytol. 2009, 182 (4): 1013-1025. 10.1111/j.1469-8137.2009.02807.x.PubMedView ArticleGoogle Scholar
- Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralPubMedView ArticleGoogle Scholar
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268 (1): 78-94. 10.1006/jmbi.1997.0951.PubMedView ArticleGoogle Scholar
- Pinheiro M, Afreixo V, Moura G, Freitas A, Santos MAS, Oliveira JL: Statistical, computational and visualization methodologies to unveil gene primary structure features. Methods Inf Med. 2006, 45 (2): 163-168.PubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.PubMedView ArticleGoogle Scholar
- Rengel D, Clemente HS, Servant F, Ladouce N, Paux E, Wincker P, Couloux A, Sivadon P, Grima-Pettenati J: A new genomic resource dedicated to wood formation in Eucalyptus. BMC Plant Biol. 2009, 9 (36):Google Scholar
- Geisler-Lee J, Geisler M, Coutinho PM, Segerman B, Nishikubo N, Takahashi J, Aspeborg H, Djerbi S, Master E, Andersson-Gunneras S, et al: Poplar carbohydrate-active enzymes. Gene identification and expression analyses. Plant Physiol. 2006, 140 (3): 946-962. 10.1104/pp.105.072652.PubMed CentralPubMedView ArticleGoogle Scholar
- Pavy N, Laroche J, Bousquet J, Mackay J: Large-scale statistical analysis of secondary xylem ESTs in pine. Plant Mol Biol. 2005, 57 (2): 203-224. 10.1007/s11103-004-6969-7.PubMedView ArticleGoogle Scholar
- Hale MC, McCormick CR, Jackson JR, DeWoody JA: Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): The relative merits of normalization and rarefaction in gene discovery. BMC Genomics. 2009, 10 (203):Google Scholar
- Kristiansson E, Asker N, Förlin L, Joakim DGJ: Characterization of the Zoarces viviparus liver transcriptome using massively parallel pyrosequencing. BMC Genomics. 2009, 10 (345):Google Scholar
- Schwarz D, Robertson HM, Feder JL, Varala K, Hudson ME, Ragland GJ, Hahn DA, Berlocher SH: Sympatric ecological speciation meets pyrosequencing: Sampling the transcriptome of the apple maggot Rhagoletis pomonella. BMC Genomics. 2009, 10 (633):Google Scholar
- Weber APM, Weber KL, Carr K, Wilkerson C, Ohlrogge JB: Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 2007, 144 (1): 32-42. 10.1104/pp.107.096677.PubMed CentralPubMedView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): 10.1186/gb-2009-10-3-r25.Google Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralPubMedView ArticleGoogle Scholar
- Ko JH, Beers EP, Han KH: Global comparative transcriptome analysis identifies gene network regulating secondary xylem development in Arabidopsis thaliana. Mol Genet Genomics. 2006, 276 (6): 517-531. 10.1007/s00438-006-0157-1.PubMedView ArticleGoogle Scholar
- Dharmawardhana P, Brunner AM, Strauss SH: Genome-wide transcriptome analysis of the transition from primary to secondary stem development in Populus trichocarpa. BMC Genomics. 2010, 11 (1): 150-10.1186/1471-2164-11-150.PubMed CentralPubMedView ArticleGoogle Scholar
- Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES: Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci USA. 2007, 104 (49): 19428-19433. 10.1073/pnas.0709013104.PubMed CentralPubMedView ArticleGoogle Scholar
- Dinger ME, Pang KC, Mercer TR, Mattick JS: Differentiating protein-coding and noncoding RNA: Challenges and ambiguities. PLoS Comput Biol. 2008, 4 (11): 1-5. 10.1371/journal.pcbi.1000176.View ArticleGoogle Scholar
- Pavy N, Boyle B, Nelson C, Paule C, Giguère I, Caron S, Parsons LS, Dallaire N, Bedon F, Bérubé H, et al: Identification of conserved core xylem gene sets: Conifer cDNA microarray development, transcript profiling and computational analyses. New Phytol. 2008, 180 (4): 766-786. 10.1111/j.1469-8137.2008.02615.x.PubMedView ArticleGoogle Scholar
- Betancur L, Singh B, Rapp RA, Wendel JF, Marks MD, Roberts AW, Haigler CH: Phylogenetically distinct cellulose synthase genes support secondary wall thickening in arabidopsis shoot trichomes and cotton fiber. J Integr Plant Biol. 2010, 52 (2): 205-220. 10.1111/j.1744-7909.2010.00934.x.PubMedView ArticleGoogle Scholar
- Ranik M, Myburg AA: Six new cellulose synthase genes from Eucalyptus are associated with primary and secondary cell wall biosynthesis. Tree Physiol. 2006, 26 (5): 545-556.PubMedView ArticleGoogle Scholar
- Chang S, Puryear J, Cairney J: A simple and efficient method for isolating RNA from pine trees. Plant Mol Biol Report. 1993, 11 (2): 113-116. 10.1007/BF02670468.View ArticleGoogle Scholar
- Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. 2008, 133 (3): 523-536. 10.1016/j.cell.2008.03.029.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-10.1093/bioinformatics/btp324.PubMed CentralPubMedView ArticleGoogle Scholar
- Rice P, Longden I, Bleasby A: EMBOSS: the European molecular biology open software suite. Trends Genet. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.PubMedView ArticleGoogle Scholar
- Stanke M, Waack S: Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003, 19 (SUPPL 2): ii215-ii225.PubMedView ArticleGoogle Scholar
- Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M: Blast2GO: A universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005, 21 (18): 3674-3676. 10.1093/bioinformatics/bti610.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.