Comprehensive transcriptome analysis of the highly complex Pisum sativum genome using next generation sequencing
- Susanne U Franssen†1,
- Roshan P Shrestha†2,
- Andrea Bräutigam†2, 3,
- Erich Bornberg-Bauer1 and
- Andreas PM Weber2, 3 (corresponding author)
© Franssen et al; licensee BioMed Central Ltd. 2011
Received: 16 March 2010
Accepted: 11 May 2011
Published: 11 May 2011
The garden pea, Pisum sativum, is among the best-investigated legume plants and of significant agro-commercial relevance. Pisum sativum has a large and complex genome and accordingly few comprehensive genomic resources exist.
We analyzed the pea transcriptome at the highest accuracy achievable with current technology. We used next generation sequencing with the Roche/454 platform and evaluated and compared a variety of approaches, including diverse tissue libraries, normalization, alternative sequencing technologies, saturation estimation and diverse assembly strategies. We generated libraries from flowers, leaves, cotyledons, epi- and hypocotyl, and etiolated and light treated etiolated seedlings, comprising a total of 450 megabases. Libraries were assembled into 324,428 unigenes in a first pass assembly.
A second pass assembly reduced the number to 81,449 unigenes but produced a significant number of chimeras. Analyses of the assemblies identified the assembly step as the major opportunity for improvement. By recording frequencies of Arabidopsis orthologs hit by randomly drawn reads and fitting parameters of the saturation curve, we concluded that sequencing was exhaustive. For leaf libraries we found that normalization allows partial recovery of expression strength in addition to the desired effect of increased coverage. Based on theoretical and biological considerations we concluded that the sequence reads in the database tagged the vast majority of transcripts in the aerial tissues. A pathway representation analysis showed the merits of sampling multiple aerial tissues to increase the number of tagged genes. All results have been made available as a fully annotated database in fasta format.
We conclude that the approach taken resulted in a high-quality dataset which serves well as a first comprehensive reference set for the model legume pea. We suggest that future deep sequencing transcriptome projects of species lacking a genomics backbone will need to concentrate mainly on resolving the issues of redundancy and paralogy during transcriptome assembly.
Pisum sativum (var. Little Marvel) is a legume of agro-commercial relevance with a large genome, 4300 Mb, which is approximately five to ten times larger than that of Medicago. At least a third but possibly more than half of the genome may consist of repetitive elements. The garden pea was established as a biochemical model since it is easy to cultivate and fast growing. Additionally, unlike the Brassicaceae, it is low in glucosinolates, which interfere with enzyme activity and organelle intactness during isolation. Consequently, a large body of work on enzymes and organelles was carried out using pea as the model system, e.g. [5–11]. The presence of a sequenced genome or transcriptome is a massive advantage for the analysis of a model system, with Arabidopsis thaliana providing the best example as the first plant model with genomic resources. In the absence of a completely sequenced genome, plant EST collections produced by traditional Sanger sequencing, such as unigenes at NCBI or tentative consensus sequences at DFCI, have proven extremely useful for plant research.
Recently, it has become feasible to produce transcriptomic resources for non-model species by next generation sequencing (NGS) at reasonable cost. Next generation sequencing was employed to create transcriptome databases of species without a sequenced genome such as mangroves, eucalyptus, olive, chestnut and Artemisia annua. In all these projects 454/Roche NGS technology was used. For this RNAseq approach either fragmented mRNA or fragmented cDNA can be used as input, and modal read lengths of 100 nucleotides (nts), 250 nts and 500 nts can be obtained with the GS 20, GS FLX Standard Series and GS FLX Titanium Series, respectively, depending on the sequencer and sequencing kit employed (reviewed in [21, 23]). In the various projects different assemblers were used, employing both overlap based methods [18, 20] and strategies using de Bruijn graphs [16, 17, 19, 24], either alone or in combination. Independent of the assembler used, the contigs obtained remained fairly short compared to traditional assemblies performed with Sanger reads: for GS20 data the average contig length was 130 bases for Eucalyptus and 168 bases for chestnut; for GS FLX reads of up to 250 bases the average contig length was between 334 bases and 433 bases [16–20]. Currently, new tools are being developed for de novo assembly of these EST sequences since they are considerably shorter than ESTs generated by traditional Sanger sequencing, for example version 3.0 of the overlap based assembler MIRA and the GS De Novo Assembler (alias Newbler) developed and provided by Roche/454 Life Sciences, both designed for the longer NGS reads (Roche/454), or Velvet, which is based on de Bruijn graphs and designed for shorter NGS reads (e.g. Illumina). To our knowledge, there is currently no standard as to how 454 reads are best assembled.
After assembly, the resulting EST contigs and singletons ('unigenes') were annotated using publicly available databases and analyzed further for their biological information [16–20].
We chose to develop a transcriptome resource for Pisum sativum to (i) facilitate future biochemical, physiological, and cell biological experiments in P. sativum and (ii) evaluate different methods for generating sequence resources for non-model species with large and complex genomes. The different sequencing and assembly strategies were explored with respect to their potential for gene discovery and assessment of completeness. A transcriptome resource of the pea will greatly facilitate molecular and -omics approaches for research on this legume.
Results and Discussion
The sequence read databases yielded 450 Megabases of sequence
Properties of the different libraries. Abbreviations: COT, cotyledons; E, etiolated leaves; L, light treated etiolated leaves; EPI, epicotyl; LVN.1-5, leaf libraries 1-5; FLO, flower; HYP, hypocotyl; LVR.1, leaf library, non-normalized. (a) sequenced with GS FLX; (b) sequenced with GS 20. All libraries were normalized except LVR.1.
Columns: number of raw reads; number of reads after crossmatch clean-up; number of reads after MIRA; number of reads for the assembly; number of reads with AGI mapping; nts after crossmatch; mean read length raw; mean read length after crossmatch.
Two assemblies had very different properties
While MIRA is a conservative assembly program, CAP3 and phrap have been used to assemble transcriptomes [16, 24] and can create a more lenient assembly. A second, less stringent assembly was produced from the MIRA contigs using the TGICL pipeline, which includes CAP3 [31, 32]. Relaxed parameters of 40 bases overlap and ≥94% identity were chosen because these parameters were used for assembling mangrove transcriptomes. This resulted in a total of 81,449 unigenes, reducing the number of singletons to 18% and the number of contigs to 35% of the original number of singletons and contigs, respectively. The mean contig length increased to 454 bases. Although the number of singletons and short contigs decreased in the second pass assembly, the number of large contigs did not increase (Figure 1A). To analyze whether the assemblies only reduced redundancy or whether they also created chimeric contigs, the initial reads, the contigs of the first pass assembly, and the contigs of the second pass assembly were mapped against the transcriptomes of the model species Arabidopsis thaliana, Glycine max (soybean) and Medicago truncatula. The reads themselves mapped to 18,856 genes of A. thaliana; the unigenes of the first pass assembly mapped to 90.4% of these genes, while the unigenes of the second pass assembly mapped to only 62.2%. Mappings to G. max and to M. truncatula yielded similar results in that the second pass assembly lost about a third of the contig annotations obtained for the first pass assembly (Additional File 2). Compared with the reduced identification of genes by the second pass unigenes, the mappings to the various references indicated that the first pass assembly massively decreased redundancy while leading to only a modest decrease in matchable reference genes.
The lenient second pass assembly led to a further reduction in redundancy (Figure 1A) but also created chimeric unigenes joining sequences originating from different transcripts, as indicated by the loss of one third of the identified genes. We also attempted to benchmark a MIRA assembly performed with 454 reads only against Sanger sequencing generated coding sequences from public sources. However, there were only 2,281 partial and complete coding sequences of the garden pea available at NCBI. Comparison of these sequences to the contigs yielded no conclusions beyond the comparison to more distant reference genomes. We also attempted to use de Bruijn graph based assemblers such as Velvet and SOAP for 454 read based transcriptome assembly; however, the contig length distribution was inferior to that of the overlap based assemblers (data not shown) and the analysis was not pursued further.
All subsequent analyses were thus based on the MIRA first pass assembly unigenes. As the mappings to the three reference genomes yielded qualitatively similar results, we chose to base subsequent analyses on the A. thaliana reference as it is the plant genome which is currently best annotated.
The database annotation revealed low contamination and high redundancy
The ten most frequent AGIs that were used for the annotation of unigenes with BlastX (e-value ≤ 10^-4)
# of AGI occurrences
40S ribosomal protein S16 (RPS16C)
LHCB3 (LIGHT-HARVESTING CHLOROPHYLL B-BINDING PROTEIN 3)
LHB1B1; chlorophyll binding
LHCB2.2; chlorophyll binding
RAN-1; GTP binding/GTPase/protein binding
RBCS1A (RIBULOSE BISPHOSPHATE CARBOXYLASE SMALL CHAIN 1A)
CAB1 (CHLOROPHYLL A/B BINDING PROTEIN 1); chlorophyll binding
LHCB2.3; chlorophyll binding
LHB1B2; chlorophyll binding
Quantification of different sources contributing to redundancy between unigenes; unigenes were mapped against 14 cDNA reference sequences of P. sativum: orthologs of the ten most frequent AGIs (Table 2), five known single copy genes from P. sativum, and four genes encoding Mendelian traits; the original alignments can be viewed by loading additional files 4 and 5 into an alignment viewer such as Tablet.
corresponding pea cDNA
length in bases
# reads mapped
putative SNP positions
putative sequencing error positions
While the transcripts were well covered, the transcriptome coverage was limited
Properties of the different libraries; a and b, fitting parameters of the equation y = ax/(b+x) with 'a' representing the AGI (Arabidopsis genome identifier) detections maximally possible (for a detailed description please see material and methods)
total AGIs detected
slope at final read count
Since leaf libraries were analyzed with different sequencing technologies, GS20 and GS FLX, and with and without normalization, the effect of each factor on the transcript coverage could be assessed. In the normalized leaf libraries the detection of between 12,718 and 13,715 genes was the upper limit whereas in the non-normalized library only 10,087 AGIs could be identified based on the model and unlimited sequencing. Theoretically it should be possible to detect ESTs for lowly expressed genes even from a non-normalized library given unlimited sample sizes. Nevertheless the data clearly showed that with comparable library sizes the number of tagged genes was significantly increased in normalized libraries. The reason was probably that sequencing libraries were large but not unlimited. Hence very abundant leaf transcripts out-competed transcripts of low abundance for "sequencing space" in the library leading to the lower number of AGIs that could be identified in the non-normalized library. Possibly, cDNA synthesis primers got depleted by very abundant messages in non-normalized libraries, thereby leading to the suppression of less abundant transcripts in the sequencing library. GS20 sequenced libraries and GS FLX sequenced libraries yielded similar numbers of possible AGIs tagged by at least one read. Based on the mathematical analysis the majority of sequencing libraries were sequenced to exhaustion (Table 4).
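The two quantities reported for each library, the upper limit 'a' and the slope at the final read count, follow directly from the saturation model y = ax/(b+x) given in the table caption. As a short worked derivation (the interpretation of b as the half-saturation read count is standard for this functional form and is not stated in the text):

```latex
y(x) = \frac{a\,x}{b + x}, \qquad
\lim_{x \to \infty} y(x) = a, \qquad
y(b) = \frac{a}{2}

\frac{dy}{dx} = \frac{a\,(b + x) - a\,x}{(b + x)^{2}} = \frac{a\,b}{(b + x)^{2}}
```

The slope therefore decreases monotonically with the number of reads x and approaches zero as the number of detected AGIs approaches a.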
It was more difficult to estimate the number of bases covered in relation to all bases of the complete pea transcriptome (in other words the 'transcriptome coverage') since the genome of the garden pea has not been sequenced yet. To approximate the transcriptome coverage, the number of AGIs tagged by multiple reads was tested. Most transcripts will be longer than one read; thus coverage with multiple reads is required for complete base coverage of most transcripts. Reads were again drawn from the read pool of a specific library or of the combination of all libraries, and the number of AGIs identified by 5, 10 and 100 reads, respectively, was recorded. In the combined library, a considerable slope remained at the final read number, indicating that the respective fold coverage was reached only for a subpopulation of AGIs (Figure 3A). Since a subpopulation of AGIs was hit by only one read, the unigenes resulting from this subpopulation were expected to remain short; i.e., the coverage did not suffice for assembly despite the large number of sequenced reads (Table 1). In an alternative approach, BlastX combined with in-house Python scripts was used to determine the total sequence coverage for all tagged Arabidopsis proteins. The database of Medicago coding sequences covered 35% of the Arabidopsis proteome; the database of G. max coding sequences covered 49% of the Arabidopsis proteome (based on BlastX, e-value ≤ 10^-4). The pea unigene collection from this database covered 31% of the tagged AGIs. A simulation based on a fragmented Arabidopsis transcriptome estimates that with 450 Mb of bases, the complete transcriptome ought to be covered. Since the percentage of the pea database mapping remained below both other legume mappings (31% compared to 35% and 49%, respectively), the approach based on BlastX supported the results of the tagging analysis: both analyses indicated that the transcriptome coverage was not complete.
The gaps resulting from incomplete coverage probably precluded complete assembly, and many short unigenes persisted (Figure 1). The missing coverage was not located at either the 5' or 3' end of the transcripts. The sequence read population was tested for 5' and 3' bias since the libraries were created with poly d(T) priming. The results indicated only little bias against the 5' end compared to the 3' end of the coding sequence (Additional file 8). Based on the transcript and transcriptome coverage, the number of sequence reads was sufficient to tag the majority of transcripts but not to assemble the transcriptome to completeness.
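The in-house scripts used for the BlastX-based coverage estimate are not described in detail. As a minimal sketch of how such a value could be computed, the following assumes that each BlastX hit has already been reduced to a tuple of protein identifier, subject start and subject end (1-based, inclusive), and that coverage is the fraction of residues covered by at least one hit, summed over all tagged proteins. The function names and the input layout are hypothetical and not taken from the original analysis.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def tagged_protein_coverage(hits, protein_lengths):
    """Fraction of residues covered by at least one hit, over all tagged proteins.

    hits: iterable of (protein_id, subject_start, subject_end) tuples (assumed layout).
    protein_lengths: dict mapping protein_id to protein length in residues.
    Proteins without any hit are ignored, i.e. only tagged proteins are counted.
    """
    per_protein = defaultdict(list)
    for protein_id, start, end in hits:
        per_protein[protein_id].append((min(start, end), max(start, end)))

    covered = 0
    total = 0
    for protein_id, intervals in per_protein.items():
        covered += sum(end - start + 1 for start, end in merge_intervals(intervals))
        total += protein_lengths[protein_id]
    return covered / total if total else 0.0
```

Merging the intervals before summing avoids counting residues twice when several unigenes hit the same region of a protein, which matters here because of the high redundancy between unigenes.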
Normalization has unexpected effects on the sequence recovery
Correlation coefficients between expression profiles of the different normalized leaf libraries and the non-normalized leaf library; the expression was determined as the number of reads mapping to an AGI; R1 non normalized library LNR
Spearman's rank coefficient
R1 vs. N1
R1 vs. N2
R1 vs. N3
R1 vs. N4
R1 vs. N5
N1 vs. N2
N1 vs. N3
N1 vs. N4
N1 vs. N5
N2 vs. N3
N2 vs. N4
N2 vs. N5
N3 vs. N4
N3 vs. N5
N4 vs. N5
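The correlation table above compares expression profiles, defined as the number of reads mapping to each AGI, between pairs of libraries. A minimal sketch of Spearman's rank coefficient for two such profiles follows. It assumes that profiles are dictionaries of read counts per AGI and that an AGI missing from one library counts as zero reads; both are assumptions of this illustration, and the software actually used for the table is not named in the text.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(counts_a, counts_b):
    """Spearman's rank coefficient between two {AGI: read count} profiles."""
    agis = sorted(set(counts_a) | set(counts_b))
    x = [counts_a.get(agi, 0) for agi in agis]  # missing AGI treated as 0 reads
    y = [counts_b.get(agi, 0) for agi in agis]
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, the coefficient is insensitive to the compression of abundance differences that normalization is designed to produce, as long as the order of expression strengths is retained.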
The different libraries had different pathway representations
Taken together, the pathway representation analysis validated the strategy to sample multiple aerial tissues to achieve good coverage of pathways. Each of the single libraries contributed sequences corresponding to 8515 to 13298 AGIs, representing 50% to 78% of the 17104 AGIs identified by the union of all libraries (Table 4).
The application of next generation sequencing to the transcriptome of Pisum sativum, the garden pea, resulted in 450 Megabases of transcriptome sequence derived from above-ground organs of pea. Comparison to Arabidopsis and mathematical analysis showed that the transcript coverage was near complete and revealed the effects of normalization on sequence yield and gene content. The pathway representation analysis also showed that the different libraries used for sequencing have different pathway signatures, which fit biological expectations. Based on the analysis of the transcriptome resources the pea can now be treated as a biochemical model plant with near complete transcript coverage regarding different aerial tissues. The usefulness of the database in a preliminary form has already been demonstrated for organelle proteomics [5, 24]. Although the data was assembled with programs used by other transcriptome NGS projects [16–20], the assembly itself was revealed to be a major bottleneck. 454 sequencing technology followed by read quantification and read assembly has been used successfully not only for the study of non-model legumes but also for the analysis of the C4 syndrome [46, 63].
Completing the transcriptome coverage for pea will require not only sequencing of libraries derived from below ground organs and various stages of seed development but also improvement in assembly technology.
Plant material and treatment
Pea seeds of the variety 'Little Marvel' were purchased from a commercial supplier, sown in soil and grown under cool light fluorescent lamps for two weeks. All cDNA libraries reported in this study were made from plants grown from the same commercial batch of seeds. Green leaves and flowers were harvested and plunged immediately into liquid nitrogen. Yellow etiolated leaves were harvested as above from plants grown in the dark. De-etiolated leaves were harvested from the dark grown plants after exposure to the cool fluorescent lamps for 6 hours. The epicotyls and hypocotyls were harvested from 6-day-old pea seedlings germinated on moist filter paper in the dark. Cotyledons were harvested after removing epicotyls and hypocotyls.
Total RNA extraction and synthesis of double stranded cDNA were performed as described previously. Briefly, one gram of pea plant sample was ground with a mortar and pestle in liquid nitrogen, and total RNA was extracted in guanidinium thiocyanate-phenol-chloroform mixture and pelleted, followed by two washes of the RNA pellet with 3 M sodium acetate (pH 6.0). The quality of the isolated RNA was analyzed using formaldehyde agarose gel electrophoresis and the Agilent 2100 Bioanalyzer RNA chip (Agilent Technologies, CA). The mRNA was isolated using the PolyATract mRNA isolation system (Promega, WI) and concentrated by precipitation with ethanol.
cDNA library preparation
The preparation of cDNA libraries was conducted as described previously. The cDNA was synthesized using the Smart PCR cDNA synthesis kit according to the manufacturer's instructions (Clontech, CA) using 1 μg mRNA. Double-stranded cDNA was prepared from 2 μL of the first-strand reaction by PCR (13 cycles). The cDNA was purified using QIAquick PCR purification spin columns (Qiagen) and was checked for purity and degradation using the Agilent 2100 Bioanalyzer DNA chip.
Normalization of cDNA library
Some cDNA libraries were normalized to decrease the amount of highly abundant transcripts. To this end, 1 μg of double-stranded cDNA was normalized using a commercial kit (Trimmer-kit, Evrogen, Moscow, Russia) that is based on Kamchatka crab duplex-specific nuclease. The normalization efficiency was analyzed by quantitative PCR using primers for Rubisco small subunit (highly redundant) and CP12 (moderately redundant).
Following normalization, the double-stranded cDNA was PCR amplified and quality checked with agarose gels and the Agilent Bioanalyzer DNA chip, and 3 μg of normalized cDNA was used for sequencing with either a GS 20 or a GS FLX sequencer (Roche/454 Life Sciences, CT).
EST pre-processing and assembly
Sequence reads obtained from the 454 software were cleaned of cDNA primer contamination using crossmatch; an in-house Python script then clipped the masked contaminations, discarded reads shorter than 50 nts and adjusted the quality files accordingly. The cleaned reads were assembled with 1,198 partial and complete coding sequences from Pisum sativum downloaded from NCBI in a hybrid assembly using MIRA [25, 29]. MIRA parameters used were: de novo, est, accurate, sanger, 454 including polyA/T clipping for 454 ESTs. Besides singletons that MIRA evaluated as high confidence singletons, MIRA discards many singletons during the assembly to a separate debris file. Of these debris singletons, all with a significant BlastX hit against the A. thaliana proteome (TAIR9) were added to subsequent analyses.
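The clipping script itself is not published with the text. The sketch below illustrates one way the described step could work, under assumptions that are not stated in the original: masked primer bases are marked with the character X, the longest unmasked stretch of each read is kept, and quality values are trimmed to the same coordinates. The function name and the choice to keep the longest stretch are hypothetical; only the 50 nt minimum length is taken from the text.

```python
import re

MIN_LENGTH = 50  # reads shorter than 50 nts were discarded (from the text)


def clip_masked_read(sequence, qualities, mask_char="X", min_length=MIN_LENGTH):
    """Keep the longest unmasked stretch of a read and the matching qualities.

    sequence: read with masked bases replaced by mask_char (assumed convention).
    qualities: list of per-base quality values, same length as sequence.
    Returns (clipped_sequence, clipped_qualities), or None if the read is
    shorter than min_length after clipping.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities differ in length")

    best = None
    for match in re.finditer(f"[^{re.escape(mask_char)}]+", sequence):
        if best is None or match.end() - match.start() > best.end() - best.start():
            best = match

    if best is None or best.end() - best.start() < min_length:
        return None
    return sequence[best.start():best.end()], qualities[best.start():best.end()]
```

Clipping sequence and qualities with the same coordinates keeps the quality file consistent with the sequence file, which is the adjustment the text refers to.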
A second pass assembly using all the 324,428 unigenes obtained by the previous MIRA assembly was performed with the TGICL clustering and assembly pipeline including CAP3 [31, 32]. Both programs were run with default parameter settings, requiring 94% sequence identity, a minimum of 40 nucleotides overlap and a maximal overhang of 30 nucleotides for assembly. As the largest three clusters that were produced by TGICL could not be assembled with CAP3 due to memory limitations, they were additionally preclustered and assembled with scripts provided within the TGICL pipeline with default parameter settings.
The unigenes were annotated by queries against the proteomes of A. thaliana (TAIR9), M. truncatula (version 3.0), G. max (version 1.0) [50, 51, 65] and the non-redundant protein database from NCBI using BlastX (e-value ≤ 10^-4). Functional annotation of the unigenes was done with MapMan categories and gene ontology terms via the AGI annotation.
Quantification of different sources contributing to redundancy between unigenes
Orthologous cDNA sequences from P. sativum to the most frequent AGIs used for the annotation of the first pass MIRA unigenes (Table 2) were retrieved from NCBI. This yielded four complete and one partial coding sequence: (Lhcb1) light-harvesting chlorophyll-a/b binding protein, (Lhcb2) light-harvesting chlorophyll-a/b binding protein, (RuBP) ribulose 1,5 bisphosphate carboxylase, (Ran1) Ran1 and (Ccbp) chloroplast chlorophyll-a/b binding protein. Five genes published as single copy genes from pea [36–40] and four genes encoding Mendelian traits [41–44] were added to the analysis. All 324,428 unigenes from the first pass MIRA assembly were mapped against the retrieved reference sequences using the BWA-SW aligner. The column-wise information of the alignments was read out employing SAMtools and custom written Python scripts. For each alignment the number of alignment positions (columns) with the following characteristics was read out: (i) identical nucleotides for all unigenes at that position, (ii) coverage by at least 25 identical variants (identical point mutations) within the unigenes, (iii) coverage by at least 4 insertions/deletions of a nucleotide length divisible by three and (iv) coverage by at least one variant (point mutation or insertion/deletion of a length not divisible by three) but not exceeding the required threshold of 25 identical variants.
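The four column categories can be illustrated with a small classifier. This is a sketch only: the custom scripts are not available in the text, so the input layout (the reference base, the bases observed in the aligned unigenes, and the lengths of insertions/deletions at the column), the comparison against the reference base, the order in which the rules are checked and the extra "unclassified" label are all assumptions made for this example. Only the thresholds of 25 identical variants and 4 in-frame insertions/deletions come from the description above.

```python
from collections import Counter

MIN_IDENTICAL_VARIANTS = 25  # threshold for rule (ii), from the text
MIN_IN_FRAME_INDELS = 4      # threshold for rule (iii), from the text


def classify_column(ref_base, bases, indel_lengths=()):
    """Assign one alignment column to one of the four described categories.

    ref_base: nucleotide of the reference cDNA at this column.
    bases: nucleotides of all unigenes covering this column.
    indel_lengths: lengths of insertions/deletions observed at this column.
    """
    mismatches = Counter(base for base in bases if base != ref_base)
    in_frame = [n for n in indel_lengths if n % 3 == 0]
    out_of_frame = [n for n in indel_lengths if n % 3 != 0]

    if not mismatches and not indel_lengths:
        return "i_identical"
    if mismatches and max(mismatches.values()) >= MIN_IDENTICAL_VARIANTS:
        return "ii_frequent_variant"
    if len(in_frame) >= MIN_IN_FRAME_INDELS:
        return "iii_in_frame_indel"
    if mismatches or out_of_frame:
        return "iv_rare_variant"
    # Fewer than four in-frame indels and nothing else: not covered by rules (i)-(iv).
    return "unclassified"
```

For example, a column where 25 of 30 unigenes carry the same substitution falls under rule (ii), whereas a column with two scattered substitutions falls under rule (iv).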
Mathematical analysis of library completeness
To test the transcript coverage of reads from a specific library or from the combination of all libraries, a mathematical analysis similar to rarefaction analysis was employed. For that purpose a read pool was defined, i.e. all reads obtained from one library or all reads from all libraries combined. From such a read pool reads were randomly drawn to create different sets of reads with increasing sample sizes. For each of those sets the number of Arabidopsis genes that could be identified by the reads within the set was recorded together with the size of the read set. This process was automated in a Python script. The identification of the AGI by a read was done via the MIRA unigenes and their best BlastX hit against the Arabidopsis proteome (TAIR9). The numbers of identified Arabidopsis genes were plotted against the corresponding sample sizes and the data points were fitted by non-linear regression with the model y = ax/(b+x) (SigmaPlot software, Systat Software Inc - Scientific Software Products). If the sequencing of transcripts from a particular tissue was exhaustive, the resulting curve was expected to "saturate", i.e. to converge towards a fixed value, with parameter "a" in the model function indicating an upper limit for gene detection. This "saturation" was also represented by a decreasing slope at higher sample sizes, which indicated decreasing potential to detect additional Arabidopsis genes when sampling further from the defined read pool. An identical approach was taken to approximate the transcriptome coverage, with the difference that instead of the number of all identified AGIs, the number of AGIs identified by at least 5, 10 and 100 reads was recorded.
All enrichment tests were performed based on the A. thaliana annotation of the unigenes. Enrichment of MapMan categories was tested with Fisher's exact test using the PageMan application, and multiple testing was corrected for by FDR (Benjamini-Hochberg). The gene test sets always consisted of all Arabidopsis genes identified by the unigenes present in that library/tissue type. As the reference gene set for the analysis, all Arabidopsis genes identified with reads of any library relevant to a specific question were chosen. These were the following library combinations: union of all genes in all normalized and non-normalized leaf libraries (Figure 4), union of all genes in libraries E and L (Figure 5) and union of all genes from all obtained libraries (Figure 6).
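The resampling and fitting procedure can be sketched as follows. The original analysis used its own Python script for sampling and non-linear regression in SigmaPlot; neither is reproduced here. This illustration assumes that the read pool is a list with one AGI (or None for reads without an AGI) per read, and it estimates a and b with a Hanes-Woolf linearisation, x/y = x/a + b/a, fitted by ordinary least squares. That is a simplified substitute for non-linear regression and weights the data points differently. All function names are hypothetical.

```python
import random


def rarefaction_curve(read_to_agi, sample_sizes, min_reads=1, seed=0):
    """Count AGIs detected by at least min_reads reads in random subsamples.

    read_to_agi: list with one entry per read, the AGI it maps to or None.
    sample_sizes: subsample sizes, each not larger than the read pool.
    Returns a list of (sample_size, number_of_AGIs) points.
    """
    rng = random.Random(seed)
    points = []
    for size in sample_sizes:
        counts = {}
        for agi in rng.sample(read_to_agi, size):  # draw without replacement
            if agi is not None:
                counts[agi] = counts.get(agi, 0) + 1
        detected = sum(1 for c in counts.values() if c >= min_reads)
        points.append((size, detected))
    return points


def fit_saturation(points):
    """Estimate a and b of y = a*x/(b+x) from the linear form x/y = x/a + b/a."""
    xs = [x for x, y in points if y > 0]
    zs = [x / y for x, y in points if y > 0]
    n = len(xs)
    mean_x, mean_z = sum(xs) / n, sum(zs) / n
    slope = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_z - slope * mean_x
    a = 1.0 / slope       # upper limit of detectable AGIs
    b = intercept * a     # read count at which a/2 AGIs are detected
    return a, b
```

Setting min_reads to 5, 10 or 100 gives the variant of the analysis used to approximate transcriptome coverage.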
Enrichment of GO terms was tested with Fisher's exact test, using the Bioconductor package topGO version 0.9.7 [58, 59]. The weight01 algorithm of topGO was used accounting for local dependencies within the graph structure of the gene ontology.
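The statistical core of the enrichment tests, Fisher's exact test followed by Benjamini-Hochberg correction, can be sketched independently of PageMan and topGO. The example below is not the implementation of either tool: it assumes a one-sided test for over-representation, computed from the hypergeometric distribution, and it does not include the weight01 algorithm of topGO. Function and parameter names are chosen for this illustration.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: genes in the reference set, K: reference genes in the category,
    n: genes in the test set, k: test-set genes in the category.
    Returns P(X >= k) for a hypergeometric X.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

In this setting N would be the number of Arabidopsis genes in the chosen reference set and n the number of genes identified in a single library; categories with an adjusted p-value below the chosen FDR threshold would be reported as enriched.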
Accession numbers of raw data
The sequence read data reported in this manuscript have been deposited in the NCBI Sequence Read Archive and are available under the Accession Number [NCBI:SRA031288]. The initial MIRA assembly reported in this manuscript has been modified according to NCBI guidelines and deposited in the NCBI Transcriptome Shotgun Assembly Archive and is available under the Accession Number [JI896856 - JI981123].
We acknowledge expert support in 454/Roche sequencing by the RTSF at Michigan State University. This project was supported by AFGN-DFG to APMW (WE 2231/4-1) and by the Volkswagen Foundation to SUF.
- Cronk Q, Ojeda I, Pennington RT: Legume comparative genomics: progress in phylogenetics and phylogenomics. Curr Opin Plant Biol. 2006, 9 (2): 99-103. 10.1016/j.pbi.2006.01.011.PubMedView Article
- Kew Royal Botanical Gardens. [http://www.kew.org/cval/]
- Kalo P, Seres A, Taylor SA, Jakab J, Kevei Z, Kereszt A, Endre G, Ellis THN, Kiss GB: Comparative mapping between Medicago sativa and Pisum sativum. Mol Genet Genomics. 2004, 272 (3): 235-246. 10.1007/s00438-004-1055-z.PubMedView Article
- Macas J, Neumann P, Navratilova A: Repetitive DNA in the pea (Pisum sativum L.) genome: comprehensive characterization using 454 sequencing and comparison to soybean and Medicago truncatula. BMC Genomics. 2007, 8:
- Bräutigam A, Hoffmann-Benning S, Weber APM: Comparative Proteomics of Chloroplast Envelopes from C3 and C4 Plants Reveals Specific Adaptations of the Plastid Envelope to C4 Photosynthesis and Candidate Proteins Required for Maintaining C4 Metabolite Fluxes. Plant Physiol. 2008, 148 (1): 568-579. 10.1104/pp.108.121012.PubMed CentralPubMedView Article
- Lloyd JR, Kossmann J, Ritte G: Leaf starch degradation comes out of the shadows. Trends Plant Sci. 2005, 10 (3): 130-137. 10.1016/j.tplants.2005.01.001.PubMedView Article
- Smith AM, Zeeman SC, Smith SM: Starch degradation. Ann Rev Plant Biol. 2005, 56: 73-98. 10.1146/annurev.arplant.56.032604.144257.View Article
- Lu Y, Sharkey TD: The importance of maltose in transitory starch breakdown. Plant Cell Environ. 2006, 29 (3): 353-366. 10.1111/j.1365-3040.2005.01480.x.PubMedView Article
- Pohlmeyer K, Soll J, Grimm R, Hill K, Wagner R: A High-Conductance Solute Channel in the Chloroplastic Outer Envelope from Pea. Plant Cell. 1998, 10 (7): 1207-1216.PubMed CentralPubMedView Article
- Stitt M: Nitrate regulation of metabolism and growth. Curr Opin Plant Biol. 1999, 2 (3): 178-186. 10.1016/S1369-5266(99)80033-8.PubMedView Article
- Tobin AK, Bowsher CG: Nitrogen and carbon metabolism in plastids: Evolution, integration, and coordination with reactions in the cytosol. Advances In Botanical Research. 2005, London: Academic Press Ltd, 42: 113-165.
- The Arabidopsis Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408 (6814): 796-815. 10.1038/35048692.View Article
- National center for biotechnology information. [http://www.ncbi.nlm.nih.gov]
- DFCI Plant Gene Indices. [http://compbio.dfci.harvard.edu/tgi/plant.html]
- Majeran W, Zybailov B, Ytterberg AJ, Dunsmore J, Sun Q, van Wijk KJ: Consequences of C4 differentiation for chloroplast membrane proteomes in maize mesophyll and bundle sheath cells. Mol Cell Proteomics. 2008, 7: 1609-1638. 10.1074/mcp.M800016-MCP200.PubMed CentralPubMedView Article
- Dassanayake M, Haas JS, Bohnert HJ, Cheeseman JM: Shedding light on an extremophile lifestyle through transcriptomics. New Phytol. 2009, 183 (3): 764-775. 10.1111/j.1469-8137.2009.02913.x.PubMedView Article
- Novaes E, Drost DR, Farmerie WG, Pappas GJ, Grattapaglia D, Sederoff RR, Kirst M: High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics. 2008, 9: 14-10.1186/1471-2164-9-14.View Article
- Alagna F, D'Agostino N, Torchia L, Servili M, Rao R, Pietrella M, Giuliano G, Chiusano ML, Baldoni L, Perrotta G: Comparative 454 pyrosequencing of transcripts from two olive genotypes during fruit development. BMC Genomics. 2009, 10: 15-10.1186/1471-2164-10-15.View Article
- Barakat A, DiLoreto DS, Zhang Y, Smith C, Baier K, Powell WA, Wheeler N, Sederoff R, Carlson JE: Comparison of the transcriptomes of American chestnut (Castanea dentata) and Chinese chestnut (Castanea mollissima) in response to the chestnut blight infection. BMC Plant Biology. 2009, 9: 11-10.1186/1471-2229-9-11.View Article
- Wang W, Wang YJ, Zhang Q, Qi Y, Guo DJ: Global characterization of Artemisia annua glandular trichome transcriptome using 454 pyrosequencing. BMC Genomics. 2009, 10: 10-10.1186/1471-2164-10-10.View Article
- Metzker ML: APPLICATIONS OF NEXT-GENERATION SEQUENCING Sequencing technologies - the next generation. Nat Rev Gen. 2010, 11 (1): 31-46. 10.1038/nrg2626.View Article
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Gen. 2009, 10 (1): 57-63. 10.1038/nrg2484.View Article
- 454 Life Sciences - A Roche company. [http://www.454.com/]
- Bräutigam A, Shrestha RP, Whitten D, Wilkerson CG, Carr KM, Froehlich JE, Weber APM: Comparison of the use of a species-specific database generated by pyrosequencing with databases from related species for proteome analysis of pea chloroplast envelopes. Journal of Biotechnology. 2008, 136 (1): 44-53. 10.1016/j.jbiotec.2008.02.007.PubMedView Article
- Website of Chevreux. [http://www.chevreux.org/projects_mira.html]
- Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralPubMedView Article
- Bräutigam A, Mullick T, Schliesky S, Weber APM: Critical assessment of assembly strategies for non-model species mRNA-Seq data and application of next-generation sequencing to the comparison of C3 and C4 species. J Exp Biol. 2011
- Phred, Phrap and Consed. [http://www.phrap.org/phredphrapconsed.html]
- Chevreux B: PhD Thesis: MIRA: An Automated Genome and EST Assembler. 2006
- Emrich SJ, Barbazuk WB, Li L, Schnable PS: Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 2007, 17 (1): 69-73.
- Huang XQ, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.
- Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee YH, White J, Cheung F, Parvizi B, et al: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19 (5): 651-652. 10.1093/bioinformatics/btg034.
- SOAP::Short Oligonucleotide Assembly Package. [http://soap.genomics.org.cn/soapdenovo.html]
- Jing R, Johnson R, Seres A, Kiss G, Ambrose MJ, Knox MR, Ellis THN, Flavell AJ: Gene-based sequence diversity analysis of field pea (Pisum). Genetics. 2007, 177 (4): 2263-2275. 10.1534/genetics.107.081323.
- Blanc G, Wolfe KH: Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell. 2004, 16 (7): 1667-1678. 10.1105/tpc.021345.
- Davidson SE, Smith JJ, Helliwell CA, Poole AT, Reid JB: The pea gene LH encodes ent-kaurene oxidase. Plant Physiol. 2004, 134 (3): 1123-1134. 10.1104/pp.103.032706.
- Gupta R, Webster CI, Gray JC: The single-copy gene encoding high-mobility-group protein HMG-I/Y from pea contains a single intron and is expressed in all organs. Plant Mol Biol. 1997, 35 (6): 987-992. 10.1023/A:1005890012230.
- Last DI, Gray JC: Plastocyanin is encoded by a single-copy gene in the pea haploid genome. Plant Mol Biol. 1989, 12 (6): 655-666. 10.1007/BF00044156.
- Elliott RC, Pedersen TJ, Fristensky B, White MJ, Dickey LF, Thompson WF: Characterization of a single copy gene encoding ferredoxin-I from pea. Plant Cell. 1989, 1 (7): 681-690.
- Mittler R, Zilinskas BA: Molecular-cloning and characterization of a gene encoding pea cytosolic ascorbate peroxidase. J Biol Chem. 1992, 267 (30): 21802-21807.
- Burton RA, Bewley JD, Smith AM, Bhattacharyya MK, Tatge H, Ring S, Bull V, Hamilton WDO, Martin C: Starch branching enzymes belonging to distinct enzyme families are differentially expressed during pea embryo development. Plant J. 1995, 7 (1): 3-15. 10.1046/j.1365-313X.1995.07010003.x.
- Martin DN, Proebsting WM, Hedden P: Mendel's dwarfing gene: cDNAs from the Le alleles and function of the expressed proteins. Proc Natl Acad Sci USA. 1997, 94 (16): 8907-8911. 10.1073/pnas.94.16.8907.
- Lester DR, Ross JJ, Davies PJ, Reid JB: Mendel's stem length gene (Le) encodes a gibberellin 3 beta-hydroxylase. Plant Cell. 1997, 9 (8): 1435-1443.
- Hellens RP, Moreau C, Lin-Wang K, Schwinn KE, Thomson SJ, Fiers M, Frew TJ, Murray SR, Hofer JMI, Jacobs JME: Identification of Mendel's White Flower Character. PLoS One. 2010, 5 (10):
- SCRI living technology: tablet. [http://bioinf.scri.ac.uk/tablet/]
- Bräutigam A, Mullick T, Schliesky S, Weber APM: Critical assessment of assembly strategies for non-model species mRNA-Seq data and application of next-generation sequencing to the comparison of C3 and C4 species. J Exp Bot. 2011
- Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al: The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucl Acids Res. 2008, 36: D1009-D1014.
- Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D, Lohmann JU: A gene expression map of Arabidopsis thaliana development. Nature Genet. 2005, 37 (5): 501-506. 10.1038/ng1543.
- Wang HC, Moore MJ, Soltis PS, Bell CD, Brockington SF, Alexandre R, Davis CC, Latvis M, Manchester SR, Soltis DE: Rosid radiation and the rapid rise of angiosperm-dominated forests. Proc Natl Acad Sci USA. 2009, 106 (10): 3853-3858. 10.1073/pnas.0813376106.
- Medicago truncatula. [http://medicago.org]
- Glycine max. [ftp://ftp.jgi-psf.org/pub/JGI_data/Glycine_max]
- Wall PK, Leebens-Mack J, Chanderbali AS, Barakat A, Wolcott E, Liang HY, Landherr L, Tomsho LP, Hu Y, Carlson JE, et al: Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics. 2009, 10:
- Usadel B, Nagel A, Steinhauser D, Gibon Y, Blasing OE, Redestig H, Sreenivasulu N, Krall L, Hannah MA, Poree F, et al: PageMan: An interactive ontology tool to generate, display, and annotate overview graphs for profiling experiments. BMC Bioinformatics. 2006, 7: 8. 10.1186/1471-2105-7-8.
- Thimm O, Blasing O, Gibon Y, Nagel A, Meyer S, Kruger P, Selbig J, Muller LA, Rhee SY, Stitt M: MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J. 2004, 37 (6): 914-939. 10.1111/j.1365-313X.2004.02016.x.
- Czechowski T, Bari RP, Stitt M, Scheible WR, Udvardi MK: Real-time RT-PCR profiling of over 1400 Arabidopsis transcription factors: unprecedented sensitivity reveals novel root- and shoot-specific genes. Plant J. 2004, 38 (2): 366-379. 10.1111/j.1365-313X.2004.02051.x.
- Lister R, Gregory BD, Ecker JR: Next is now: new technologies for sequencing of genomes, transcriptomes, and beyond. Curr Opin Plant Biol. 2009, 12 (2): 107-118. 10.1016/j.pbi.2008.11.004.
- Palmieri N, Schlotterer C: Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. PLoS ONE. 2009, 4 (7).
- Alexa A, Rahnenfuhrer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006, 22 (13): 1600-1607. 10.1093/bioinformatics/btl140.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene Ontology: tool for the unification of biology. Nature Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- von Arnim A, Deng XW: Light control of seedling development. Annu Rev Plant Physiol Plant Mol Biol. 1996, 47: 215-243. 10.1146/annurev.arplant.47.1.215.
- Ma LG, Li JM, Qu LJ, Hager J, Chen ZL, Zhao HY, Deng XW: Light control of Arabidopsis development entails coordinated regulation of genome expression and cellular pathways. Plant Cell. 2001, 13 (12): 2589-2607.
- Muntz K: Proteases and proteolytic cleavage of storage proteins in developing and germinating dicotyledonous seeds. J Exp Bot. 1996, 47 (298): 605-622.
- Bräutigam A, Kajala K, Wullenweber J, Sommer M, Gagneul D, Weber KL, Carr KM, Gowik U, Mass J, Lercher MJ, et al: An mRNA blueprint for C4 photosynthesis derived from comparative transcriptomics of closely related C3 and C4 species. Plant Physiol. 2011, 155: 142-156. 10.1104/pp.110.159442.
- Weber APM, Weber KL, Carr K, Wilkerson C, Ohlrogge JB: Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 2007, 144 (1): 32-42. 10.1104/pp.107.096677.
- Arabidopsis thaliana. [http://www.arabidopsis.org]
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595. 10.1093/bioinformatics/btp698.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.