Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology
© Cheung et al. 2006
Received: 08 August 2006
Accepted: 24 October 2006
Published: 24 October 2006
In this study, we addressed whether a single 454 Life Sciences GS20 sequencing run provides new gene discovery from a normalized cDNA library, and whether the short reads produced via this technology are of value in gene structure annotation.
A single 454 GS20 sequencing run on adapter-ligated cDNA from a normalized cDNA library generated 292,465 reads, which were reduced to 252,384 reads with an average read length of 92 nucleotides after cleaning. After clustering and assembly, a total of 184,599 unique sequences were generated, containing over 400 SSRs. The 454 sequences generated hits to more genes than a comparable amount of sequence from MtGI. Although short, the 454 reads are of sufficient length to map to a unique genome location as effectively as longer ESTs produced by conventional sequencing. Functional interpretation of the sequences was carried out by Gene Ontology assignments from matches to Arabidopsis and was shown to cover a broad range of GO categories. 53,796 assemblies and singletons (29%) had no match in the existing MtGI. Within the previously unobserved Medicago transcripts, thousands had matches in a comprehensive protein database and one or more of the TIGR Plant Gene Indices. Approximately 20% of these novel sequences could be found in the Medicago genome sequence. A total of 70,026 reads generated by the 454 technology were mapped to 785 Medicago finished BACs using PASA, and over 1,000 gene models required modification. In parallel to 454 sequencing, 4,445 5' reads were generated by conventional sequencing from the same library; the assembled sequences showed that the library contained about 52% full-length cDNAs encoding proteins from 50 to over 500 amino acids in length.
Due to the large number of reads afforded by the 454 DNA sequencing technology, it is effective in revealing the expression of transcripts from a broad range of GO categories and of many rare transcripts in normalized cDNA libraries, although only a limited portion of their sequence is uncovered. As with longer ESTs, 454 reads can be mapped uniquely onto genomic sequence to provide support for, and modifications of, gene predictions.
In genome projects such as the Medicago Sequencing Initiative, Expressed Sequence Tags (ESTs) are of great value for genome annotation because they provide evidence of expression of predicted genes and, by spliced alignment to genomic DNA, they can provide support for gene structures. In instances where genome sequence is not available, EST sequencing provides a first catalog of a species' gene inventory. Combinations of library normalization and deep sequencing are used to maximize gene discovery. Here, we examine whether the deep sequencing made cost-effective by pyrosequencing technology leads to significant new gene discovery in a normalized cDNA library, and whether the short reads produced are of value in gene structure annotation and SSR identification.
454 Life Sciences Corporation, Branford, CT, has developed the first DNA pyrosequencing platform to employ picoliter volumes in a highly multiplexed, flow-through array capable of producing 20–40 million bases per run. Sequencing is performed on randomly fragmented DNA using microbead-based pyrosequencing chemistry. This technology enables sequence data generation for large-genome organisms that was previously inaccessible with conventional sequencing platforms because of prohibitive cost and throughput limitations. The usefulness of 454-derived sequences was assessed with the specific goal of identifying new genes and improving gene predictions.
ESTs are single-pass, partial sequences from cDNA clones that provide a rapid and cost-effective method to analyze transcribed portions of the genome while avoiding the non-coding and repetitive DNA that can make up much of the genome of some crop plants. EST sequencing has been shown to accelerate gene discovery, including gene family identification, large-scale expression analysis [4, 5], establishing phylogenetic relationships, developing PCR-based molecular markers, and identifying simple sequence repeats and single nucleotide polymorphisms [9, 10]. In both of the finished plant genomes, Arabidopsis thaliana and Oryza sativa, expressed sequences (ESTs and cDNAs) have been invaluable [11, 12] in defining the correct components of gene structure, with spliced alignments of transcript sequences resolving partial or complete exons, splice sites, and, in the case of full-length cDNAs, complete gene structures. EST assemblies generated using PASA (Program to Assemble Spliced Alignments) were shown to allow the automated modeling of novel genes and more than 1,000 alternative splicing variations, as well as updates (including UTR annotations) to nearly half of the ~27,000 annotated protein-coding genes in the Arabidopsis genome. Since experimental biologists are more interested in the reliability of individual predictions than in the average performance of gene predictions, it is important to have an extensive EST collection to support the quality of individual gene models. Although large numbers of ESTs have been generated for many species including barley, rice, maize, sorghum, soybean, Medicago and wheat, characterizations of the transcriptomes of these species are likely far from complete.
Here, we examine whether the deep sequencing made cost-effective by the 454 technology leads to significant new gene discovery in a cDNA library by comparisons with The Institute for Genomic Research (TIGR) Medicago truncatula Gene Index (MtGI) containing 226,923 high-quality ESTs. We also investigate whether the short reads produced are of value in gene structure annotation by comparisons to high-quality automated gene predictions generated by the International Medicago Genome Annotation Group (IMGAG).
The overall objective of this study was to determine if the 454 technology leads to significant new gene/transcript discovery in a cDNA library, and whether the short reads produced by this technology are of value in gene structure annotation.
Library normalization, 454-derived cDNA sequences, assembly, functional annotation, SSRs and the discovery of new transcripts/genes are discussed first, followed by an analysis of genome mapping to determine the ability of the 454 sequences to validate and update gene structures.
Table: Size distribution of the full-length Medicago ESTs from conventional sequencing using SMART technology. Columns: protein length (amino acids); number of unique sequences.
A single 454 run on this sample generated 292,465 reads with an average length of 99 nucleotides and a total length of 29 Mb. In the previous section, conventional sequencing of the plasmid-based library was discussed with reference to the percentage of full-length cDNAs. The remainder of the analysis in this manuscript focuses on 454 sequencing of the normalized (but un-cloned) cDNA population. Conventional sequencing of the plasmid library was performed using a plasmid-located primer close to the 5' end of the cloned cDNA and thus represents the 5' ends of the cDNA population generated by reverse transcription and second-strand cDNA synthesis. By contrast, the 454 library preparation involves random shearing of the normalized but un-cloned cDNA population, fragment end polishing, adaptor ligation, library immobilization, fill-in reaction, single-stranded DNA library isolation and pyrosequencing. The 454 reads therefore originate from random locations within each cDNA and may have either orientation. However, because the original cDNA preparation involved the use of directional adapters that were subsequently used for cloning, we can also recognize the 454 reads that originate from the 5' and 3' ends of the cDNA population by the presence of different adapter sequences and, in the case of the 3' end, also by polyA tracts (both adapter and polyA tracts were removed from the 454 reads before assembly). Searching for the directional adapters identified 41,877 reads containing the 5' end adapter, while 50,594 reads contained either the 3' end adapter or a poly A/T tail. In addition, the complete set of reads had matches to over 50% of the Arabidopsis proteome. The presence of adapter sequences, polyA tracts and hits to the Arabidopsis proteome indicates that the 454 sequences provide good coverage both of the cDNA ends (presumably UTRs) and of the protein open reading frames.
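The end-recognition step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the adapter substrings are the SfiI site cores taken from the primer sequences listed in the Methods, while the choice to search for those exact substrings, the minimum homopolymer length `MIN_POLY_A`, and the function names are assumptions made for illustration. Because 454 reads may have either orientation, each read is checked on both strands.

```python
# Adapter cores copied from the primers in the Methods; searching for these
# exact substrings (with no mismatches) is an assumption of this sketch.
FIVE_PRIME_ADAPTER = "GGCCATTACGGCC"   # SfiIA site within the SMART-Sfi IA oligo
THREE_PRIME_ADAPTER = "GGCCGAGGCGGCC"  # SfiIB site within the CDS-Sfi IB primer
MIN_POLY_A = 12                        # hypothetical minimum polyA/polyT length

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_read(read):
    """Label a read as '3prime', '5prime' or 'internal'."""
    seq = read.upper()
    strands = (seq, reverse_complement(seq))
    # 3' end: the 3' adapter on either strand, or a polyA/polyT tract.
    if any(THREE_PRIME_ADAPTER in s for s in strands):
        return "3prime"
    if "A" * MIN_POLY_A in seq or "T" * MIN_POLY_A in seq:
        return "3prime"
    # 5' end: the 5' adapter on either strand.
    if any(FIVE_PRIME_ADAPTER in s for s in strands):
        return "5prime"
    return "internal"
```

Tallying `classify_read` over all reads would give counts of the kind reported above; a production search would normally tolerate sequencing errors in the adapter rather than require an exact match.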
Figure: Sequence length distribution before and after assembly of sequence reads from a single 454 run of a normalized cDNA library.
Table: Size distribution of 454 assemblies. Column: number of reads per assembly.
Table: Mapping of 454-derived ESTs against the top five most abundant TCs in the MtGI. Columns: % ESTs in MtGI; % 454-derived ESTs. Rows include: RuBisCO small subunit; chlorophyll a/b binding protein.
Many plant genomes have a high proportion of repetitive sequences and many multi-gene families, so short reads from recent duplications might not be distinguishable. To address this, 454-derived sequences were compared to the MtGI with regard to their ability to map to a single location on the genome. Using a threshold of 95% identity plus 95% coverage, 70% of the 454 unique sequences could be mapped to a unique location. Similarly, 70% of the MtGI sequences could be mapped to a unique location using the same threshold. This demonstrates that, although short, 454 reads can be mapped to the Medicago genome with the same confidence as longer ESTs.
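The unique-mapping criterion can be sketched as a simple filter over alignment hits. This is an illustration only: the tuple layout of `hits` and the function name are hypothetical, and the identity and coverage percentages are assumed to have been computed beforehand from the aligner output.

```python
from collections import defaultdict


def uniquely_mapped(hits, min_identity=95.0, min_coverage=95.0):
    """Return the query IDs that pass both thresholds at exactly one locus.

    hits: iterable of (query_id, target_id, target_start, identity_pct,
    coverage_pct) tuples; this layout is an assumption of the sketch.
    """
    loci = defaultdict(set)
    for query, target, start, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            loci[query].add((target, start))
    # A query counts as uniquely mapped when one distinct locus survives.
    return {query for query, places in loci.items() if len(places) == 1}
```

For example, a sequence with two passing hits on different BACs is excluded, while a sequence with one passing hit and one weaker hit is retained:

```python
hits = [
    ("seq1", "bacA", 100, 99.0, 98.0),
    ("seq2", "bacA", 500, 97.0, 96.0),
    ("seq2", "bacB", 900, 96.0, 95.0),
    ("seq3", "bacC", 10, 99.0, 97.0),
    ("seq3", "bacD", 40, 80.0, 99.0),
]
print(uniquely_mapped(hits))  # a set containing seq1 and seq3
```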
Figure: Sequence length distribution before and after assembly of 23 Mb of randomly selected ESTs.
Table: Size distribution of assemblies produced from 23 Mb of conventional ESTs. Column: number of reads per assembly.
A total of 401 unique 454 cDNA sequences contained an SSR. Among the SSRs, 143 are trinucleotides, followed by dinucleotides (132), mononucleotides (56), tetranucleotides (47), pentanucleotides (23), and hexanucleotides (8). AG/CT (127) is the most frequent repeat motif, followed by AAG/CTT (86). Among the ESTs without an MtGI hit, 104 sequences contained an SSR. Among these SSRs, 41 are dinucleotides, followed by trinucleotides (30), mononucleotides (10), tetranucleotides (10) and pentanucleotides (10). Randomly selected ESTs from the previous section generated 121 more sequences containing SSRs, three times more sequences with more than one SSR (25) and twice as many SSRs in compound formation (19), probably due to the longer read and contig lengths.
Thus 454 sequencing has revealed many transcripts not previously detected in Medicago, some of which have matches in protein or EST databases. This supports the idea that deeper EST sequencing using 454 technology will identify a larger number of expressed sequences than conventional EST sequencing and is effective in revealing the expression of many rare transcripts.
Of the 53,796 sequences not found in MtGI, 13,260 and 9,362 could be mapped using blat, with thresholds of 90% identity and 90% coverage and of 95% identity and 95% coverage, respectively, to all available M. truncatula genome sequence in GenBank (1 April 2006). These transcripts had low levels of repeat sequences and are unlikely to represent sequencing errors, since alignments on the genome using lower thresholds (60% identity and 60% coverage) generated only a 10% increase in matches. At this point, it is estimated that ~50–60% of the Medicago euchromatic gene space has been sequenced. Thus the 454 sequences without matches to genomic DNA may be derived from transcripts from the as yet unsequenced euchromatin, from expressed genes residing in the 200+ Mb of heterochromatin, or from contaminating nucleic acid (e.g. from plant-associated bacteria or fungi), although there is no evidence for this last possibility. It is therefore likely that a large fraction of the novel sequences will be useful for gene structure or expression validation in the remainder of the Medicago euchromatin, or as an indication of the presence of expressed genes in the heterochromatin.
Table: Statistics of PASA alignments of M. truncatula 454 cDNA reads on finished Medicago BACs. Rows: total 454 reads (mapped to genome using blat); valid blat alignments; valid sim4 alignments; total valid alignments; number of assemblies.
Table: Gene structure updates generated by PASA alignments of 454 cDNA reads on finished M. truncatula BACs. Column: number of gene updates. Rows: EST assembly extends UTRs; EST assembly alters protein sequence, passes validation; EST assembly found capable of merging multiple genes; EST assembly stitched into gene model requires alternative splicing isoform.
To determine whether the novel transcripts that lack an MtGI match were useful in Medicago gene structure annotation, the 53,796 sequences (assemblies plus singletons) that had no match in the existing MtGI were mapped to 785 M. truncatula Phase 3 BACs using PASA. From the 10,360 sequences that aligned, 3,221 PASA assemblies were generated and incorporated into 2,186 existing IMGAG gene models. An additional 439 assemblies extended the UTRs of 429 genes, and 127 assemblies altered the protein sequence of 127 gene predictions. Five gene models were capable of merging with five other gene models, and eight gene model isoforms could be created. A total of 553 of these PASA assemblies conformed to the [GT, GC]/AG consensus donor/acceptor splice sites.
Thus, as with longer ESTs, 454-generated sequences derived from both known and novel transcripts in a cDNA library can be mapped onto genomic sequence, where their valid spliced alignments support and modify gene predictions, yielding gene structure updates and defining exon-intron boundaries.
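The [GT, GC]/AG splice-site criterion can be made concrete with a short check on intron boundaries. This is a minimal sketch rather than PASA's own validation code: the function names and the coordinate convention (0-based, end-exclusive, forward strand) are assumptions chosen for illustration.

```python
CANONICAL_DONORS = {"GT", "GC"}   # first two bases of the intron
CANONICAL_ACCEPTOR = "AG"         # last two bases of the intron


def intron_is_consensus(genome, intron_start, intron_end):
    """Check one forward-strand intron given 0-based, end-exclusive bounds."""
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        return False  # too short to hold both a donor and an acceptor
    return intron[:2] in CANONICAL_DONORS and intron[-2:] == CANONICAL_ACCEPTOR


def alignment_is_consensus(genome, introns):
    """True when every intron of a spliced alignment is [GT, GC]/AG.

    An alignment with no introns passes trivially in this sketch.
    """
    return all(intron_is_consensus(genome, s, e) for s, e in introns)
```

An alignment on the reverse strand would need the genomic segment reverse-complemented before the same test is applied.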
In this study, two major questions were examined: whether the deep sequencing made cost-effective by the 454 technology leads to significant new gene discovery in a cDNA library, and whether the short reads produced by the 454 technology are of value in gene structure annotation. Approximately 30% of the unique sequences derived from a single 454 run were not found in the Medicago Gene Index, which is derived from over 220,000 ESTs from more than 30 libraries. This illustrates the power of the deep sequencing facilitated by this technology to generate more gene hits and to reveal rare and novel transcripts, albeit only a small portion of each sequence. Although the read lengths are short, 70% of the unique sequences were of sufficient length to map to unique locations on the Medicago genome, the same proportion as for conventionally sequenced ESTs from the MtGI. Functional annotation shows that the 454 sequences cover a broad range of GO categories. In addition, 454 reads can be mapped onto genomic sequence to provide support for, and modifications of, gene predictions. We expect that a similar analysis in other plant species would work synergistically with existing EST data to identify new genes/transcripts and/or support a significant number of existing gene models in a very cost-effective and efficient manner.
The normalized cDNA population and cDNA plasmid library were constructed employing the SMART cloning methodology [17, 18] using the services of Evrogen and Sfi IA/B primers/adapters that permit directional cloning. Reverse transcription was carried out on a pool of RNA from three Medicago truncatula tissues (flowers, stems, and early and late seed). The primer annealing mixture (5 μl) containing 0.3 μg of total RNA, 10 pmol SMART-Sfi IA oligonucleotide (5'-AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCrGrGrG-3') and 10 pmol CDS-Sfi IB primer (5'-AAGCAGTGGTATCAACGCAGAGTGGCCGAGGCGGCCd(T)20-3') was heated at 72°C for 2 min and cooled on ice for 2 min. First-strand cDNA synthesis was then initiated by the addition of PowerScript Reverse Transcriptase (BD Biosciences Clontech) in a final volume of 10 μl, containing 1× First-Strand Buffer (50 mM Tris-HCl (pH 8.3); 75 mM KCl; 6 mM MgCl2), 2 mM DTT and 1 mM of each dNTP, incubated at 42°C for 1.5 h and then cooled on ice. The first-strand cDNA was diluted 5 times with TE buffer, heated at 72°C for 7 min and used for amplification by Long-Distance PCR in a 50 μl reaction containing 1 μl diluted first-strand cDNA, 1× Advantage 2 reaction buffer (BD Biosciences Clontech), 200 μM dNTPs, 0.3 μM SMART PCR primer (5'-AAGCAGTGGTATCAACGCAGAGT-3') and 1× Advantage 2 Polymerase Mix (BD Biosciences Clontech). Eighteen PCR cycles were performed using the following parameters: 95°C for 7 s; 65°C for 20 s; 72°C for 3 min. Amplified cDNA PCR product was purified using the QIAquick PCR Purification Kit (QIAGEN, CA), concentrated by ethanol precipitation and adjusted to a final concentration of 50 ng/μl. For cDNA normalization, 3 μl (about 150 ng) purified ds cDNA plus 1 μl 4× Hybridization Buffer (200 mM HEPES-HCl, pH 8.0; 2 M NaCl) was overlaid with one drop of mineral oil, denatured at 95°C for 5 min and then allowed to anneal at 68°C for 4 h.
The following preheated reagents were added to the hybridization reaction at 68°C: 3.5 μl milliQ water; 1 μl of 5× DNase buffer (500 mM Tris-HCl, pH 8.0; 50 mM MgCl2; 10 mM DTT); 0.5 μl double-strand nuclease (DSN) enzyme. After incubation at 65°C for 30 min, the DSN enzyme was inactivated by heating at 95°C for 7 min. The normalized cDNA samples were diluted by adding 30 μl milliQ water and used for PCR amplification. The PCR reaction (50 μl) contained 1 μl diluted cDNA, 1× Advantage 2 reaction buffer (BD Biosciences Clontech), 200 μM dNTPs, 0.3 μM SMART PCR primer and 1× Advantage 2 Polymerase Mix (BD Biosciences Clontech), and was amplified for 18 cycles of 95°C for 7 s; 65°C for 20 s; 72°C for 3 min. One part of the amplified, normalized, adapter-ligated cDNA population was digested with SfiI and directionally cloned into Clontech's pDNR vector at the SfiA/B sites.
For 454 sequencing, approximately 3 μg of the final normalized, adaptor-ligated cDNA population was sheared via nebulization into small fragments a few hundred base pairs in length. The fragment ends were made blunt, and short adaptors, which provide the priming sequences for both amplification and sequencing of the sample library fragments, were ligated onto both ends. These adaptors also provide a sequencing key (a short sequence of four nucleotides), which was used by the system software to recognize legitimate library reads. Next, the library was immobilized onto streptavidin beads, facilitated by a 5' biotin tag on Adaptor B, and any nicks in the double-stranded library were repaired. Finally, the unbound strand of each fragment (with 5'-Adaptor A) was released, and the quality of the recovered single-stranded DNA library was assessed. Sequences are available for download.
Perfect dinucleotide to hexanucleotide simple sequence repeats were identified using the MISA Perl scripts, specifying a minimum of six dinucleotide and five tetranucleotide to hexanucleotide repeats and a maximum of 100-nucleotide interruption for compound repeats.
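The kind of search these parameters describe can be illustrated with a small, independent sketch. It does not reproduce MISA's code or its output format. The minimum repeat numbers follow the text for dinucleotide and tetra- to hexanucleotide units; the value used for trinucleotides is an assumption, and the function names are hypothetical.

```python
# unit length -> minimum number of repeats; the trinucleotide entry (3: 5)
# is an assumption, the other values follow the text above.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter unit."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d)
                   for d in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, unit, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    found = []
    for k, min_rep in sorted(min_repeats.items()):
        i = 0
        while i + k * min_rep <= len(seq):
            unit = seq[i:i + k]
            if set(unit) <= set("ACGT") and is_primitive(unit):
                count = 1
                while seq[i + count * k:i + (count + 1) * k] == unit:
                    count += 1
                if count >= min_rep:
                    found.append((i, unit, count))
                    i += count * k  # resume after the repeat just recorded
                    continue
            i += 1
    return sorted(found)


def group_compound(ssrs, max_gap=100):
    """Group SSRs separated by at most max_gap nucleotides."""
    groups = []
    prev_end = None
    for start, unit, count in ssrs:
        if groups and start - prev_end <= max_gap:
            groups[-1].append((start, unit, count))
        else:
            groups.append([(start, unit, count)])
        prev_end = start + len(unit) * count
    return groups
```

A group returned by `group_compound` with more than one member corresponds to a compound repeat under the 100-nucleotide interruption limit.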
The authors wish to thank Bill Moskal for assistance in the plasmid library construction and Eli Venter for preparation of the sequence files. Funds for 454 sequencing were provided by the J. Craig Venter Institute (JCVI).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.