PAVE: Program for assembling and viewing ESTs
© Soderlund et al; licensee BioMed Central Ltd. 2009
Received: 26 February 2009
Accepted: 26 August 2009
Published: 26 August 2009
New sequencing technologies are rapidly emerging. Many laboratories are simultaneously working with the traditional Sanger ESTs and experimenting with ESTs generated by the 454 Life Science sequencers. Though Sanger ESTs have been used to generate contigs for many years, no program takes full advantage of the 5' and 3' mate-pair information, hence, many tentative transcripts are assembled into two separate contigs. The new 454 technology has the benefit of high-throughput expression profiling, but introduces time and space problems for assembling large contigs.
The PAVE (Program for Assembling and Viewing ESTs) assembler takes advantage of the 5' and 3' mate-pair information by requiring that the mate-pairs be assembled into the same contig and joined by n's if the two sub-contigs do not overlap. It handles the depth of 454 data sets by "burying" similar ESTs during assembly, which retains the expression level information while circumventing time and space problems. PAVE uses MegaBLAST for the clustering step and CAP3 for assembly, however it assembles incrementally to enforce the mate-pair constraint, bury ESTs, and reduce incorrect joins and splits. The PAVE data management system uses a MySQL database to store multiple libraries of ESTs along with their metadata; the management system allows multiple assemblies with variations on libraries and parameters. Analysis routines provide standard annotation for the contigs including a measure of differentially expressed genes across the libraries. A Java viewer program is provided for display and analysis of the results. Our results clearly show the benefit of using the PAVE assembler to explicitly use mate-pair information and bury ESTs for large contigs.
The PAVE assembler provides a software package for assembling Sanger and/or 454 ESTs. The assembly software, data management software, Java viewer and user's guide are freely available.
ESTs added to GenBank during specified years based on length
Besides the 454 sequencer, the following are also next-generation sequencers that generate high-throughput short reads: Illumina Genome Analyzer  developed by Solexa (Cambridge, UN), Applied Biosystems SOLiD Sequencing [6, 7], and Helicos GSS Sequencing . The 454 sequencer comes with the Newbler assembler, and there are multiple assembly packages tested for the Illumina system [9–15]. However, these are all tested on small genomes, chromosomes or BACs, which will have much shallower coverage compared to EST contigs.
Many current EST projects have generated 454 data and either used traditional EST assembly approaches (e.g. ) or aligned the ESTs to a related genome or assembled transcripts (e.g. ; see Additional file 1: List of EST papers, section 4.B). Laboratories are now transitioning between the traditional Sanger ESTs and new 454 ESTs. For example, our laboratory has a full-length cDNA project using Sanger 5' and 3' ESTs, and three other projects that have a mix of 454 and Sanger ESTs. For our Sanger projects, we developed a software package called PAVE (Program for Assembling and Viewing ESTs) that utilizes mate-pair information. With the release of the 454 sequencer, we extended PAVE to work for the increased depth of the 454 EST data sets.
The ESTs generated by Sanger versus the 454 sequencer differ in number and length. The current Sanger ESTs have an average length of around 650 good bases, but the number of ESTs that are sequenced is generally low. The cDNA sample must first be cloned into a vector (typically either plasmid- or phage-based) to produce a cDNA library and then individual clones are isolated from the library and sequenced, which results in a few thousand clones being sequenced. For example, most maize libraries in GenBank have between 1000 and 10,000 ESTs. The 454 GS FLX sequencer can generate over 200,000 good ESTs per project, but at an average length of only 250 trimmed bases. The new 454 GS FLX Titanium is capable of generating over a million reads of 400 bases with reduced error . This technology currently does not produce identifiable cDNA mate-pairs.
Sanger sequencing can produce mate-pairs where it is known which ESTs are mates based on their name. If the clone is full length, the 5' end will start at the beginning of the transcript, otherwise it can start anywhere within the original mRNA sequence. It is now relatively inexpensive to generate both the 3' and 5' reads of a clone, as the library only needs to be prepped once. To date no software exists that takes full advantage of mate-pair information in order to produce better contigs. CAP3  uses mate-pair information to build contigs but does not require that they be in the same contig. Phrap  uses mate-pair information to flag potential chimeric clones by inserting the chimeric mate into a contig. Clustering programs, such as STACK  and PaCE , will use mate-pair information to join clusters, but these may be broken into multiple contigs when assembled by CAP3 or Phrap. By contrast, PAVE requires mate-pairs to be in the same contig. It does not allow mate-pairs to be split across contigs, and if none of the ESTs in the 5' and 3' sub-contigs overlap, they are joined by n's. PAVE has been used to assemble multiple projects including 797,619 maize ESTs from GenBank .
The advent of 454 sequences presents new challenges to assembly. The increased depth of the 454 data sets can cause CAP3 and other assembly programs to run out of memory. Moreover, assembling large contigs (e.g. > 1000 ESTs) is time-consuming. To address both problems, ESTs contained in another may be removed, such as performed by PlantGDB [23, 24]. PAVE removes ("buries") many of the ESTs that are contained in another ("parent" EST) during assembly; after assembly, the buried ESTs are placed in their parents' respective contigs in order to retain the expression level.
There are quite a few packages for the pipeline processing that EST data requires (see Additional file 1: List of EST papers, section 2). For example, EST2uni  is a pipeline that trims and cleans ESTS using external programs such as Lucy , assembles the ESTs with CAP3 or TGICL , and has annotation capabilities. We find that ESTs generated from different technologies and laboratories require different trimming and cleaning processes, so these functions are not part of the PAVE software package. However, PAVE includes a data management system in order to allow assembling many libraries together while retaining the information about each library. The PAVE system contains a Java program called jPAVE that allows easy verification and display of assemblies in the PAVE MySQL database. The system supports annotation by UniProt  match, GC content, ORFs, R statistic  and comparison of contigs. The assembly software, data management software, and jPAVE viewer are freely available along with a user's guide .
The PAVE system is written in Perl, the jPAVE assembly viewer is written in Java, and the data is stored in a MySQL database. The inputs to PAVE are standard FASTA formatted sequence and optional quality files, which have been trimmed and cleaned.
Guiding the system are the OK_OLAP and OK_CTG functions, which contain rules for acceptance or rejection of MegaBLAST and CAP3 output, respectively. OK_OLAP(s x , s y , m i , d i , h i ) returns true if the sequences s x and s y have a minimum overlap of m, a minimum identity of d%, and a maximum overhang of h. The small amount of overhang allowed by h is necessary on the end of the overlapping portion to allow for poorly trimmed sequence. The parameters m i , d i and h i are set by the user in a configuration file (there are three or more sets as explained below). The OK_CTG(rule n ) function analyzes the results of CAP3 to determine whether to accept or reject the contig. There are three sets of rules that have been designed by us based on inspection of contigs using the jPAVE graphical interface, and will be discussed in the section on heuristics.
The clique step creates seed contigs from the file of EST-against-EST comparison results, as follows: (i) Compute mate-pair cliques where each mate-pair must pass OK_OLAP(s x , s y , m1, d1, h1) with every other mate-pair in the clique. Each clique is assembled and if the CAP3 results pass OK_CTG(rule1), it is retained as a seed contig. (ii) Compute singleton cliques where each singleton must pass OK_OLAP(s x , s y , m1, d1, h1) with every other singleton in the clique. Each clique is assembled and if the CAP3 results pass the OK_CTG(rule2) function, it is retained as a seed contig. (iii) Each remaining mate-pair is made into a contig, where the mates are joined if they pass OK_OLAP(s x , s y , m2, d2, h2). (iv) Each remaining singleton is made into a contig. The purpose of assembling cliques is that the existence of a set of mate-pairs (or singletons) that all mutually overlap creates a strong contig and it reduces the time of the next step. This step produces the initial contig consensus sequences (CCS).
The TC step computes the set of TCs from the file of CCS-against-CCS comparison results. Each TC is a connected graph where each node is a contig, an edge exists between each two contigs that pass OK_OLAP(s x , s y , m i , d i , h i ), and the edge is weighted by the bit-score. The nodes of a given TC are incrementally assembled together where the order is based on bit score and the number of ESTs. If a pair assembles correctly according to OK_CTG(rule3), the two nodes are collapsed into one. The TCs are assembled in parallel, i.e. each TC is assigned to a separate compute node for incremental assembly. The parallelism is implemented with the standard Unix fork function to take advantage of multiple processors.
The TC step is executed n times using the values (m i , d i , h i ), i = 3 to n+3, where the n corresponding parameters are provided in the configuration file. For Sanger ESTs, we typically use two iterations, which merges over-split contigs; this approach of using a second assembly has been used by others, e.g. PlantGDB . For 454 reads, we use up to eight iterations in order to allow for incremental burying, i.e. to prevent very large contigs from being assembled.
Two sub-contigs are not joined by n's until after assembly, at which point the orientation of the two sub-contigs needs to be computed. This is computed by a majority rule on the orientation and forward/reverse information of the ESTs.
A chimeric clone may have one half of the mate-pair belong in one contig and the other half in another contig. By forcing mate-pairs to be in contigs together, a chimeric clone will generally be in a contig of its own. The advantage is that it reduces chimeric multi-clone contigs resulting from chimeric clones. The disadvantage is that it may cause redundant contigs, as one half of the single mate-pair contig may align perfectly to another contig.
To substantially decrease usage of computer time and space, ESTs are buried both before and during the assembly process, i.e. if EST B is contained in EST A, it can be "buried" in the "parent" EST A and removed from assembly. ESTs may be buried prior to assembly when one EST is contained in another based on the BLAST coordinates; the amount of mismatches allowed is a user-supplied parameter. During assembly, additional ESTs are buried if one is contained within another based on CAP3 coordinates, even though there may be mismatches between the buried and parent ESTs. For mate-pairs, both mates must be contained in the same parent clone in order to be buried. Buried ESTs are recorded in a file and when the PAVE algorithm completes and the contigs are loaded into the database, the buried clones are assigned the same contig as the parent.
The algorithm for burying ESTs (both initially and during assembly) ensures that the remaining unburied ESTs are sufficient to provide a good coverage of the contig. Briefly, the algorithm builds a containment tree and then performs a breadth-first traversal of the tree, retaining all top-level ESTs plus a number of ESTs from the next-highest levels based on configuration parameters. The breadth-first search serves to distribute the retained ESTs evenly across the contig. After retaining a sufficient number for good coverage, the rest of the ESTs in the tree are buried. For burial during assembly, the unburied ESTs are re-CAP'ed, and if they fail to assemble together, the attempt to bury the clones is rejected. Since this re-CAP takes time, the burial is only attempted if there are at least Y (default 100) ESTs in a contig and at least × (default 25) ESTs can be buried.
A post-processing routine reassembles the contigs that have buried ESTs using all ESTs in the contig, primarily in order to provide accurate coordinates for the buried clones. The re-CAP occasionally fails due to too many ESTs or due to the mismatches from the CAP3 buried ESTs; in these cases, the ESTs are positioned under their parent clone and shown in red.
The PAVE system
As we have multiple PAVE projects, each with many EST libraries, it was advantageous to create an organized data management system as part of PAVE. This is done with a set of configuration files and a rigid directory structure, where there is a directory for libraries and another for assembly projects. By separating the libraries from the assemblies, different projects can be created with overlapping sets of libraries. The package provides four programs: (i) The loadLibrary.pl script loads one or more libraries of ESTs and metadata (i.e. organism, tissue, stage, etc) into the MySQL database. (ii) The runPAVE.pl script assembles the ESTs and enters the results into the database. (iii) The capBuried.pl script runs CAP3 on all the contigs with buried ESTs in order to assign correct coordinates. (iv) The jPAVE program provides some annotation and an interactive display.
After assembling the contigs, the runPAVE.pl script computes the R statistic , which identifies differentially expressed genes across the libraries. It also finds the longest clone (e.g. for microarrays) per contig, and marks contigs that are suspect based on having excessive mixed orientation or more than two stacks of ESTs. The jPAVE program's annotation module can be run to compute GC content, ORFs, SNPs, CCS against UniProt comparison, and CSS against CSS (to help detect paralogs and alternative splicing).
Three sets of results are provided: (1) maize Sanger ESTs with mate-pairs, (2) trichome 454 ESTs, (3) a comparison of PAVE to CAP3 and TGICL with a benchmark set of Sanger mate-pair ESTs. The PAVE parameters used for each assembly are provided in Additional file 2. The Sanger and benchmark assemblies can be viewed from the PAVE site .
Assembly of Sanger ESTs
PAVE assembly results
Contigs joined by n's
Mates joined by n's
ESTs in contigs:
Length of contigs:
This maize Sanger assembly was executed on a system with two dual-core AMD Opteron 2 GHz processors and 12G RAM. Though the execution time is 6.7 days, very little human time is required to run the assembly. That is, the dataset does not have to be broken up, assembled separately and then merged, as would be necessary with programs that cannot handle datasets of this size. For example, the largest data set we can assemble on this machine with TGICL is 228,000 ESTs.
Assembly of 454 ESTs
Our 454 data set is 415,559 ESTs from the stems of Solanum arcanum (Gang et al. manuscript in preparation) from the Solanum Trichome Project . The average length of the cleaned ESTs is 218 bases and the median is 234 bases. The results are shown in Table 2. The largest contig is 19,739 ESTs with a 1283 bases consensus sequence; with the narrow depth, the ability to bury all but 342 of ESTs greatly speeds up assembly and display. In fact, it would not assemble without burying as CAP3 runs out of memory when assembling all 19,739 ESTs.
Comparisons with other programs
We compared PAVE to TGICL and CAP3 for two reasons. First, the majority of EST papers use one of these two programs (see Additional file 1: List of EST papers, section 4). Second, all three assemblers use CAP3 for alignment, so that the comparison specifically addresses the effectiveness of using mate-pair information as a strict constraint, rather than variables in the alignment algorithm. To compare the assemblies, a set of benchmark contigs were created that contained 30,853 mate-pairs (61,706 ESTs) which aligned to 5437 of the KOME rice FL-cDNAs  (see Additional file 3: Benchmark contigs).
Comparison of methods
3 h 25 m
2 h 12 m
As these are well-aligned ESTs, there are few merged contigs. Most of the split contigs for TGICL and CAP3 are due to mate-pairs that do not overlap, that is, 3514 PAVE contigs were joined by n's of which 53 are single mate-pairs. For example, Figure 2 shows two contigs joined by n's, which is a split contig in the CAP3 and TGICL assemblies. These split contigs account for many of the extra contigs in the TGICL and CAP3 assemblies, an excess which often causes the number of tentative genes to be over-estimated.
Since ESTs came into widespread use, the typical approach to assembling them has been to cluster the ESTs and assemble with CAP3 or Phrap. Most projects used CAP3 alone or TGICL for assembly, where TGICL clusters with a version of MegaBLAST and then assembles with CAP3 (see Additional file 1: List of EST papers, section 4). There are quite a few programs written specifically to cluster ESTs, but only miraEST  was specifically written to assemble ESTs (see Additional file 1: List of EST papers, section 1); note, CAP3 and Phrap were written for genome assembly. In comparison, there are many assemblers developed for whole genome sequencing based on Sanger reads, and a number of assemblers have emerged for Illumina reads.
As shown in our results, improved assemblies can be computed if both the 3' and 5' reads are used; this is similar to the improvements in assembly that are obtained when paired reads are used in whole genome assembly , and has also been shown for the Illumina paired short-reads . In the early generation of the 3' and 5' Sanger ESTs, the mate-pair information could not be trusted, but for many years now, that has no longer been the case as laboratory tracking systems have improved. Though the use of Sanger sequencing for ESTs will probably decline, the historical data still has considerable value. As shown in Table 1 there are hundreds of thousands of Sanger ESTs available that should continue to be a valuable resource for years to come, and those with mate-pair information can take advantage of the PAVE assembler.
Though PAVE assembles the 454 ESTs with acceptable results, the new short read assemblers that are based on prefix trees and extensions (e.g. SSAKE ) or de Buijin graphs (e.g. Velvet , EULER-USR ) could provide better results when there is a lot of error or repeats. We are currently working on modifying PAVE so that it can use the results of various assemblies to take advantage of these custom assemblers and integrate the results with the PAVE mate-pair assemblies of Sanger reads.
The jPAVE assembly viewer provides an excellent way to evaluate contigs. We have assembled a large variety of libraries with different characteristics, and the occasional oddly-aligning contigs often cannot be explained nor anticipated. Having a versatile query and display program such as jPAVE is essential to understanding the wide range of problems that can occur with EST assembly. Moreover, it allows the biologist to diagnose the problems that occur with their specific libraries. For example, by assembling ESTs without the heuristics and then viewing them in jPAVE, the biologist can inspect the amount of chimerism and 3' slippage.
The PAVE system is created for large projects with multiple libraries. This allows comparing across libraries for differentially expressed genes, which will become more relevant with the large number of ESTs generated by the next-generation sequencing machines. The PAVE system also has a web interface (see  for examples), which allows extensive querying ability on the libraries (e.g. show the contigs that contain ESTs from one library that appear to be down-regulated in another). The web interface is not part of the distribution but is available upon request (email firstname.lastname@example.org). The software, jPAVE viewer and User's Guide are available from our PAVE website , where the User's Guide comes with a demonstration dataset.
Availability and requirements
◦ Project name: PAVE
◦ Project home page: http://www.agcol.arizona.edu/software/pave
◦ Operating system: Unix/Linux
◦ Programming languages: Perl and Java
◦ Other requirements: Java 1.5 or higher, MySQL
◦ License: GNU Public License
◦ Any restriction to use by non-academics: none.
Expressed Sequence Tags
Bacterial Artificial Chromosome
Knowledge-based Oryza Molecular biological Encyclopedia
Open Reading Frame
Single Nucleotide Polymorphism
Transitive Closure (i.e. connected graph)
We thank James Hatfield and Karl Haller for their original work on PAVE. We thank David Gang, Josh Udall, Guo-liang Wang and Yeisoo Yu for feedback on the contigs and interface; David Gang and William Nelson for their input on the manuscript; and a special thanks to William Nelson for his testing of the final PAVE system and the example of bad joins shown in Figure 4. The work was funded in part by the National Science Foundation [0501857, 0321683].
- Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al: Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991, 252 (5013): 1651-1656. 10.1126/science.2047873.View ArticlePubMedGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-380.PubMed CentralPubMedGoogle Scholar
- Rothberg JM, Leamon JH: The development and impact of 454 sequencing. Nat Biotechnol. 2008, 26 (10): 1117-1124. 10.1038/nbt1485.View ArticlePubMedGoogle Scholar
- 454 Life Sciences, a Roche Company. [http://www.454.com]
- Solexa/Illumina. [http://www.illumina.com]
- Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM: Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005, 309 (5741): 1728-1732. 10.1126/science.1117389.View ArticlePubMedGoogle Scholar
- Applied Biosystems SOLID Sequencing. [http://www.appliedbiosystems.com]
- Helicons GSS Sequencing. [http://www.helicosbio.com]
- Chaisson MJ, Brinza D, Pevzner PA: De novo fragment assembly with short mate-paired reads: Does the read length matter?. Genome Res. 2009, 19 (2): 336-346. 10.1101/gr.079053.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralView ArticlePubMedGoogle Scholar
- Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008, 18 (5): 810-820. 10.1101/gr.7337908.PubMed CentralView ArticlePubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 2007, 17 (11): 1697-1706. 10.1101/gr.6435207.PubMed CentralView ArticlePubMedGoogle Scholar
- Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 2008, 18 (5): 802-809. 10.1101/gr.072033.107.PubMed CentralView ArticlePubMedGoogle Scholar
- Jeck WR, Reinhardt JA, Baltrus DA, Hickenbotham MT, Magrini V, Mardis ER, Dangl JL, Jones CD: Extending assembly of short DNA sequences to handle error. Bioinformatics. 2007, 23 (21): 2942-2944. 10.1093/bioinformatics/btm451.View ArticlePubMedGoogle Scholar
- Warren RL, Sutton GG, Jones SJ, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007, 23 (4): 500-501. 10.1093/bioinformatics/btl629.View ArticlePubMedGoogle Scholar
- Eveland AL, McCarty DR, Koch KE: Transcript profiling by 3'-untranslated region sequencing resolves expression of gene families. Plant Physiol. 2008, 146 (1): 32-44. 10.1104/pp.107.108597.PubMed CentralView ArticlePubMedGoogle Scholar
- Shin H, Hirst M, Bainbridge MN, Magrini V, Mardis E, Moerman DG, Marra MA, Baillie DL, Jones SJ: Transcriptome analysis for Caenorhabditis elegans based on novel expressed sequence tags. BMC Biol. 2008, 6: 30-10.1186/1741-7007-6-30.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.PubMed CentralView ArticlePubMedGoogle Scholar
- The Phrap Assembler. [http://www.phrap.org]
- Christoffels A, van Gelder A, Greyling G, Miller R, Hide T, Hide W: STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res. 2001, 29 (1): 234-238. 10.1093/nar/29.1.234.PubMed CentralView ArticlePubMedGoogle Scholar
- Kalyanaraman A, Aluru S, Kothari S, Brendel V: Efficient clustering of large EST data sets on parallel computers. Nucleic Acids Res. 2003, 31 (11): 2963-2974. 10.1093/nar/gkg379.PubMed CentralView ArticlePubMedGoogle Scholar
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2009, D26-31. 10.1093/nar/gkn723. 37 Database
- PlantGDB Assembly. [http://www.plantgdb.org/prj/ESTCluster/PUT_procedure.php]
- Dong Q, Schlueter SD, Brendel V: PlantGDB, plant genome database and analysis tools. Nucleic Acids Res. 2004, D354-359. 10.1093/nar/gkh046. 32 Database
- Forment J, Gilabert F, Robles A, Conejero V, Nuez F, Blanca JM: EST2uni: an open, parallel tool for automated EST analysis and database creation, with a data mining web interface and microarray expression data integration. BMC Bioinformatics. 2008, 9: 5-10.1186/1471-2105-9-5.PubMed CentralView ArticlePubMedGoogle Scholar
- Li S, Chou HH: LUCY2: an interactive DNA sequence quality trimming and vector removal tool. Bioinformatics. 2004, 20 (16): 2865-2866. 10.1093/bioinformatics/bth302.View ArticlePubMedGoogle Scholar
- Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, et al: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19 (5): 651-652. 10.1093/bioinformatics/btg034.View ArticlePubMedGoogle Scholar
- The UniProt Consortium: The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009, D169-174. 10.1093/nar/gkn664. 37 Database
- Stekel DJ, Git Y, Falciani F: The comparison of gene expression from multiple cDNA libraries. Genome Res. 2000, 10 (12): 2055-2061. 10.1101/gr.GR-1325RR.PubMed CentralView ArticlePubMedGoogle Scholar
- PAVE software package. [http://www.agcol.arizona.edu/software/pave]
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7 (1–2): 203-214. 10.1089/10665270050081478.View ArticlePubMedGoogle Scholar
- Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart – biological queries made easy. BMC Genomics. 2009, 10: 22-10.1186/1471-2164-10-22.PubMed CentralView ArticlePubMedGoogle Scholar
- The Maize FL-cDNA Project. [http://www.maizecdna.org]
- The Solanum Trichome Project. [http://www.trichome.msu.edu]
- Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, et al: Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science. 2003, 301 (5631): 376-379. 10.1126/science.1081288.View ArticlePubMedGoogle Scholar
- Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Muller WE, Wetter T, Suhai S: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004, 14 (6): 1147-1159. 10.1101/gr.1917404.PubMed CentralView ArticlePubMedGoogle Scholar
- Weber JL, Myers EW: Human whole-genome shotgun sequencing. Genome Res. 1997, 7 (5): 401-409.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.