Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery
© Parchman et al. 2010
Received: 20 January 2010
Accepted: 16 March 2010
Published: 16 March 2010
Skip to main content
© Parchman et al. 2010
Received: 20 January 2010
Accepted: 16 March 2010
Published: 16 March 2010
Massively parallel sequencing of cDNA is now an efficient route for generating enormous sequence collections that represent expressed genes. This approach provides a valuable starting point for characterizing functional genetic variation in non-model organisms, especially where whole genome sequencing efforts are currently cost and time prohibitive. The large and complex genomes of pines (Pinus spp.) have hindered the development of genomic resources, despite the ecological and economical importance of the group. While most genomic studies have focused on a single species (P. taeda), genomic level resources for other pines are insufficiently developed to facilitate ecological genomic research. Lodgepole pine (P. contorta) is an ecologically important foundation species of montane forest ecosystems and exhibits substantial adaptive variation across its range in western North America. Here we describe a sequencing study of expressed genes from P. contorta, including their assembly and annotation, and their potential for molecular marker development to support population and association genetic studies.
We obtained 586,732 sequencing reads from a 454 GS XLR70 Titanium pyrosequencer (mean length: 306 base pairs). A combination of reference-based and de novo assemblies yielded 63,657 contigs, with 239,793 reads remaining as singletons. Based on sequence similarity with known proteins, these sequences represent approximately 17,000 unique genes, many of which are well covered by contig sequences. This sequence collection also included a surprisingly large number of retrotransposon sequences, suggesting that they are highly transcriptionally active in the tissues we sampled. We located and characterized thousands of simple sequence repeats and single nucleotide polymorphisms as potential molecular markers in our assembled and annotated sequences. High quality PCR primers were designed for a substantial number of the SSR loci, and a large number of these were amplified successfully in initial screening.
This sequence collection represents a major genomic resource for P. contorta, and the large number of genetic markers characterized should contribute to future research in this and other pines. Our results illustrate the utility of next generation sequencing as a basis for marker development and population genomics in non-model species.
Large numbers of molecular markers and sequence data from across the genome are playing an increasingly important role in population genomic studies of fine-scale genetic variation and the genetic basis of traits . Nevertheless, we lack genomic resources for most non-model organisms and whole genome sequencing is still largely impractical for most eukaryotes. Transcriptome, or Expressed Sequence Tag (EST), sequencing is an efficient means to generate functional genomic level data for non-model organisms or those with genome characteristics prohibitive to whole genome sequencing. EST sequencing is an attractive alternative to whole genome sequencing because the majority of most eukaryotic genomes is non-coding DNA, and EST sequences lack introns and intragenic regions that render analysis and interpretation of data more difficult . ESTs thus have a high functional information content, and often correspond to genes with known or predicted functions [2, 3]. Large collections of EST sequences have proven invaluable for gene annotation and discovery [2, 4], comparative genomics , development of molecular markers [6, 7], and for population genomic studies of genetic variation associated with adaptive traits . Nonetheless, until recently, traditional laboratory methods for the development of EST resources have required costly and time consuming approaches involving cloning, cDNA library construction, and many labor intensive Sanger sequencing runs .
Massively parallel sequencing technologies, such as 454 pyrosequencing, remove many time consuming steps involved in Sanger sequencing of ESTs and have facilitated transcriptome sequencing at a fraction of the time and cost previously required [5, 9–11]. At present, a single run on a 454 GS XLR70 Titanium pyrosequencer can produce more than 106 sequences averaging greater than 300 base pairs (bp) in length. The de novo assembly of the large numbers of short reads produced from this and similar technologies is a significant challenge for whole genome sequencing of large and complex genomes. In contrast, for transcriptome sequencing, de novo assembly is facilitated by the possibility of increased coverage depth (number of reads per nucleotide in the template) for the much smaller number of nucleotides in the transcriptome than in the whole genome . In addition, the reduced amount of repetitive DNA found in genes compared to non-coding regions ameliorates one of the principal obstacles to de novo assembly of short reads . Whereas most applications of parallel sequencing of ESTs have involved model organisms with draft genomes available to aid in assembly [4, 13, 14], recent studies have demonstrated highly successful de novo assemblies of 454 EST data for organisms with no prior genomic resources [5, 7, 15, 16]. The generation of such large-scale sequence data will enable functional analyses that were previously limited to model organisms and their rapid application in ecologically important taxa . Here, we utilize pyrosequencing of cDNA to characterize the transcriptome of lodgepole pine (Pinus contorta) and to develop genomic resources to support further research in this and other pines.
P. contorta is an ecologically and economically important tree that is widespread in the mountainous regions of western North America . It is a fire-adapted species that mediates regeneration after disturbance, has a major impact on forest structure and ecology, and is a foundation species of many montane forest ecosystems. It is one of the most variable pines, and grows in a variety of conditions ranging from low elevations to timberline  where it has experienced and evolved in response to diverse selection pressures including that from variation in seed predator communities [20–23] and fire regime [24, 25]. The current mountain pine bark beetle (Dendroctonus ponderosae) epidemic is causing unprecedented mortality of P. contorta throughout the Rocky Mountains , which is likely to cause rapid and massive changes in community structure and ecosystem processes. Consequently, a greater understanding of fine-scale population genetic variation and the genetic control of traits important to these forests would be beneficial and timely.
Although a large number of EST sequences for loblolly pine (P. taeda) exist in public databases (e.g., NCBI), far fewer resources exist for P. contorta (1 EST prior to 2010, as of January 2010 ca. 40,000 ESTs) and other pines, despite the importance of the genus. This paucity exists in part because pines have enormous genomes (10,000-40,000 mega-base pairs vs. 115 Mbp in Arabidopsis thaliana) with large amounts of repetitive DNA [27, 28], making whole genome sequencing projects difficult or impractical. The construction of large EST collections is thus the most promising approach for providing functional genomic level information in pines . Whereas other labs are currently generating P. contorta ESTs using Sanger sequencing (K. Ritland and J. Boehlman, pers. comm.), additional sequencing effort is needed to increase genomic level resources. The development of genomic resources for P. contorta should facilitate basic and applied research on the genetics and evolutionary ecology of this species and its role in maintaining forest health and ecosystem function [29, 30]. In addition, EST collections for P. contorta will contribute to the development of molecular markers for other pines and facilitate comparative genomics and the study of adaptive variation across the genus.
Here we describe 454 pyrosequencing of P. contorta cDNA and assess the utility of this approach for transcriptome characterization and marker discovery in a species with a large and complex genome. Normalized cDNA collections from multiple tissues and individuals were used to sample large numbers of expressed genes and to detect simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs). We first describe the assembly and functional annotation of EST sequences, and the level of transcriptome coverage provided by our sequence data. Second, we discuss the detection and characterization of a surprisingly large number of sequences representing retrotransposons. Finally, we utilize our assembled sequence data for the development of a variety of gene-based markers for population genomic studies, including SSRs occurring within regions that are conserved with another pine species, and SNPs occurring in regions with many reads and deep coverage. We designed high quality PCR primers for a large number of the SSRs we characterized, providing an immediately available resource of genetic markers for pines. Along with other recent studies [5, 7, 15, 16], our results demonstrate the utility and highlight some of the challenges of next generation transcriptome sequencing applied to non-model organisms.
We first evaluated how varying some of the starting parameters affected de novo assembly of reads by running 18 different assemblies with different combinations of minimum match percentage, minimum match size, and gap penalty. Predictably, the number of assembled sequences and the length of assembled contigs increased with decreasing match percentage and decreasing match size (Table 1). However, the number of contigs was largest with a match percentage of 90%, suggesting that a substantial number of contigs (including paralogs) may be collapsed and joined when match percentages are dropped as low as 85% and a larger number of reads are assembled into a smaller number of contigs (Table 1). Analyses throughout this paper are based on a combination of reference-based and de novo assemblies. We first executed a reference-based assembly utilizing a set of approximately 19,000 unigenes available for P. taeda using a minimum match percentage of 88%, a minimum match size of 19 bp, and a gap penalty of 30. This assembly was executed to provide a set of contigs with sequences that are conserved between P. taeda and P. contorta and to use the comparative sequence alignments to design molecular markers with high cross-species transferability. De novo assembly of the remaining reads was run with a minimum match percentage of 93%, a minimum match size of 19 bp, and a gap penalty of 30. This combination of reference-based and de novo assemblies increased the number of assembled reads and the number of contigs when compared to pure de novo assembly (Table 1).
Results and characteristics of assemblies run with different parameter settings.
Number of contigs and average length
Number of assembled reads
Number of contigs and average length
Number of assembled reads
Combination of reference-based (match % = 88) and de novo assemblies (match % = 93)
454 EST matches to annotated protein databases.
Combined Set (303,480)
Matches to database
Unique matches to proteins
Matches to database
Unique matches to proteins
Summary and taxonomic source of BLAST matches to 454 ESTs.
Summary of plant ESTs with BLAST matches to retrotransposons.
# of ESTs
# of ESTs matching RTs
In research on other genomes, small segments of retrotransposon sequences have been found within ESTs containing genes, which could contribute to the large number of BLAST searches matching to retrotransposon-like sequences . To assess whether our sequences that aligned to Pinus retrotransposons IFG7, PpRT1, and GYMNY were located in coding regions, following  we assembled the corresponding sequences into 553 contigs using SeqMan Ngen. These contigs were then aligned to UniRef50 using BLASTx with an E value threshold of 10-4. If pieces of retroelement sequences were in genes, we would expect these contigs to align with known proteins in UniRef50 that were non-retroelement products . Of the 553 query sequences, 315 contigs had hits to UniRef50 proteins, and these sequences matched to 105 unique protein accession numbers. These accession numbers were imported into the online UniProt database search tool (http://www.uniprot.org/jobs/; ) and manually inspected to assess whether the associated protein products belonged to retroelements or not. Of the 105 unique proteins, 19 proteins (18.1%) were not retroelement protein products determined by either protein name, known function, conserved motifs, or affiliated gene ontology terms. This suggests that the majority of the sequences that were similar to known retroelements are likely to reflect transcriptionally active retrotransposons, although a significant proportion (18%) of retrotransposon-like sequences may represent transcribed regions containing short pieces of retroelement-like sequences.
SSRs in 454 ESTs.
Type of SSR repeat
Reference based assembly
De novo assembly of remaining reads
We were able to design high quality PCR primers with optimal expected product sizes and other properties that should facilitate successful amplification for a large portion these SSRs using BatchPrimer3 . Primers were constructed for nearly a third of the SSRs identified, but quality priming sites occurred in a much higher proportion of tri- and tetra-nucleotide repeats than of di-nucleotide repeats (Table 5). Information on the name of the sequence containing each SSR, SSR motif, number of repeats, expected length of PCR product, sequences and positions of forward and reverse primers, and the GC content and melting temperature for each primer is included in Additional file 1 : Supplemental Table S1. Of 96 loci tested for succesful PCR amplification in a single trial, 70% successfully amplified in P. contorta and 65% successfully amplified in P. ponderosa. In addition, more than 50% of the tested SSRs were polymorphic across panels of eight individuals of P. contorta and P. ponderosa.
Using Seqman Pro (DNASTAR, Inc.), we identified a substantial number of SNPs in our largest and most deeply covered contigs. The 20 BLAST-annotated contigs with the largest numbers of reads contained 301 SNPs with alternate alleles in at least 20% of reads and minimum coverage depth of 10×. These contigs represent 27,496 bp of sequence data, giving a SNP occurrence rate of 1.1 SNPs per 100 bp. Across all the contigs with more than 20 reads (658 contigs containing 614,125 nucleotides) we identified a total of 3,707 (0.6 high quality SNPs per 100 bp) high quality SNPs with frequency of the alternative allele being at least 20% and a minimum coverage depth of 8×.
Normalized cDNA collections from multiple tissues and individuals allowed us to sequence a large fraction of the P. contorta transcriptome using 454 pyrosequencing. The approximately 1.8 × 108 base pairs of data produced here represent a substantial sequence resource for P. contorta and will contribute to genomic data available for the genus Pinus. Below we discuss the characteristics and level of transcriptome coverage in our data, the abundance of transcriptionally active retrotransposons, the characterization of different molecular marker types, and future prospects for population genomics studies.
As with other recent studies [5, 7, 15, 16], our results indicate that short reads from 454 sequencing runs can be effectively assembled and used to readily characterize the gene space of non-model organisms. The large number and length of reads generated from a half plate pyrosequencing run on a 454 GS XLR70 Titanium instrument resulted in a relatively high expected average transcriptome coverage depth (3.6×). As expected with the increased read lengths from the 454 GS XLR70 Titanium instrument, our contigs were on average larger (mean = 500 bp) than those assembled in previous studies that used earlier 454 technologies with shorter reads (e.g., 197 bp, ; 247 bp, ; 440 bp, ). In addition, a large percentage of our contigs were greater than 800 bp in length and had good coverage depth (Figs. 3, 4). High quality of assembled contigs was indicated by a high proportion of contigs matching to known proteins using BLAST searches and by the ready PCR amplification of SSR markers developed in contigs.
Nonetheless, a relatively small portion (48%) of reads were assembled into contigs, which is significantly less than that reported for several other recent 454 transcriptome assemblies (e.g., 91%, ; 88%, ; 90%, ). As a consequence, coverage depth and contig length was lower than expected for many contigs. Large numbers of singletons could result from a variety of causes, including the assembly algorithm used, artifacts of cDNA normalization, genes expressed at low levels, contaminants from other organisms such as bacterial or fungal parasites, or 454 sequencing errors . Transcriptome assembly is additionally hindered by alternative splicing, among other issues [5, 17]. Long repeat regions, which are particularly common in pines, are known to cause difficulties with de novo assembly [12, 43]. One factor potentially influencing difficulty in assembly is the abundance of both simple and complex repeats. Although we removed a substantial number of reads prior to assembly, both long SSR regions and retroelements were prevalent in our data and could have contributed to difficulty with assembly. Additional filtering steps could improve the fraction of reads assembling into contigs , although this could also involve an overall loss of information. Nonetheless, many singletons were high quality reads and matched to proteins in BLAST searches, highlighting that they are still a very important source of information. Given that next generation sequencing projects will soon become common in non-model organisms, increases in the amount and quality of data should result in the improvement of de novo assembly algorithms over time .
Estimating the number of genes and the level of transcript coverage represented in an EST collection is an important issue for transcriptome sequencing projects, but is difficult or impossible without a completely annotated reference genome sequence. We indirectly evaluated transcriptome coverage breadth by first determing the number of unique genes detected in our sequence collection using BLAST. The large number of sequences that matched in BLAST searches to unique proteins (Table 2) indicates that our 454 sequencing reads identified a substantial portion of the genes in P. contorta. A large portion of the 17,321 unique genes had best BLAST hits to plant proteins, with a smaller portion hitting to other taxonomic groups (Table 3). A similarly large number of these were assigned to a wide range of gene ontology categories (Fig. 5), indicating that a wide diversity of transcripts are represented by our sequence data. Furthermore, many of the contigs and singletons without BLAST hits likely represent additional genes not represented in the annotated protein databases we searched, or genes that lack BLAST matches due to short length. If we assume a similar number of genes occur in pines as in Arabidopsis (25,000, ), our annotated sequences are likely to represent more than half of the genes in P. contorta. To evaluate the extent to which individual transcripts were fully covered, we quantified coverage of full length A. thaliana orthologs by our 454 contigs. A substantial number of A. thaliana orthologs were entirely covered by 454 contigs, with increasingly complete coverage for contigs with higher coverage depths (Fig. 6A). Similarly, the completeness of ortholog coverage decreased with decreasing contig coverage depth and increasing length of the A. thaliana sequence, indicating that additional sequencing would be warranted for more comprehensive transcriptome coverage (Fig. 6A, B).
ESTs with BLAST hits to distant taxa may represent proteins not well-characterized in closely related taxa, but could also represent RNA from other organisms present in sampled tissues. The substantial number of BLAST hits to known fungal proteins in our ESTs (Table 3) is most likely due to the presence of fungi in the needles and or developing conelets we sampled. This is not surprising given the prevalence of endophytic and symbiotic fungi in conifer tissues [45, 46]. Similarly,  reported matches in BLAST searches of their 454 ESTs to a substantial number of non-metazoan proteins, and identified the presence of an important intracellular parasite known to affect butterfly population dynamics. Such findings point to the utility of large-scale transcriptome sequencing and annotation for detecting parasites or symbionts present in sampled in tissues, just as is done more generally in environmental genomics .
Retrotransposons are known to play a prominent role in the evolution of genome size and complexity across eukaryotes . For example 30-50% of the mammalian genome is composed of retroelements , and they comprise a substantial proportion of most plant genomes [49, 50]. In addition, the proliferation of retrotransposons may be a major factor contributing to the large genome sizes of pines, where retrotransposons may account for almost 2% of the genome . Although they are typically not actively transcribed into RNA, their presence in EST collections indicates that they can be transcriptionally active, and in some cases at very high levels (Table 4; [51, 52]). BLAST searches of known retrotransposon sequences in plants as well as results from BLAST annotation of our contigs and singletons indicate that more than 6% of our 454 reads represent retrotransposon-like sequences. They are present in our data at a level an order of magnitude higher than in comparable EST collections for other plants, and two orders of magnitude higher than in other conifers (Table 4). We recognize that a variety of issues may contribute to this result, including possible variability of cDNA normalization across studies or tissues. Nonetheless this result highlights the abundance of retrotransposons that may occur in large-scale EST sequencing projects and deserves attention for several reasons. First, repetitive DNA is a known difficulty for assembly of short sequences into contigs, and the prevalence of retrotransposon sequences could contribute to assembly difficulties. Consequently, the characterization of retrotransposons in our 454 ESTs will be useful for planning future transcriptome sequencing in this species, where it may be desirable to use laboratory methods to remove retroelements from the cDNA sequencing pool and enrich for non-repetitive templates. Finally, given that a major goal of our study was molecular marker development, identification of the retrotransposon-derived reads is crucial to avoid developing and genotyping molecular markers residing in such sequences.
Polymorphic genetic markers are important for research involving population genetic structuring, demography, relatedness, and the genetic basis of adaptive traits [53–55]. Next generation transcriptome sequencing leads to superior resources for the development of such markers not only because of the enormous amount of sequence data in which markers can be identified, but also because discovered markers are gene-based. Such markers are advantageous because they facilitate the detection of functional variation and the signature of selection in genomic scans or association genetic studies [2, 56]. Currently few genetic marker resources exist for P. contorta (but see ). The large number of SSRs and SNPs we detected provide a wealth of markers potentially useful to applications ranging from population genetics, linkage mapping, and comparative genomics, to gene-based association studies aimed at understanding the genetic control of adaptive traits.
EST-based SSRs are advantageous compared to SSRs located in non-transcribed regions owing to their higher amplification rates and cross-species transferability . We have identified large numbers of SSRs in our 454 EST collection, and were able to design high quality PCR primers for nearly a third of these regions (Table 5). Moreover, a large number of these SSRs occur in the protein coding sequences of annotated contigs representing genes of known or predicted identity and function. That most of these SSRs occur in coding sequences (5,678 out of 7,906) and not UTRs is perhaps surprising. Another possibly surprising result is that the majority of SSRs in the coding regions were di-nucleotide repeats (78.4% di-, 19.4% tri-, and 2.1% tetra-nucleotide repeats). These di-nucleotide SSRs were typically short (< 20 bp), and coding regions with di-nucleotide repeats almost always contained more than one SSR, although these were not always in close proximity where they could be considered interrupted repeats.
We were also able to design primers for a large subset of the SSRs located in regions where P. contorta ESTs were aligned to P. taeda ESTs (Table 5). As priming sites should be highly conserved, these markers are likely to be transferable to related species, and may represent a valuable source of genetic markers for the genus Pinus. We were able to successfully amplify a large percentage of polymorphic SSR loci in initial tests with P. contorta and P. ponderosa, validating the quality of our assembled contigs and the utility of the SSRs produced. Our ongoing work is assessing amplification rates, polymorphism levels and cross-species transferability for hundreds of these SSRs across multiple species of pines. The density of SSRs detected in 454 sequencing projects depends on numerous factors including the template sequenced and the criteria used in recognizing SSRs, and has varied greatly across recent studies [5, 59, 60]. Nonetheless, these studies and our own highlight the value of 454 sequencing as a cost and time effective route for rapid SSR discovery.
Because of the deep and redundant coverage produced over many genes, pyrosequencing of cDNA is ideal for SNP discovery and characterization [5, 6, 16]. Although our sampling was limited to four different individuals, high quality SNPs are abundant in our contigs. In addition, most of these SNPs reside in annotated genes, which will allow the identification of reading frame and facilitate more detailed analyses on the significance of molecular variation. The SNP frequency in our contigs (0.6/100 bp) is in the general range of that reported in other studies using 454 pyrosequencing of cDNA pooled from multiple individuals (e.g.,  0.33/100 bp  0.67/100 bp,  0.49/100 bp,  0.72/100 bp). Studies such as these highlight the value of next generation sequencing projects for SNP discovery and characterization. Similar to studies on P. taeda [62, 63] and Picea glauca , the availability of large numbers of SNPs should facilitate population genomic and gene-based association studies in P. contorta.
Pines are among the most ecologically and economically important plant species on Earth. Yet, the size and complexity of their genomes has hindered the development of genomic resources for many taxa in the genus. The 303,480 unique sequences in this 454 EST collection represent a major genomic level resource for P. contorta, and will be useful for comparative genomic studies in pines. This highlights the utility of high-throughput transcriptome sequencing as a fast and cost-effective road to rapidly obtain information on coding genetic variation in pines. Because of large and minimally structured populations, high levels of nucleotide diversity, and rapid decay of linkage disequilibrium, conifers represent excellent subjects for association genetic studies [27, 64]. Nonetheless, the enormous size of the pine genome means that millions of markers will be required for the fine-scale mapping of traits. Consequently, utilizing SNPs occurring in candidate genes or in ESTs is a promising avenue for association genetics in pines [27, 64]. Until recently, association genetic studies for adaptive variation have been limited in conifers (but see [62, 65]) due to the high cost and time required by traditional routes for the development of marker resources. The ability to investigate the genetic basis of adaptive traits in these trees should increase as groups working on pine genomic resources continue to build large EST data sets through next generation sequencing approaches. The thousands of SNP and SSR markers in our 454 ESTs should enable population genomic and gene-based association studies [53, 66]. Such analyses should contribute to understanding patterns of adaptive variation across the genome and to identifying the genetic basis of adaptive traits.
Fresh needles and developing conelets were sampled from four individual P. contorta trees in the Medicine Bow National Forest of south central Wyoming (USA). RNA extraction, cDNA synthesis, and 454 sequencing were performed by staff at the biotechnology company GATC (GATC, Inc.). Total RNA was isolated using a CTAB-based protocol and was further purified with the NucleoSpin RNA XS kit (Macherey and Nagel, Inc.). RNA was pooled from the four individuals for cDNA synthesis. From this pool, poly(A)+ RNA was prepared, and first-strand cDNA synthesis was primed with an N6 randomized primer. 454 adapters A and B (Roche Life Sciences, Inc.) were ligated to the 5' and 3' ends of the cDNA, and the cDNA was PCR-amplified using a proof reading enzyme. One cycle of denaturation and reassociation of the cDNA was used to obtain N1 cDNA. Single stranded cDNA was used for hybridization rather than double stranded cDNA, and normalization of the cDNA template was achieved by separating reassociated double stranded cDNA from the single stranded cDNA by passing the mixture over a hydroxylapatite column [15, 67]. Following hydroxylapatite chromotography, single stranded cDNA was PCR amplified. One μ g of cDNA was sequenced in two half-plate runs on a 454 GS XLR70 Titanium genomic sequencer (Roche, Inc.). Although technical difficulties with the first run resulted in a smaller than expected number of reads, these reads were included in analyses as they were still of high quality. Files containing the sequences and quality scores have been deposited at NCBIs Short Read Archive (accession SRA012089). 454 primer sequences were trimmed from all reads prior to assembly. Because repetitive DNA is a known problem for the assembly of short pyrosequencing reads, and repeats are abundant in conifer genomes, we removed a large number of reads from the full data set to avoid sequences that could interfere with assembly. We discarded reads with simple sequence repeats longer than 50 bp, and a substantial number of reads matching to three well-characterized conifer retrotransposon sequences (see below). We also removed reads with average quality scores less than 18. We used Seqman Ngen (DNAstar, Inc.) to assemble reads into contigs, as this program has been successful in assembling 454 sequences from transcriptomes [5, 68, 69] and other programs created to assemble next generation sequencing data often assemble few EST reads into contigs [14, 17]. The assembly algorithm employed by Ngen is an extension of work by  that allows the incorporation of quality scores for individual nucleotides, and can be specifically parameterized for assembling short pyrosequencing reads. Sequences are placed into a hash table of overlapping subsequences, before sequential assembly of overlapping reads occurs with near constant memory usage (for a detailed explanation of the algorithm see ).
To explore the effects of parameter settings on the outcome of assemblies, we ran de novo assemblies with a range of minimum match percentages (85%, 90% and 95%), match lengths (19, 23, and 25 bp), and gap penalties (30 and 50) (Table 1). Analyses presented in this paper are based on a combination and reference-based and de novo assemblies. We first executed a reference-based assembly using a set of 18,921 unigenes constructed from P. taeda Sanger ESTs http://www.ncbi.nlm.nih.gov/UniGene/UGOrg.cgi?TAXID=3352. For this assembly, we used a minimum match size of 19 nucleotides, match percentage of 88%, mismatch penalty of 18, and gap penalty of 30. We used a lower minimum match percentage here because of substantial divergence between taxa. We utilized this approach not only to facilitate assembly but also to produce a set of contigs containing sequences conserved between P. contorta and P. taeda. These served as a basis for characterizing molecular markers likely to have high cross-species transferability. Reads that did not join into contigs in the reference-based assembly were entered into a de novo assembly with a more stringent minimum match percentage. This assembly was run with a minimum match size of 19 nucleotides, match percentage of 93%, mismatch penalty of 18, and gap penalty of 30. The resulting contigs and remaining singletons were then combined into a single set (Fig. 2).
To assess the coverage and quality of our assembly, we first used local BLASTx  to align both contigs and singletons to the UniRef50 15.4  and the TAIR9 Arabidopsis thaliana  annotated protein databases using an E value threshold of 10-6. BLASTx results were passed through a custom Perl pipeline that summarized information and produced tab-delimited tables with accession numbers, gene name, taxonomic ID, query length, ortholog sequence length, sequence alignment, E value, and bit score for each protein accession with matches in BLAST searches. To estimate the proportion of annotated contigs and singletons that matched to unique genes in the two databases, these files were then filtered for redundancy in protein accessions. Assignment of gene ontology (GO) terms to ESTs with BLASTx matches was then performed by importing the accession numbers for the BLASTx hits to unique proteins into Blast2go (version 2.3.6; http://www.blast2go.org/). Blast2go is an automated tool for the assignment of gene ontology terms to BLAST hits and was designed for use with novel sequence data . We also generated gene ontology assignments for A. thaliana annotated proteins to compare the distribution of functional annotations in P. contorta to that from a plant with a well-characterized transcriptome.
Because preliminary analyses identified a very large number of retrotransposon like sequences in our data, we quantified the presence of retroelements in the raw 454 data and the set of contigs and singletons with BLASTn searches using two approaches. First, we used BLASTn to search for 17 known plant retrotransposon sequences in our entire set of 454 reads and in large EST collections from other selected plant taxa. These retrotransposon sequences represent complete sequences from the NCBI Nucleotide database http://www.ncbi.nlm.nih.gov/entrez?db=nuccore that match queries of 'copia-like' and 'gypsy-like' in plant taxa. Additionally, complete retrotransposon sequences from conifer genomes were selected, including IFG7, PpRT1, and Gymny [32–34]. Ten species were chosen for comparison of retrotransposon abundance on the basis of taxonomic relatedness to our study species and the number of ESTs available. Entire EST collections available in dbEST for these species, as well as our complete set of P. contorta 454 reads, were compared to the 17 retrotransposon sequences using BLASTn with an E value threshold of 10-6. Second, following BLAST annotation of contigs and singletons, we further searched for sequences representing proteins associated with retroelements by identifying sequences that had the terms "copia", "gag", "pol", "retroelement", "integrase", "reverse transcriptase" and "retrotransposon" in their annotation.
We wrote Perl programs to identify SSRs in our contigs and to identify a subset of these that reside in contigs where reads from P. contorta were assembled onto P. taeda unigenes. We located di-, tri-, and tetra-nucleotide SSRs with lengths less than 50 bp and with a minimum of 4 contiguous repeating units, which provided a large number of candidate SSRs. We determined which SSRs occurred in coding sequences of genes by extracting the aligned portions of sequences having BLAST matches to annotated protein coding orthologs, and then using the same algorithm as above to detect SSRs in both the aligned and remaining portions of these contigs. The design of high quality PCR primers is crucial for the development of molecular markers that can be readily usable, and suitable priming sites will not exist for all of the loci we identified. To determine the loci that represent good candidates for PCR amplification, we used the program BatchPrimer3  to construct PCR primers in the flanking regions of SSRs. This program identifies SSRs, allows the control of many parameters to facilitate high quality primer construction, and can simultaneously process thousands of sequences. We designed primers in the flanking regions of SSRs that were a minimum of 12 bp long and used stringent criteria to design genetic markers of desired PCR product length and with a high probability of amplification. We created primers with a minimum GC content of 30%, with a melting temperature between 52 and 62°C and a maximum 4°C difference between primers, and positioned primers to obtain PCR products between 100 and 450 bp long. We also constrained primer construction so that the end of each primer contained a GC clamp (the last two nucleotides were G or C). Additional settings used for primer design are available from the authors upon request.
We tested 96 of these primer sets for succesful SSR amplification in several individuals in P. contorta and P. ponderosa. PCR reactions consisted of 50-100 ng total genomic DNA; 2 pmol of each primer; 0.5 mM each of dATP, dCTP, dGTP, and dTTP; 1× PCR buffer; and 0.4 units of taq polymerase. All polymerase chain reaction amplifications were performed with the following conditions; 94°C for five minutes, followed by 32 cycles of 94°C for one minute, 50°C for one minute, and 72°C for one minute, followed by a final extension step of 72°C for three minutes. We ran 12 μ l of each product out on 1.5% agarose gels stained with ethidium bromide, and scored each individual for successful amplification. A subset of SSR loci that amplified successfully were tagged with flourescently labelled M13 tails, run out on an ABI 3130 genetic analyzer (ABI, Inc.), and genotyped for polymorphism.
We also identified an abundance of SNPs in large contigs with high coverage depths using the SNP reporter feature in Seqman Pro (DNAstar, Inc.). We considered only SNPs with nucleotide variation, and disregarded indels. First, we enumerated and visually inspected high quality SNPs with coverage depth of at least 10× and with an alternate allele in a minimum of 20% of the reads for the 20 contigs containing the largest number of reads. Second, we assessed the presence of high quality SNPs in all contigs containing greater than 25 reads. Here we counted SNPs at sites where coverage depth was at least 8×, and where alternate alleles were present at a minimum frequency of 20%.
We appreciate constructive comments on earlier versions of the manuscript from Zach Gompert and three anonymous reviewers. An NSF DBI award (0701757) to C. A. Buerkle and an NSF DEB award (0344503) to C. W. Benkman supported this work. K. Geist was supported by an NIH INBRE program at the University of Wyoming for training undergraduates in bioinformatics. We thank DNAstar for technical guidance and expertise and for graciously allowing use of their Seqman NGen and Lasergene software packages.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.