- Research article
- Open Access
Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters
- Nathalie Pavy1Email author,
- Charles Paule2,
- Lee Parsons2,
- John A Crow2,
- Marie-Josee Morency3,
- Janice Cooke1, 5,
- James E Johnson2,
- Etienne Noumen1,
- Carine Guillet-Claude1,
- Yaron Butterfield4,
- Sarah Barber4,
- George Yang4,
- Jerry Liu4,
- Jeff Stott4,
- Robert Kirkpatrick4,
- Asim Siddiqui4,
- Robert Holt4,
- Marco Marra4,
- Armand Seguin3,
- Ernest Retzel2,
- Jean Bousquet1 and
- John MacKay1
© Pavy et al; licensee BioMed Central Ltd. 2005
Received: 17 August 2005
Accepted: 19 October 2005
Published: 19 October 2005
The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss).
We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched rice or Arabidopsis genomes. We used several sequence similarity search approaches for assignment of putative functions, including blast searches against general and specialized databases (transcription factors, cell wall related proteins), Gene Ontology term assignation and Hidden Markov Model searches against PFAM protein families and domains. In total, 70% of the spruce transcripts displayed matches to proteins of known or unknown function in the Uniref100 database (blastx e-value < 1e-10). We identified multigenic families that appeared larger in spruce than in the Arabidopsis or rice genomes. Detailed analysis of translationally controlled tumour proteins and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data either based on their occurrence in the cDNA libraries or on functional annotations.
This report illustrates specific approaches for large-scale gene discovery and annotation in an organism that is very distantly related to any of the fully sequenced genomes. The ArboreaSet sequences and cDNA clones represent a valuable resource for investigations ranging from plant comparative genomics to applied conifer genetics.
Genomics projects have been initiated in several pine and spruce species to identify genes involved in traits of economic interest and of ecological significance in conifers. It is unlikely, however, that conifer genomes will be completely sequenced in the near future because of their shear size . For example, estimates of the haploid DNA content of Pinus taeda ranged from 11 pg  to 23.2 pg  and that of Picea glauca ranged between 4.5 pg  to 20.2 pg [PGI5.0; ]. With around 10–20,000 Mb , conifer genomes are more than 100 times larger than that of Arabidopsis and three times larger than the human genome. Such a large genome suggests that strategies that aim at characterizing the coding component of the genome will be more cost efficient for the recovery of information, in the short term.
The large-scale sequencing and analysis of ESTs remain a fundamental part of genomics research to enable gene discovery and annotation in most forest tree species, but especially in conifers. Several EST sequencing projects have been initiated in pines; 191,229 ESTs from several species were assembled to produce 35,053 consensus sequences in the Pinus Gene Index . A large majority of conifer sequences were shown to have sequence similarity to Angiosperm genes or genome sequences like Arabidopsis, however the identification of homologous sequences depends largely on the length of sequences available to conduct similarity searches [8, 9]. In loblolly pine, for example, the majority of contigged sequences which had no sequence similarity to other genomes were very short and more than 90% of sequences above 1 kb in length gave strong matches to Arabidopsis . Therefore, effective annotation of conifer coding sequences through comparative approaches is best achieved with complete information, which may be obtained by combining 3' and 5' sequences or by full length sequencing strategies. A recent investigation of the knox gene family in conifers showed that gene evolution and conifer protein family structure may diverge quite significantly from those of Angiosperm genomes . It is unknown how widespread this phenomenon may be; however, the finding suggests that although conserved protein motifs may be unambiguously identified, the biological role of genes belonging to conifer protein families may not be readily inferred from their Angiosperm homologs. These data would support the argument in favour of thorough cDNA sequencing projects in conifers because they are distantly related to model Angiosperms like Arabidopsis, in order to fully characterize protein families.
Many conifer EST sequencing projects have focused on wood formation and secondary xylem in pines (e.g. due to the ecological significance of the genus and the economic importance of wood [8, 11]). More recently, programs have emerged that involve other species including Douglas-fir  and spruce , and address other important aspects of tree physiology like the response to abiotic stresses or biotic stresses [12, 14]. Macroarrays and microarrays ranging in scope from a few hundred to a few thousand genes have been developed to help identify genes involved in wood formation and to characterize their putative roles in determining wood quality (e.g. in maritime pine , and in loblolly pine ). The relatively high level of sequence similarity between genera within the Pinaceae family has lead to the use of loblolly pine arrays for expression profiling experiments in scots pine, norway spruce  and white spruce . Transcript profiling has also been integrated into investigations of xylem differentiation in poplar , different questions related to wood formation have also been investigated by transcript profiling in Angiosperm trees, including heartwood of black locust trees , tissue differentiation in poplar  and tension wood formation in Eucalyptus .
Spruce is the most widely used genus for forest tree plantations in Canada, with hundreds of million seedlings planted each year . It is also widely divergent from pine [23, 24]. Genetic improvement of spruce species, mainly white and black spruces, has been ongoing in Canada since the 1950s and extensive information has been accumulated on the genetic control of commercially important traits. Genome mapping of spruces is underway to enable molecular breeding applications (e.g. ). Association mapping approaches have been proposed as most promising to identify genes underlying phenotypic variation in quantitative traits, and thus, to support the development of molecular breeding strategies in conifers . Large-scale EST sequencing and analysis are expected to enable association studies and gene mapping research as they are prerequisite steps to identifying SNPs to use in high throughput genotyping assays.
The objective of this study was to produce extensive collections of EST sequences and cDNA clones to support manufacture of cDNA microarrays and gene discovery efforts in white spruce (Picea glauca [Moench] Voss). This collection of ESTs constitutes an important new resource for the genomics of white spruce and related species. In this paper, we report the sequence analysis of around 71,000 sequence reads obtained through 3' and 5' sequencing of cDNAs. Comparative analyses were conducted to assign a functional annotation based upon similarities. Spruce contigs were also correlated with terms derived from the Gene Ontology , and similarity searches were conducted against specialized databases to identify putative transcription factors, cell wall related proteins and protein domains available in PFAM. To mine this new sequence resource, a database called SpruceDB has been developed at the Center of Computational Genomics and Bioinformatics (CCGB, University of Minnesota) , which supports multiple queries on the occurrence of the ESTs in the libraries and on the functional annotations.
Results and discusion
Library development and resulting sequences
Tissue sampling and EST sequencing strategies
Sequencing and quality parameters of white spruce cDNA libraries. Quality reads had a Phred score above 20 over at least 100 bp after vector trimming.
Libraries, treatments and tissues
Number of reads
% >1.6 Kb
Nb of quality reads
% Quality reads
Average length of quality reads (nt)
Male strobili development sequence
Female cones development sequence
Vegetative buds development sequence
Secondary xylem – mature trees
Cambium, phloem – mature trees
Secondary xylem – girdled seedlings
Cambium to bark – girdled seedlings
Elongating root tips – saplings
Primary, secondary shoots-N treatments
Immature somatic embryos
Clean roots systems – N treatments
Clean roots systems – P treatments
Clean roots systems – Diurnal cycle
Root secondary xylem – mature trees
Annual flush shoots diurnal cycle – trees
Needles – N fertilization treatments
We sequenced close to 50,000 cDNA clones, sampling between 1,536 and 6,144 clones from each cDNA library. The library quality assessment data, the number of sequencing reactions, and the number of high quality reads for each library are presented in Table 1. All clones were sequenced from the 3' end ; in addition, 5' sequencing was carried on many clones from the libraries of highest quality or most relevant to our research goals. In total, 71,424 reads were obtained and processed to remove vector and sequences of low quality (Phred score below 20). We thus retained 49,101 quality reads (QR) comprised of at least 100 contiguous nucleotides with a Phred score above 20 (Table 1). Among the quality reads 33.5% were from secondary vascular tissues, 32.2% were from roots, 16.7% from young shoots (all tissues), and the remaining 17.6% were from various organs including male strobili, female cones, buds, somatic embryos, and needles (Table 1).
EST assembly into contigs
Contig groups according to several levels of sequence identity based on 100 nt of overlap
Number of contigs per group
Total number of groups
Spruce and pine EST datasets are populated with allelic variants for many loci because conifers are outbred and highly heterozygous. As a consequence, the number of genes sampled may be estimated more or less accurately from the number of contigs or contig groups, depending upon the parameters that are used for their assembly and clustering. To our knowledge, the impact of assembly parameters has not been directly assessed in conifers or other Gymnosperms. On the other hand, the average nucleotide diversity was reported to be low for conifers [24, 26]; for example, sequence variation was estimated in pines with the average mutation population parameter ? = 0.00407 in Pinus taeda , ? = 0.00241 in P. pinaster, ? = 0.00186 in P. radiata  and ? = 0.0013 in P. sylvestris . These data suggest that the use of stringent criteria were appropriate for the assembly (into contigs) of the spruce sequence dataset comprised in part of allelic sequences. We also defined contig groups with less stringent criteria aiming to evaluate sequence redundancy. We recognize, however, that some contigs may contain paralogs, especially for slow-evolving gene families as discussed in other reports on plant EST clustering [29, 30]. For these reasons, the contig groups are thought to provide a conservative estimate of the number of genes, i.e. the minimum number of genes sequenced.
Sequence comparisons with other species
Sequence comparisons with the pine database and Angiosperm genomes
We found that 84.4% the Arborea transcript set (12,108 transcripts) showed sequence similarity with a contig of the Pine Gene Index (PGI5.0) which contains the largest assembly of publicly available pine ESTs (Figure 3). All of the tblastx searches detected a greater number of matches with PGI5.0 than with the Uniref100 protein database, in which the PGI5.0 consensus sequences are not represented. We examined whether the lack of similarity of the remaining 15.6% spruce transcripts (with no counterpart in the pine database) could be attributed to the non overlap of pine and spruce contigs derived from 5' and 3' sequences, respectively. More than half of the non matching spruce transcripts (9.8% of the total transcripts) were indeed derived only from 3' reads. Therefore, the lack of similarity of many of the sequences is not sufficient to conclude whether a pine homolog is absent from the database. Nonetheless, 6.6% of the spruce transcripts were derived from 5' reads alone (predominantly) or both 5' and 3' reads and, did not match a pine contig. For these sequences, there is a high likelihood that a similar pine transcript has not been sequenced thus far.
As might be expected, the overall sequence similarity was lower with Angiosperms than pine sequences. There were fewer matches and the number of matches decayed more rapidly as we used more stringent e-value cutoffs with tblastx against Angiosperms sequences. At the protein level, 68.4% (9,898) of spruce transcripts matched a sequence from Arabidopsis or rice with a tblastx e-value < 1e-10 and the proportion dropped to 37.6% for highly conserved sequences (e-value < 1e-50) compared to 65.3% with pine. A similar trend was observed with the poplar genome sequence which gave slightly lower similarities than Arabidopsis and rice sets, i.e. 64.3% and 21.6% matches with e-values below 1e-10 and 1e-50, respectively (Figure 3).
Complementarity of the sequencing projects in several species
Comparisons to the poplar genome sequence gave fewer matches and only a small number of matches not identified with Arabidopsis or rice (Figure 4). Only 89.1% of the spruce transcripts (8,823 out of 9,898) that matched an Arabidopsis or rice sequence also had a similarity to a sequence in the poplar genome. Furthermore, 3.5% of the spruce transcripts which lacked similarity to Arabidopsis or rice gave a match against the poplar genome. In the end, sequence similarity searches against the poplar genome only allowed us to annotate 162 additional sequences (0.7% of the spruce transcripts) compared to data derived from comparisons with Arabidopsis or rice. Such a trend is expected given the relatively close proximity between poplar (Salicaceae) and Arabidopsis (Brassicaceae).
The results indicate that data derived from Angiosperms species alone are insufficient for annotating sequences in conifers and that computational tools specifically developed for Gymnosperms are needed to help recognize functional regions in sequences like coding sequences or motifs in around 30% of conifer sequences with no obvious counterpart in Angiosperms. For example, the software Diogenes for predicting open reading frames in sequences was trained based on Pinaceae derived sequences for this purpose .
In total, 10,130 (70%) of the spruce transcripts displayed matches to proteins of known or unknown function, based on the blastx analysis against the Uniref100 database. We conducted Hidden Markov Model (HMM) searches against the PFAM protein family database [37, 38] to evaluate the proportion of the spruce transcripts homologous to families with an assigned function. Overall, we found that 52% of the 14,471 spruce "transcripts" showed similarity with 1,655 PFAM protein families (p-score below 1e-10). There were 157 of these PFAM families annotated as "DUF, Domain of Unknown Function", which showed similarities with 488 transcripts, and 20 families annotated as "UPF, Uncharacterized Protein Family" showing similarities with 45 transcripts. In the end, a total 48% of the spruce transcripts were similar to 1,478 PFAM families when DUFs and UPFs were excluded.
Consensus sequences correlated to terms belonging to the "molecular function" categories of the Gene Ontology
Annotations including electronic annotations
Annotations excluding electronic annotations
Number of consensus sequences
% of the number of annotated consensus sequences
% of the total number of consensus sequences
Number of consensus sequences
% of the number of annotated consensus sequences
% of the total number of consensus sequences
Triplet codon-amino acid adaptor activity
Chaperone regulator activity
Enzyme regulator activity
Nutrient reservoir activity
Translation regulator activity
Signal transducer activity
Obsolete molecular function
Transcription regulator activity
Structural molecule activity
Molecular function unknown
Families of putative transcription factors
Identification of transcripts encoding putative regulatory proteins. Sequences were identified based on HMM searches suported by p-score < 1e-10 with PFAM profiles available for families of regulatory proteins. The PFAM accessions for which no homology was found in SpruceDB through HMM search were not reported.
Number of spruce transcripts
Zinc finger, C3HC4 type (RING finger)
WD, G-beta repeat
AP2 domain-B3 DNA binding domain
HMG (high mobility group) box
MADS Family – SRF-type transcription factor – K-box region
Histone-like transcription factor (CBF/NF-Y) and archaeal histone
PHD finger – CW-type Zinc Finger
No apical meristem (NAM) protein
WRKY DNA-binding domain
bZIP transcription factor – bZIP Maf transcription factor-G-box binding protein MFMR
B-box zinc finger
Helix-loop-helix DNA-binding domain – Myc amino-terminal region
LIM domain family – PET Domain
Dof domain, zinc finger
GATA zinc finger
TCP family transcription factor
CCAAT-HAP2 Family CCAAT-binding transcription factor (CBF-B/NF-YA) subunit B
SBP (Sqamosa-promoter binding protein) floral development
HSF Family (Heat shock protein promoter binding)
EIL Family ethylene insensitive 3
B3 DNA binding domain
ARID/BRIGHT DNA binding domain – ELM2 domain
Cell wall related genes
Many of the libraries that we constructed were derived from secondary vascular tissues from stems or roots, or from whole stems or roots containing primary as well as secondary vascular regions. Therefore, we aimed to classify genes which encode proteins potentially involved in cell wall assembly. As a first step toward this goal, our collection of spruce transcripts was blasted against the sequences from the Cell Wall Navigator Database  [see Additional file 2], comprised of proteins involved in primary cell wall structure and assembly. In total, we found that 708 spruce contigs were similar to sequences of cell wall related proteins, with nearly all of the subclasses represented. We also searched for genes encoding enzymes involved in the biosynthesis monolignol precursors based upon sequence similarity with the set identified in Arabidopsis by Reas et al. , and identified 47 additional contigs (Supplemental data 2).
Redundancy analysis suggests larger size of selected protein families in spruce compared to Angiosperms
The cytoskeleton related TCTP family
The translationally controlled tumour proteins (TCTPs) are anti-apoptotic proteins, named for their preferential synthesis in the early phase of some tumours . They are implicated in both cell growth and division and have been shown to bind to tubulin in the cytoskeleton. In plants, similar proteins were identified in alfafa  and Pharbitis mil .
Pairwise comparison of white spruce consensus sequences related to the translationally controlled tumour proteins (TCTP). Nucleic acid identities were determined using the Smith-Waterman algorithm (water) available in the EMBOSS suite  in a 138 bp region of the 5' UTR immediately upstream of the first codon (ATG), (above the diagonal); and, along the complete sequence of the consensus sequences (under the diagonal). The diagonal shows the contig length.
The SAMS family
Pairwise comparison of white spruce consensus sequences related to the S-adenosylmethionine synthetase (SAMS). Nucleic acid identities were determined using the Smith-Waterman algorithm (water) available in the EMBOSS suite  in a 99 bp region of the 3' UTR immediately downstream the stop codon (above the diagonal) and along the complete sequence of the consensus sequences (under the diagonal). The diagonal shows the contig length.
We analyzed the 8 spruce sams transcripts that encompassed complete protein coding sequences averaging 393 amino acids in length. The predicted proteins were very highly conserved with Angiosperm SAMS. For example, the Arabidopsis SAMS2 protein (locus At4g01850) had a similarity of 88% (345/390a.a) and 90% (354/390 a.a) with the predicted proteins from the spruce contigs 10446 and 10482, respectively. Pairwise comparisons of the spruce coding sequences showed they are highly conserved, yet they could be divided into seven groups of sequences with 66.3% to 91.3% identity (Table 6). We also analyzed the nucleic acid sequence of their 96 bp 3' UTR and found significant variability between groups, with sequence identities varying from 42.5% to 70.7% (Table 6). These results provided a strong indication that these putative sams transcripts represented 7 distinct genes. Protein and nucleic acid sequence comparisons supported the hypothesis that the SAMS proteins form a larger family in the spruce genome than in Arabidopsis and rice. In rice, the presence of two pseudogenes indicated that protein family expansions through duplication events have been followed by gene loss during the evolution. Two sams genes were described in Pinus contorta ; however, large-scale EST sequencing in Pinus taeda  identified 16 consensus sequences, suggesting that the relatively large family size of SAMS in spruce may also apply to pine and other Gymnosperm genomes.
Development of Spruce DB
The relational database, SpruceDB, was created to allow complex queries into the spruce ESTs, assembled consensus sequences and results of similarity analyses. The database can be accessed via web browser . Web-based tools provide facilities for exploration of this information resource. The ESTs or contigs can be retrieved based on library composition and sequence similarities. Web links from the database query pages retrieve the actual EST and contig sequences from the Biodata web pages .
Structure and data sources
In this report, we described a new conifer EST resource derived from 49,101 high quality 5' and 3' reads that were assembled to produce 16,578 consensus sequences averaging 797 nucleotides in length, and representing 14,471 different "transcripts". We estimated the sequencing redundancy at 39% based on the number of consensus sequences represented by more than one cDNA clone. Comparison of the spruce sequences to public sequence datasets from Angiosperms and pine showed that approximately 70% of the sequences had similarity with Arabidopsis, rice or poplar sequences, but 84% matched a pine sequence. The majority of the sequences that did not give a match in pine did not produce a match with any of the Angiosperms either. We used a variety of approaches based on sequence similarity searches to assigned putative functions to the ArboreaSet sequences, including blast searches against general and specialized datasets, GO term assignation, HMM searches against PFAM protein families and domains. These analyses were used for the systematic identification of diverse putative transcription factors, cell wall related enzymes and structural proteins, and revealed a few protein families that are thought to be larger in spruce than in the well-characterized genomes of Arabidopsis and rice.
These comprehensive analyses to enable the annotation of spruce sequences provide critical information to help identify target genes for functional analysis and association studies. Studies are now being planned based on these data to search for DNA polymorphisms underlying the extensive phenotypic variation which occurs in natural and breeding populations. These studies will focus on sequences encoding proteins relevant for adaptation, growth and wood formation for large-scale SNP discovery and genotyping required for association studies and gene mapping. It is therefore essential that we develop databases of annotated coding sequences so that we may rapidly identify and screen the most suitable targets. As a first step toward this goal, the relational database SpruceDB was created to allow complex queries into the spruce ESTs, assembled consensus sequences and results of similarity analyses. By using this EST resource, we have also developed a low redundancy cDNA microarray comprised of 9,690 sequences, which, in combination with multiple sequence annotations, will be a powerful tool to investigate transcriptome modulation in spruces and conifers.
All of the libraries were comprised of a single organ or tissue, and the majority of libraries were developed by pooling samples collected at different points along a time course, along the diurnal cycle, at several stages of differentiation or from different treatments (Supplemental data 2 and ). Treatments known to affect plant physiology were applied to saplings (young trees) aiming to stimulate different transcript profiles. These treatments included N and P fertilization as well as stem girdling. Three libraries were made from whole root systems of very young spruce seedlings, produced through tissue culture, grown in sterile growth media. Most of the libraries were derived from one genotype (pg-653), however four libraries were comprised of two or more genotypes. The secondary xylem collected from saplings (library GQ007) was comprised of the entire sampling of woody tissues collected from seedlings; however, only the differentiating partly-lignified secondary xylem was collected from mature trees as previously described . The secondary xylem tissues were collected by first gently separating the bark from the underlying wood and scraping the soft tissues inward of the cambial area. The secondary phloem of mature trees was collected by gently scrapping the inner surface of the bark with a scalpel blade. All tissue samples were frozen in liquid nitrogen and then stored at -80°C until RNA extraction immediately upon removal from the tree, seedling or tissue culture vessel.
cDNA library construction, quality controls and high-throughput sequencing
We began the construction of each library with 1000 micrograms of total RNA or more, isolated using the method of Chang et al. . Poly A+ RNA was isolated using the PolyATtract mRNA Isolation System (Promega, San Luis Obispo, CA, USA). The polyA+ RNA was treated with methylmercury hydroxide according to the manufacturer's instruction (Stratagene, La Jolla, CA, USA) to relax its secondary structure. Double-stranded cDNA was synthesized from 5 micrograms (µg) of poly A+ selected RNA using a pBluescript II SK (+) XR cDNA Library Construction kit (Stratagene, La Jolla, CA, USA). The reverse transcription step was carried out with either Superscript II or Superscript III and StrataScript (Invitrogen, Burlington, ON, Canada; Stratagene, La Jolla, CA, USA) as described in the manufacturer's instructions. The double stranded cDNAs were fractionated using the Drip column method (Stratagene, La Jolla, CA, USA) or by agarose gel electrophoresis on NuSieve GTG Agarose (Mendel, Guelph, ON, Canada) followed by selective elution of particular-sized cDNA molecules by ß-Agarase I digests according to the manufacturer's instruction (NEB, Pickering, ON, Canada). The size distribution of the resulting double cDNA synthesized in second-strand fractions was visualized by electrophoresis on a 1.4% alkaline agarose gel . The fractions of 600 pb to 1.2 kb and above 1.2 kb were selected, pooled, directionally ligated into the EcoRI and XhoI restriction sites of the pBluescript II SK (+) XR vector (Stratagene, La Jolla, CA, USA), and transformed into E.coli DH10B competent cells (Invitrogen, Burlington, ON, Canada) by electroporation. The library quality assessment used test ligations to determine library titer. We also estimated the proportion of empty vectors as based upon the proportion of blue to white colonies grown on LB agar supplemented with X-GAL/IPTG (Table 1). The average cDNA insert size was determined by PCR screening of 48 to 96 random white colonies (assumed contain plasmids with inserts) per test ligation, followed by determination of the PCR product size by gel electrophoresis. The highest quality libraries were those estimated to have the highest proportion of inserts above 1.6 Kb (Table 1). High-throughput sequencing of libraries was completed using standard methodsas described by Yang et al. .
EST processing and assembly
Sequence traces from the spruce EST libraries were analyzed with the Phred base calling software (version 0.980904) to generate raw sequences . Peaks with Phred quality values less than 20 were considered to be ambiguous and were assigned base N. Quality trimming and vector filtering (with polyA/polyT removal, as appropriate) were done. Processed sequences were then assembled using the base quality files and Phrap (version 0.990329) . Phrap contigs were evaluated for chimeric sequences, and reassembled after removing chimeric reads. The Phrap assembly parameters used were minmatch 50 and minscore 100. Only reads with at least 100 nt of sequence with a quality score above 20 were assembled. EST sequences were submitted to dbEST at the National Center for Biotechnology Information  under accession numbers : [Genbank:CK434215-CK445169] and [Genbank:CO472624-CO490610].
Quality control of consensus sequences
The quality control of resulting consensus sequences used a system developed at the CCGB. This system uses information that is included in the contig ace file generated by Phrap. From the ace file, several important characteristics of a consensus sequence and its member sequences can be determined. The first characteristic used in this process is the "shape" of the consensus sequence, or how the assembled reads overlap each other. This can be thought of as the profile of the consensus sequence member distribution. Consensus sequences are classified as being of block, staircase, or dumbell shape. Contigs with a dumbell shape are candidates for additional evaluation.
Reads within a dumbell shaped contig are evaluated for their similarity to the consensus sequence of the contig. Phrap provides information on the quality regions of assembled sequences, which is used for this step. If the high quality region of the read (as defined by the Phrap ace file) has less than 95% consistency with the consensus sequence of the contig, or has more than 5 mismatched bases relative to the consensus, the read is flagged as a suspected chimera, provided it also shows evidence of either a polyA or polyT region.
The final step of the quality control process is to examine the flagged reads visually to find chimeric qualities. Chimeric reads are selected and removed based on their similarity to the consensus sequence and to the individual reads in the contig. A chimeric read may also be indicated if blast hits to different proteins are found to be adjacent in the read. The process of chimera detection and removal is often repeated numerous times before arriving at a finished assembly.
Sequence comparison and assignment into functional categories
Similarity searches were performed with the tblastx or blastx programs  against the TIGR Gene Indices available for Arabidopsis (AGI11), rice (OGI16) and pine (PGI5.0), retrieved from the TIGR web site  and against one Cycas EST assembly , retrieved from Sputnik web site . Blast searches were conducted against several databases: the NCBI non redundant database (nr), the Uniref100 peptides set , and the Cell Wall Navigator Database . HMM searches were conducted with the PFAM profiles (PFAM release16.0) with the local alignment setting since the spruce consensus sequences are fragmentary sequences. The Arabidopsis and rice coding sequences were downloaded from the TAIR web site  and the Rice Genome Annotation Database from TIGR , respectively.
To correlate the spruce consensus sequences to a Gene Ontology (GO) molecular function term, the annotations of homologous Uniref100 and Arabidopsis proteins were analysed. For each spruce consensus sequence, the blastx hits with a minimum similarity value of 0.75 and a minimum coverage of 0.5 were used in the GO assignment procedure. Similarity was defined as hsp positive/hsp alignment length (hsp : high scoring pair). Coverage was defined as the high scoring pair alignment length × 3/ query length. Among the retained hits, whenever a spruce sequence matched a protein with an associated GO term, this term was transferred to the spruce consensus sequence. Two GO annotation lists were completed: one including evidence codes Inferred from Electronic Annotation (IEA) evidence codes and one excluding IEA evidence.
Funding for this work was provided by Genome Canada and Genome Québec to J.M., A. Sé. and J.B. for the project Arborea. We acknowledge the "Centre de Bioinformatique de l'Université Laval" for bioinformatics support. We also acknowledge H. Bérubé, S. Blais, C. Delisle, S. Forest, V. Roy for their technical assistance and B. Pelgas for her reading of the draft.
- Ahuja MR: Recent advances in molecular genetics of forest trees. Euphytica. 2001, 121: 173-195. 10.1023/A:1012226319449.View ArticleGoogle Scholar
- Dhillon SS: DNA in tree species. Cell and Tissue Culture in Forestry. Edited by: Bonga JM, Durzan DJ. 1987, Martinus Nijhoff Publishers, Dordrecht, 1: 298-313.View ArticleGoogle Scholar
- Wakamiya I, Newton RJ, Price JS: Genome size and environmental factors in the genus Pinus. Am J Bot. 1993, 80: 1235-1241.View ArticleGoogle Scholar
- Rake AW, Miksche JP, Hall RB, Hanson KM: DNA reassociation kinetics for four conifers. Can J Genet Cytol. 1980, 22: 69-79.View ArticleGoogle Scholar
- Ohri D, Khoshoo TN: Genome size in gymnosperms. Plant Syst Evol. 1986, 153: 119-132. 10.1007/BF00989421.View ArticleGoogle Scholar
- Murray BG: Nuclear DNA amounts in gymnosperms. Ann Bot. 1998, 3-15. 10.1006/anbo.1998.0764.Google Scholar
- Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J: The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res. 2001, 29: 159-164. 10.1093/nar/29.1.159.PubMedPubMed CentralView ArticleGoogle Scholar
- Kirst M, Johnson AF, Baucom C, Ulrich E, Hubbard K, Staggs R, Paule C, Retzel E, Whetten R, Sederoff R: Apparent homology of expressed genes from wood-forming tissues of loblolly pine (Pinus taeda L.) with Arabidopsis thaliana. Proc Natl Acad Sci USA. 2003, 100: 7383-7388. 10.1073/pnas.1132171100.PubMedPubMed CentralView ArticleGoogle Scholar
- Pavy N, Laroche J, Bousquet J, Mackay J: Large-scale statistical analysis of secondary xylem ESTs in pine. Plant Mol Biol. 2005, 57: 203-224. 10.1007/s11103-004-6969-7.PubMedView ArticleGoogle Scholar
- Guillet-Claude C, Isabel N, Pelgas B, Bousquet J: The evolutionary implications of knox-I gene duplications in conifers: correlated evidence from phylogeny, gene mapping, and analysis of functional divergence. Mol Biol Evol. 2004, 21: 2232-2245. 10.1093/molbev/msh235.PubMedView ArticleGoogle Scholar
- Allona I, Quinn M, Shoop E, Swope K, St Cyr S, Carlis J, Riedl J, Retzel E, Campbell M, Sederoff R, Whetten RW: Analysis of xylem formation in pine by cDNA sequencing. Proc Natl Acad Sci USA. 1998, 95: 9693-9698. 10.1073/pnas.95.16.9693.PubMedPubMed CentralView ArticleGoogle Scholar
- Dendrome project. [http://dendrome.ucdavis.edu/dfgp/about.html]
- Treenomix project. [http://www.treenomix.com]
- Dubos C, Plomion C: Identification of water-deficit responsive genes in maritime pine (Pinus pinaster Ait.) roots. Plant Mol Biol. 2003, 51: 249-262. 10.1023/A:1021168811590.PubMedView ArticleGoogle Scholar
- Le Provost G, Paiva J, Pot D, Brach J, Plomion C: Seasonal variation in transcript accumulation in wood-forming tissues of maritime pine (Pinus pinaster Ait.) with emphasis on a cell wall glycine-rich protein. Planta. 2003, 217: 820-830. 10.1007/s00425-003-1051-2.PubMedView ArticleGoogle Scholar
- Egertsdotter U, van Zyl LM, MacKay J, Peter G, Kirst M, Clark C, Whetten R, Sederoff R: Gene expression during formation of earlywood and latewood in loblolly pine: expression profiles of 350 genes. Plant Biol. 2004, 6: 654-663. 10.1055/s-2004-830383.PubMedView ArticleGoogle Scholar
- van Zyl L, von Arnold S, Bozhkov P, Chen Y, Egertsdotter U, MacKay J, Sederoff R, Shen J, Zelena L, Clapham D: Heterologous array analysis in Pinaceae: Hybridization of high density arrays of Pinus taeda cDNA with cDNA from needles and embryogenic cultures of P. taeda, P. sylvestris or Picea abies. Function Compar Genomics. 2002, 3: 306-318. 10.1002/cfg.199.View ArticleGoogle Scholar
- Stasolla C, Belmonte MF, van Zyl L, Craig DL, Liu W, Yeung EC, Sederoff R: The effect of reduced glutathione on morphology and gene expression of white spruce (Picea glauca) somatic embryos. J Exp Bot. 2004, 55: 695-709. 10.1093/jxb/erh074.PubMedView ArticleGoogle Scholar
- Hertzberg M, Aspeborg H, Schrader J, Andersson A, Erlandsson R, Blomqvist K, Bhalerao R, Uhlen M, Teeri TT, Lundeberg J, Sundberg B, Nilsson P, Sandberg G: A transcriptional roadmap to wood formation. Proc Natl Acad Sci USA. 2001, 98: 14732-14737. 10.1073/pnas.261293398.PubMedPubMed CentralView ArticleGoogle Scholar
- Yang J, Park S, Kamdem DP, Keathley DE, Retzel E, Paule C, Kapur V, Han KH: Novel gene expression profiles define the metabolic and physiological processes characteristic of wood and its extractive formation in a hardwood tree species, Robinia pseudoacacia. Plant Mol Biol. 2003, 52: 935-956. 10.1023/A:1025445427284.PubMedView ArticleGoogle Scholar
- Paux E, Tamasloukht M, Ladouce N, Sivadon P, Grima-Pettenati J: Identification of genes preferentially expressed during wood formation in Eucalyptus. Plant Mol Biol. 2004, 55: 263-280. 10.1007/s11103-004-0621-4.PubMedView ArticleGoogle Scholar
- Canadian Council of Forest Ministers. [http://nfdp.ccfm.org/]
- Florin R: The distribution of conifer and taxad genera in time and space. Acta Horti Bergiani. 1963, 20: 121-312.Google Scholar
- Bouillé M, Bousquet J: Trans-species shared polymorphisms at orthologous nuclear gene loci among distant species in the conifer Picea (Pinaceae): Implications for the long-term maintenance of genetic diversity in trees. Am J Bot. 2005, 92: 63-73.PubMedView ArticleGoogle Scholar
- Pelgas B, Bousquet J, Beauseigle S, Isabel N: A composite linkage map from two crosses for the species complex Picea mariana [Mill.] B.S.P x Picea rubens (Sarg.) and analysis of synteny with other Pinaceae. Theor Applied Genetics. 2005,Google Scholar
- Neale DB, Savolainen O: Association genetics of complex traits in conifers. Trends Plant Sci. 2004, 9: 325-330. 10.1016/j.tplants.2004.05.006.PubMedView ArticleGoogle Scholar
- Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11: 1425-1433. 10.1101/gr.180801.View ArticleGoogle Scholar
- SpruceDB. [http://ccgb.umn.edu/Pub_SpruceDB/]
- Vettore AL, da Silva FR, Kemper EL, Souza GM, da Silva AM, Ferro MI, Henrique-Silva F, Giglioti EA, Lemos MV, Coutinho LL, Nobrega MP, Carrer H, Franca SC, Bacci Junior M, Goldman MH, Gomes SL, Nunes LR, Camargo LE, Siqueira WJ, Van Sluys MA, Thiemann OH, Kuramae EE, Santelli RV, Marino CL, Targon ML, Ferro JA, Silveira HC, Marini DC, Lemos EG, Monteiro-Vitorello CB, Tambor JH, Carraro DM, Roberto PG, Martins VG, Goldman GH, de Oliveira RC, Truffi D, Colombo CA, Rossi M, de Araujo PG, Sculaccio SA, Angella A, Lima MM, de Rosa Junior VE, Siviero F, Coscrato VE, Machado MA, Grivet L, Di Mauro SM, Nobrega FG, Menck CF, Braga MD, Telles GP, Cara FA, Pedrosa G, Meidanis J, Arruda P: Analysis and functional annotation of an expressed sequence tag collection for tropical crop sugarcane. Genome Res. 2003, 13: 2725-2735. 10.1101/gr.1532103.PubMedPubMed CentralView ArticleGoogle Scholar
- Forment J, Gadea J, Huerta L, Abizanda L, Agusti J, Alamar S, Alos E, Andres F, Arribas R, Beltran JP, Berbel A, Blazquez MA, Brumos J, Canas LA, Cercos M, Colmenero-Flores JM, Conesa A, Estables B, Gandia M, Garcia-Martinez JL, Gimeno J, Gisbert A, Gomez G, Gonzalez-Candelas L, Granell A, Guerri J, Lafuente MT, Madueno F, Marcos JF, Marques MC, Martinez F, Martinez-Godoy MA, Miralles S, Moreno P, Navarro L, Pallas V, Perez-Amador MA, Perez-Valle J, Pons C, Rodrigo I, Rodriguez PL, Royo C, Serrano R, Soler G, Tadeo F, Talon M, Terol J, Trenor M, Vaello L, Vicente O, Vidal Ch, Zacarias L, Conejero V: Development of a citrus genome-wide EST collection and cDNA microarray as resources for genomic studies. Plant Mol Biol. 2005, 57: 375-391. 10.1007/s11103-004-7926-1.PubMedView ArticleGoogle Scholar
- Kawai J, Shinagawa A, Shibata K, Yoshino M, Itoh M, Ishii Y, Arakawa T, Hara A, Fukunishi Y, Konno H, Adachi J, Fukuda S, Aizawa K, Izawa M, Nishi K, Kiyosawa H, Kondo S, Yamanaka I, Saito T, Okazaki Y, Gojobori T, Bono H, Kasukawa T, Saito R, Kadota K, Matsuda H, Ashburner M, Batalov S, Casavant T, Fleischmann W, Gaasterland T, Gissi C, King B, Kochiwa H, Kuehl P, Lewis S, Matsuo Y, Nikaido I, Pesole G, Quackenbush J, Schriml LM, Staubli F, Suzuki R, Tomita M, Wagner L, Washio T, Sakai K, Okido T, Furuno M, Aono H, Baldarelli R, Barsh G, Blake J, Boffelli D, Bojunga N, Carninci P, de Bonaldo MF, Brownstein MJ, Bult C, Fletcher C, Fujita M, Gariboldi M, Gustincich S, Hill D, Hofmann M, Hume DA, Kamiya M, Lee NH, Lyons P, Marchionni L, Mashima J, Mazzarelli J, Mombaerts P, Nordone P, Ring B, Ringwald M, Rodriguez I, Sakamoto N, Sasaki H, Sato K, Schonbach C, Seya T, Shibata Y, Storch KF, Suzuki H, Toyo-oka K, Wang KH, Weitz C, Whittaker C, Wilming L, Wynshaw-Boris A, Yoshida K, Hasegawa Y, Kawaji H, Kohtsuki S, Hayashizaki Y, RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium: Functional annotation of a full-length mouse cDNA collection. Nature. 2001, 409: 685-690. 10.1038/35055500.PubMedView ArticleGoogle Scholar
- Whitfield Ch.W, Band MR, Bonaldo MF, Kumar ChG, Liu L, Pardinas JR, Robertson HM, Soares MB, Robinson GE: Annotated expressed sequence tags and cDNA microarrays for studies of brain and behavior in the honey bee. Genome Res. 2002, 12: 555-566. 10.1101/gr.5302.PubMedPubMed CentralView ArticleGoogle Scholar
- Brown GR, Gill GP, Kuntz RJ, Langley CH, Neale DB: Nucleotide diversity and linkage disequilibrium in loblolly pine. Proc Natl Acad Sci USA. 2004, 101: 15255-15260. 10.1073/pnas.0404231101.PubMedPubMed CentralView ArticleGoogle Scholar
- Pot D, McMillan L, Echt C, Le Provost G, Garnier-Géré P, Cato S, Plomion C: Nucleotide variation in genes involved in wood formation in two pine species. New Phytologist. 2005, 167: 101-112. 10.1111/j.1469-8137.2005.01417.x.PubMedView ArticleGoogle Scholar
- Garcia-Gil MR, Mikkonen M, Savolainen O: Nucleotide diversity at two phytochrome loci along a latitudinal cline in Pinus sylvestris. Mol Ecol. 2003, 12: 1195-1206. 10.1046/j.1365-294X.2003.01826.x.PubMedView ArticleGoogle Scholar
- Crow JA: Diogenes – Reliable prediction of protein-encoding regions in short genomic sequences. 2005, [http://analysis.ccgb.umn.edu/diogenes]Google Scholar
- PFAM database. [http://www.sanger.ac.uk/Software/Pfam/]
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam Protein Families Database. Nucleic Acids Res. 2004, D138-D141. 10.1093/nar/gkh121.Google Scholar
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LL: The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005, D154-D159.Google Scholar
- Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003, 31: 224-228. 10.1093/nar/gkg076.PubMedView ArticleGoogle Scholar
- Pine Gene Index PGI5.0 database. [http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=pine]
- Davuluri RV, Sun H, Palaniswamy SK, Matthews N, Molina C, Kurtz M, Grotewold E: AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics. 2003, 4: 25-10.1186/1471-2105-4-25.PubMedPubMed CentralView ArticleGoogle Scholar
- Girke T, Lauricha J, Tran H, Keegstra K, Raikhel N: The Cell Wall Navigator database. A systems-based approach to organism-unrestricted mining of protein families involved in cell wall metabolism. Plant Physiol. 2004, 136: 3003-3008. 10.1104/pp.104.049965.PubMedPubMed CentralView ArticleGoogle Scholar
- Raes J, Rohde A, Christensen JH, Van de Peer Y, Boerjan W: Genome-wide characterization of the lignification toolbox in Arabidopsis. Plant Physiol. 2003, 133: 1051-1071. 10.1104/pp.103.026484.PubMedPubMed CentralView ArticleGoogle Scholar
- Li F, Zhang D, Fujise K: Characterization of fortilin, a novel antiapoptotic protein. J Biol Chem. 2001, 276: 47542-47549. 10.1074/jbc.M108954200.PubMedView ArticleGoogle Scholar
- Pay A, Heberle-Bors E, Hirt H: An alfalfa cDNA encodes a protein with homology to translationally controlled human tumor protein. Plant Mol Biol. 1992, 19: 501-503. 10.1007/BF00023399.PubMedView ArticleGoogle Scholar
- Sage-Ono K, Ono M, Harada H, Kamada H: Dark-induced accumulation of mRNA for a homolog of translationally controlled tumor protein (TCTP) in Pharbitis. Plant Cell Physiol. 1998, 39: 357-360.PubMedView ArticleGoogle Scholar
- Campbell M, Sederoff RR: Variation in lignin content and composition. Plant Physiol. 1996, 110: 3-13.PubMedPubMed CentralGoogle Scholar
- Peleman J, Saito K, Cottyn B, Engler G, Seurinck J, Van Montagu M, Inze D: Structure and expression analyses of the S-adenosylmethionine synthetase gene family in Arabidopsis thaliana. Gene. 1989, 84: 359-369. 10.1016/0378-1119(89)90510-6.PubMedView ArticleGoogle Scholar
- Shen B, Li C, Tarczynski MC: High free-methionine and decreased lignin content result from a mutation in the Arabidopsis S-adenosyl-l-methionine synthetase 3 gene. Plant J. 2002, 29: 371-380. 10.1046/j.1365-313X.2002.01221.x.PubMedView ArticleGoogle Scholar
- Sanchez-Aguayo I, Rodriguez-Galan JM, Garcia R, Torreblanca J, Pardo JM: Salt stress enhances xylem development and expression of S-adenosyl-L-methionine synthase in lignifying tissues of tomato plants. Planta. 2004, 220: 278-285. 10.1007/s00425-004-1350-2.PubMedView ArticleGoogle Scholar
- Espartero J, Pintor-Toro JA, Pardo JM: Differential accumulation of S-adenosylmethionine synthetase transcripts in response to salt stress. Plant Mol Biol. 1994, 25: 217-227. 10.1007/BF00023239.PubMedView ArticleGoogle Scholar
- Schröder G, Eichel J, Breining S, Schröder J: Three differentially expressed S-adenosylmethionine synthetase from Catharantus roseus : molecular and functional characterization. Plant Mol Biol. 1997, 33: 211-222. 10.1023/A:1005711720930.PubMedView ArticleGoogle Scholar
- Gramene database. [http://www.gramene.org/]
- Lindroth AM, Saarikoski P, Flygh G, Clapham D, Gronroos R, Thelander M, Ronne H, von Arnold S: Two S-adenosylmethionine synthetase-encoding genes differentially expressed during adventitious root development in Pinus contorta. Plant Mol Biol. 2001, 46: 335-346. 10.1023/A:1010637012528.PubMedView ArticleGoogle Scholar
- CCGB biodata database. [http://ccgb.umn.edu/biodata/]
- Lamblin AF, Crow JA, Johnson JE, Silverstein KA, Kunau TM, Kilian A, Benz D, Stromvik M, Endre G, VandenBosch KA, Cook DR, Young ND, Retzel EF: MtDB: a database for personalized data mining of the model legume Medicago truncatula transcriptome. Nucleic Acids Res. 2003, 31: 196-201. 10.1093/nar/gkg119.PubMedPubMed CentralView ArticleGoogle Scholar
- Arborea project. [http://www.arborea.ulaval.ca/en]
- Chang S, Puryear J, Cairney J: A simple and efficient method for isolating RNA from pine trees. Plant Mol Biol Rep. 1993, 11: 113-116.View ArticleGoogle Scholar
- Sambrook J, Fritsch EF, Maniatis T: Molecular Cloning: A Laboratory Manual. 1989, Plainview, NY: Cold Spring Harbor Laboratory Press, 2Google Scholar
- Yang GS, Stott JM, Smailus D, Barber SA, Balasundaram M, Marra MA, Holt RA: High-throughput sequencing: a failure mode analysis. BMC Genomics. 2005, 6: 2-10.1186/1471-2164-6-2.PubMedPubMed CentralView ArticleGoogle Scholar
- Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.PubMedView ArticleGoogle Scholar
- Phrap software. [http://www.phrap.org]
- National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov/]
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView ArticleGoogle Scholar
- The Institute for Genomic Research. [http://www.tigr.org/]
- Brenner ED, Stevenson DW, McCombie RW, Katari MS, Rudd SA, Mayer KF, Palenchar PM, Runko SJ, Twigg RW, Dai G, Martienssen RA, Benfey PN, Coruzzi GM: Expressed sequence tag analysis in Cycas, the most primitive living seed plant. Genome Biol. 2003, 4: R78-10.1186/gb-2003-4-12-r78.PubMedPubMed CentralView ArticleGoogle Scholar
- Sputnik database. [http://sputnik.btk.fi/]
- The Arabidopsis Information Resource. [http://www.arabidopsis.org]
- Rice Genome Annotation Database. [http://www.tigr.org/tdb/e2k1/osa1/]
- EMBOSS. [http://emboss.sourceforge.net/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.