Generation and analysis of expressed sequence tags from six developing xylem libraries in Pinus radiata D. Don

Background Wood is a major renewable natural resource for the timber, fibre and bioenergy industry. Pinus radiata D. Don is the most important commercial plantation tree species in Australia and several other countries; however, genomic resources for this species are very limited in public databases. Our primary objective was to sequence a large number of expressed sequence tags (ESTs) from genes involved in wood formation in radiata pine. Results Six developing xylem cDNA libraries were constructed from earlywood and latewood tissues sampled at juvenile (7 yrs), transition (11 yrs) and mature (30 yrs) ages, respectively. These xylem tissues represent six typical development stages in a rotation period of radiata pine. A total of 6,389 high quality ESTs were collected from 5,952 cDNA clones. Assembly of 5,952 ESTs from 5' end sequences generated 3,304 unigenes including 952 contigs and 2,352 singletons. About 97.0% of the 5,952 ESTs and 96.1% of the unigenes have matches in the UniProt and TIGR databases. Of the 3,174 unigenes with matches, 42.9% were not assigned GO (Gene Ontology) terms and their functions are unknown or unclassified. More than half (52.1%) of the 5,952 ESTs have matches in the Pfam database and represent 772 known protein families. About 18.0% of the 5,952 ESTs matched cell wall related genes in the MAIZEWALL database, representing all 18 categories, 91 of all 174 families and possibly 557 genes. Fifteen cell wall-related genes are ranked in the 30 most abundant genes, including CesA, tubulin, AGP, SAMS, actin, laccase, CCoAMT, MetE, phytocyanin, pectate lyase, cellulase, SuSy, expansin, chitinase and UDP-glucose dehydrogenase. Based on the PlantTFDB database 41 of the 64 transcription factor families in the poplar genome were identified as being involved in radiata pine wood formation. Comparative analysis of GO term abundance revealed a distinct transcriptome in juvenile earlywood formation compared to other stages of wood development. Conclusion The first large scale genomic resource in radiata pine was generated from six developing xylem cDNA libraries. Cell wall-related genes and transcription factors were identified. Juvenile earlywood has a distinct transcriptome, which is likely to contribute to the undesirable properties of juvenile wood in radiata pine. The publicly available resource of radiata pine will also be valuable for gene function studies and comparative genomics in forest trees.


Background
Radiata pine (Pinus radiata D. Don) is the dominant forest plantation species for the sawmill industry in Australia, New Zealand, Chile and some other countries. Breeding programs in radiata pine have been conducted in Australia since the early 1950s. The first generation of breeding increased the growth rate by 33%, thus reduced the rotation period to 27-30 yrs from the previous 40-45 yrs [1]. Consequently, the faster growth rate resulted in a large proportion (30-50%) of juvenile wood in the harvested logs [2,3]. Juvenile wood has a number of undesirable wood properties [1,3] and its higher proportion in the harvested logs reduces the value of timber products. Improving juvenile wood quality and reducing its proportion have been identified as the priorities in the next generation breeding program of radiata pine. Understanding wood formation at the molecular level would underpin more efficient breeding strategies for the improvement of juvenile wood.
Genomics approaches have been applied to explore the molecular basis of growth and development in forest tree species. Expressed sequence tags (ESTs) and microarray gene expression studies have been carried out in poplar, loblolly pine, spruce and eucalypts [4][5][6][7][8][9][10][11][12][13][14][15]. Due to the economic value of wood, all forest genomic projects were primarily focused on the transcriptional regulation of wood formation (xylogenesis). Xylogenesis is initiated in the vascular cambium and proceeded from cell division to expansion, secondary wall formation, lignification, and finally programmed cell death [5,12]. Large numbers of xylogenesis ESTs from forest tree species have been deposited in public databases, including 59,797 ESTs from loblolly pine, 25,218 ESTs from poplars, 16,430 ESTs from white spruce and 52,330 ESTs from sitka spruce (extracted from [6,8,14,16], respectively). Furthermore, the genome of Populus trichocarpa (~550 Mbp) has been published [17], and efforts to sequence the Eucalyptus grandis genome (~640 Mbp) are in progress [18]. However, the large conifer genome (~20,000 Mbp) [19] is unlikely to be sequenced in the near future, thus EST sequencing remains an important approach for gene discovery in conifers.
Despite the commercial importance of radiata pine in many countries, little genomic research has been done for this species compared to loblolly pine, Populus, spruce, maritime pine and eucalypts. As of January 20, 2009, only 151 radiata pine ESTs appear in the NCBI GeneBank (dbEST), and no unigene information is available. Recently, 455 genes were observed to be differentially expressed in the base to the crown of the radiata pine trees using modified differential display [20]. Gene expression in the early embryogenesis of Pinus radiata was studied using the cDNA-AFLPs strategy, which revealed a total of 82 up-or down-regulated transcript-derived fragments (TDFs) [21]. Development of juvenile and mature wood in radiata pine is still poorly characterised at the genomics level.
We applied genomics approaches to investigate the transcriptional regulation of xylogenesis in radiata pine, with a focus on juvenile wood formation. Six developing xylem cDNA libraries were constructed from earlywood and latewood tissues collected from juvenile (7 yrs), transition (11 yrs) and mature (30 yrs) trees, respectively. The sampled xylem tissues represent the major stages of wood development in a typical rotation period of radiata pine. A total of 6,389 high quality xylogenesis ESTs were sequenced from 5,952 cDNA clones and assembled into 3,304 unigenes. Here we report the generation and analysis of a genomic resource for wood formation in radiata pine.

EST sequencing and assembly
In total, 6,389 high quality ESTs of at least 100 bp in length were collected from approximately 8,000 raw sequences. The 6,389 ESTs were sequenced from 5,952 different cDNA clones in six developing xylem cDNA libraries. Average size of all 6,389 ESTs and the 5,952 ESTs from 5' end sequences is 624 bp and 636 bp, respectively. Of the 5,952 ESTs, 86.4% and 69.6% are greater than 300 bp and 500 bp in length, respectively. The number of ESTs in each library ranged from 694 in earlywood at transition age to 1,636 in latewood at juvenile age (Table 1). Juvenile wood has the highest proportion of ESTs due to the focus of this study. The assembly of all 5,952 ESTs from 5' end sequences generated 3,304 xylogenesis unigenes, including 952 contigs (28.8%) and 2,352 singletons (71.2%) ( Table 1). The 3,304 unigenes have an average length of 702 bp; 41.0% are more than 800 bp, and only 13.5% are less than 300 bp. Of the 952 contigs, 28.4% (270) have four or more transcripts ( Figure 1) and the three deepest contigs included 69-79 transcripts. Since the ESTs and unigenes were derived from a total of seven different genotypes of radiata pine, these deep contigs provide an opportunity to identify single nucleotide polymorphisms (SNPs). The number of ESTs forming a contig reflects the level of EST redundancy in the cDNA libraries. The assembly of 5,952 ESTs from 5' end sequences revealed 47.8% of redundancy in the radiata pine EST collection (Table 1). However, the EST redundancy in each library varies from 14.2% to 44.4%, with an average redundancy of 28.8% (Table 1). The redundancy may increase significantly during further EST sequencing. By comparison, the EST redundancy in the radiata pine EST collection is slightly higher than the estimated redundancy of 28% in Populus (13.5 k microarrays) [22] and of 39% in white spruce (50,000 ESTs) [14].

Functional annotation and classification
Blast searches of the 5,952 ESTs against the NCBI nr database revealed 73.4% and 69.0% matches with tblastx and blastx (E-value ≤ 10 -5 ), respectively. However, 71.3% of all matches with blastx are either unknowns (47.5%), unnamed (13.6%), hypothetical (5.4%) or predicted proteins (4.7%). When using the UniProt database with blastx (E-value ≤ 10 -5 ) and TIGR database with blastn (Evalue ≤ 10 -15 ), 97.6% of the 5,952 ESTs have homologs and only 20.5% of all matches are unknowns or uncharacterized proteins. Of the 139 ESTs (2.3%) with no matches in the UniProt and TIGR databases, 41 ESTs are greater than 500 bp in length and remain as singletons after assembly, thus some of which are likely to represent putative novel ESTs in radiata pine wood formation. Based on the Pfam known protein family database, 52.1% of the ESTs have homologs with blastx (E-value ≤ 10 -5 ) and were classified into 772 protein families. Nearly half of the ESTs (47.9%) did not match the known protein families in the Pfam database, thus they were regarded as unknown protein families.
Of the 3,304 xylogenesis unigenes, 68.1% have matches in the NCBI nr database with blastx, however 77.3% of all matches are unknowns or uncharacterized proteins. In contrast, a total of 96.1% of the unigenes matched sequences in the UniProt (with blastx) and TIGR (with blastn) databases and only 42.9% of all matches were not assigned GO terms ( Table 2). The results blasted with unigenes are similar to those with ESTs. In the functional classification with GO terms, 89.1% of the 1,813 unigenes with assigned GO terms have molecular functions, 74.6% are involved in a biological process, and 47.8% are cellular components. The three categories of GO terms fell predominantly into one or two sub-categories ( Figure 2). In the molecular function category with 1,616 unigenes 56.4% and 57.2% have binding and catalytic activity, respectively. Of the 867 unigenes in the cellular component, 98.9% are related to cell components. As for the 1,353 unigenes involved in biological process, 87.7% and 96.2% have functions in cellular process and physiological process, respectively.
The functional annotation of 3,304 unigenes using the UniProt database revealed 2,101 (63.6%) matches, which is slightly lower than the matches using the NCBI nr database. Of the 2,101 matched unigenes 1,813 were assigned with GO terms and matched 1,582 different UniProt accession numbers, which suggests that possibly 1,582 non-redundant genes with known functions have been identified in wood formation of radiata pine. Significant

Num ber of Contigs
GO term enrichments of the xylogenesis related genes in radiata pine (Table 3) were revealed using the DAVID Bioinformatics Resources 2008 [23,24]. The mostly enriched GO terms in biological process (BP) include cellular protein metabolic process, transport, cellular component organization and biogenesis, and catabolic process. In cellular component (CC), cytoplasmic part, macromolecular complex, and non-membrane-bound organelle are the mostly enriched terms. While in the molecular function (MF), nucleotide binding, transporter activity, peptidase activity, and structural molecular activity are highly enriched. Further DAVID functional classification revealed 13 functional groups with an enrichment score of at least 1.26 (Additional file 1). The most significant functional group is a cluster of 11 genes showing a peptidase function.

Cell wall biosynthesis genes
In the 5,952 ESTs, only 6.8% have homologs (blastx, Evalue ≤ 10 -5 ) in the Cell Wall Navigator, a primary wall gene database of Arabidopsis [25]. However, all 18 categories of primary and secondary wall genes in the MAIZE-WALL database [26] were represented in the radiata pine EST resource, including 1,070 ESTs classified into 91 cell wall gene families (Additional file 2). Therefore, genes related to secondary cell walls are highly accumulated in the radiata pine EST resource. Based on the Pfam database, 772 known protein families were identified in radiata pine. The 30 most highly abundant families included 11 protein families related to cell wall biosynthesis (Table  5). Unsurprisingly, cellulose_synt is the most abundant family with 108 ESTs. Other abundant protein families included actin, tubulin, tubulin_C, meth_synt_2, methltransf_3, peroxidase, S-AdoMet_synt_C, glyco_hydro_19 and glyco_hydro_9 (Table 5). These abundant families included many known proteins associated with cell wall formation, such as CesAs, actins, tubulins (α and β), methionine synthases, MetE, CCoAOMT, peroxidases, SAMS, chitinases and cellulases. However, AGPs are not recognized as proteins in the Pfam database, thus they are not listed in the abundant protein families. The most abundant protein families here are broadly consistent with the most abundant genes based on UniProt and TIGR databases ( Table 4).

Identification of transcription factors
PlantTFDB, a recently developed database of transcription factor (TF) families for 22 plant species [28], was used to identify putative transcription factors expressed in radiata pine wood formation. Blastx searches revealed 358 ESTs (assembled into 284 unigenes) of radiata pine with matches in PlantTFDB at E-value ≤ 10 -5 . These homologs *: Annotation was based on blastx with unigenes against UniProt; for the no hits a further blastn was performed against the TIGR database; functional classification was based on GO terms.
Functional classification for the 1,843 unigenes which were assigned with GO terms Figure 2 Functional classification for the 1,843 unigenes which were assigned with GO terms. Among 3,304 unigenes from radiata pine xylogenesis, 1,813 unigenes were functionally assigned with GO terms as molecular function (A), cellular component (B) and biological process (C). For each of these GO terms, functions were further assigned with sub-category of GO terms. The total percentage of functional categories or sub-category may be over 100% due to possibly multiple functions from some unigenes. Trihelix. The higher abundance of PHD, C3H, MYB, HMG, WRKY, NAC, HB and bZIP in radiata pine xylogenesis was also observed in white spruce [14].

Transcriptome reorganization during wood development
The ESTs and unigenes from different libraries represent the transcriptome in the respective wood development stages. We compared the transcriptomes from juvenile earlywood, juvenile latewood, mature earlywood and mature latewood tissues. These four tissues were collected from trees growing within 50 m of each other, and at the same date for earlywood or latewood to minimize the environmental effects. Comparative transcriptome analysis highlighted 306 and 150 GO terms showing significant differences (P-value < 0.01) in terms of EST and unigene abundance, respectively. The hierarchical clustering dendrogram trees from both ESTs and unigenes revealed similar patterns of transcriptome reorganization during wood formation in radiata pine ( Figure 3A, B), which suggested that juvenile earlywood has a distinct transcriptome compared to the other three developmental stages, whereas the transcriptome of juvenile latewood is more conserved with mature earlywood and latewood. The distinct gene expression in juvenile earlywood is likely associated with the unique properties of juvenile wood.
Comparative analysis of protein family revealed some known protein families more abundant in particular libraries (Additional file 3). Pectinesterase, pfkB family carbohydrate kinase, UDP-glucose/GDP-mannose dehydrogenase, homeobox domain, UDP-glucosyl transferase, profilin, thioredoxin, and plastocyanin-like domain were more abundant in earlywood. In contrast, CesA, SAMS, TCTP, tubulin, dehydrin, metallothionein, protein tyrosine kinase, aminotransferase, and WD domain protein  GO term enrichments in different stages of wood development were also revealed using DAVID Bioinformatics Resources 2008 (Additional file 5). Specifically enriched GO terms in earlywood included: cellular component organization and biogenesis, membrane, protein complex, catalytic activity, hydrolase activity and transporter activity. Latewood specifically enriched terms included: cellular macromolecule metabolic process, protein modification process, cytoplasm, nucleotide binding, and protein kinase activity. The number of specifically enriched GO terms in juvenile wood is about two times as in mature wood, suggesting more unique gene expression in juvenile wood than mature wood (Additional file 5). The most enriched specific GO terms in juvenile wood included: cellular macromolecule metabolic process, intracellular organelle part, protein complex, hydrolase activity, and nucleotide binding; while in latewood cellu-

Discussion
The 6,389 xylogenesis ESTs and 3,304 unigenes described here form a significant genomic resource for radiata pine.
Being derived from major stages of wood development in a typical rotation period, this resource represents a broad xylogenesis transcriptome in the species. The highest proportion of ESTs is derived from juvenile wood and will assist in understanding transcriptional regulation at this stage. Most families of genes involved in secondary cell wall development are represented in the EST resource.

Gene expression in secondary wall formation
Lignin biosynthesis is a key developmental feature that distinguishes secondary from primary cell walls. The synthesized monolignols are transported to the apoplast, where polymerisation starts in the middle lamella and cell corners [29]. Most genes in the monolignol pathway are represented with high or moderate abundance in radiata pine xylogenesis including SAMS, MetE,laccase, CCoAOMT, C4H, 4CL,methionine synthase,peroxidase,CAD, C3H, COMT and cytochrome c oxidase. A total of 46 radiata pine ESTs (four contigs and two singletons) were found to encode SAMS protein. Lignin precursors are extensively methylated in the S-adenosyl methionine (SAM) dependent reaction [30]. In white spruce seven SAMS genes were previously identified [14]. Some other lignin biosynthetic genes were also identified at lower abundance including CCR, PAL and dirigent-like. The dirigent protein was presumed to act as a template for lignin polymerization [31]. The presence and abundance of most lignin biosynthetic genes in the radiata pine EST collection suggests active secondary cell wall biosynthesis in the xylem tissues sampled.
Cellulose is deposited in both primary and secondary cell walls, but is much more abundant in secondary walls.
CesA is the most abundant gene in the radiata pine EST collection with 175 ESTs representing six of the eight previously isolated PrCesA genes [32]. A phylogenetic tree of plant CesAs (Figure 4) constructed using Muscle [33] and MEGA 4 [34] revealed that five of the six PrCesA proteins identified in this study are most homologous to the secondary wall CesAs of Arabidopsis [35]. The five PrCesAs are also most homologous to the secondary wall CesAs of Populus [36][37][38] and loblolly pine [39]. Interestingly, the five PrCesAs cover all three sub groups of plant secondary wall CesAs represented by three Arabidopsis secondary wall CesAs (AtCesA4, AtCesA7 and AtCesA8), respectively, suggesting the functional conservation of secondary wall CesAs across angiosperm and gymnosperm taxa. A single EST encoding PrCesA10 was clustered with primary cell wall CesAs, suggesting primary wall PrCesA genes are poorly presented in the radiata pine EST resource. Cellulases and SuSy are also highly expressed during radiata pine wood formation. Cellulase may be part of the cellulose synthesizing complex and required for wall assembly and elongation [40]. A membrane-bound cellulase isoenzyme is up-regulated in poplar secondary walls [5]. In Scots pine SuSy activity was observed to peak in the zone of maturing tracheids where the secondary wall is formed, and its expression was lowest in primary wall tissues [41].
Hemicelluloses and pectin form a large group of heteropolysaccharides composed of D-xylose, L-arabinose, Lrhamnose, L-fucose, D-mannose, D-galactose, D-galacturonate, or D-glucose. In the radiata pine EST collection we identified genes involved in the synthesis and degradation of hemicellulose and pectin biosynthesis, including 21 ESTs (three contigs) of UDP-glucose dehydrogenase. UDP-D-glucuronate synthesis is the rate-limiting step for the biosynthesis of both hemicellulose and pectin [42]. We also identified four radiata pine ESTs of xyloglucan endotransglycosylases (XETs). XETs can cut and rejoin xyloglucan (XG) chains, and are believed to be important regulators of primary cell wall expansion [43]. One radiata pine XET is most homologous to Arabidopsis XTH8, which belongs to the same class as PttXET16A, a poplar secondary wall XET [44]. The considerable involvement of XETs in secondary walls is likely in the breakage and reconnection of linkages between adjacent microfibrils soon after their synthesis.
AGPs, AGP-like proteins and FLAs are highly abundant in the radiata pine EST collection with a total of 89 ESTs. The ortholog of loblolly pine PtaAGP4 is the most abundant AGP in radiata pine with 45 ESTs. Other identified radiata pine AGPs include PrAGP5, PrAGP6, PrAGP-like and PrFLAs (PrFLA1, 8, 10, 12, 16, 17 and 26). Six loblolly pine PtaAGPs (AGP3, AGP4, AGP5, AGP6, 3H6 and 14A9) were predominantly expressed in xylem [45]. The role of these AGPs in secondary wall development is currently unknown. In loblolly pine PtaAGP6 epitopes are restricted to cells formed immediately before secondary cell wall thickening [46]. However, the preponderance of genes in the radiata pine EST resource that are clearly involved in secondary wall development suggests that AGPs may also play an important role in secondary wall synthesis. AGPs were also found to be expressed in xylem tissues of  [47]. While in eucalypts two FLAs are strongly up-regulated in upper branch wood [15]. A tobacco AGP is a candidate linker at the cell surface to mediate signal transduction between the plasma membrane and the cytoskeleton [48].

Expression of cytoskeleton-related genes
Conspicuous in the radiata pine xylem EST resource are genes encoding the cytoskeletal proteins: tubulin, actin and other cytoskeletal-associated proteins. The cytoskeleton provides the structural basis for cell polarity establishment and maintenance [49,50]. Cellulose microfibril arrangement and deposition is believed to be directed by cortical microtubules, the dynamic heteropolymer arrays of α-tubulin and β-tubulin proteins. A Eucalyptus grandistubulin gene (EgrTUB1) has been implicated in determining the orientation of cellulose microfibrils in plant secondary walls [51]. Two Populus tremuliodes -tubulin genes (TUA1 and TUA5) are highly expressed in wood tissues [52] and ten tubulin genes in Populus tremula × P. tremuliodes are up-regulated in secondary walls [5]. In the radiata pine EST resource, a total of 102 ESTs were annotated to encode tubulin proteins, among which three tubulin genes (PrTUA1, PrTUB2 and PrTUB3) are highly or moderately abundant with 52, 26 and 15 ESTs, respectively. These three PrTUBs are most homologous to poplar tubulin genes that are strongly expressed in xylem tissues [52].
Actin microfilaments ensure the delivery of vesicles to specific sites in plant cells and are involved in cell shape determination [53]. Actin genes are highly abundant in the radiata pine EST collection with a total of 58 ESTs (six contigs and five singletons). Other actin related and interacting genes are also found, including ADF, actin binding protein (ABP), actin-related protein (ARP), profilin and villin. Arabidopsis has ten actin genes [53], eight actin-related proteins (ARP) [54] and many actin-interacting proteins including profilin, actin depolymerisation factor (ADF), fimbrin, villin, rho-type small GTPase, capping protein, and actin-interacting cyclise-associated protein [55,56]. The radiata pine ARPs are the putative homologs to Arabidopsis ARP2, ARP3 and ARP6, suggesting the possible involvement of the ARP2/3 complex in actin cytoskeleton development during radiata pine xylem formation.
Cytoskeleton formation involves cycling dynamics of polymerization and depolymerization of its basic structural subunits, such as G-actin and tubulin dimers. Microtubule-associated proteins (MAPs) and actin-binding proteins (ABPs) are believed to regulate the dynamic cytoskeletal changes [57]. In the radiata pine EST resource, several transcripts are present for MAPs (three ESTs), microtubule-binding protein (one EST) and actin-binding protein (two ESTs). Kinesin-1 (conventional kinesin) is a dimeric motor protein that carries cellular cargo along microtubules [58]. A novel kinesin (GhKCH1) from cotton fibers plays a role in coordinating the actin network with the cortical microtubule array [59]. We identified eight radiata pine ESTs annotated as kinesin-1 or kinesinlike proteins which may have similar roles in actin and microtubule interactions.
Phylogenetic tree of CesA proteins from radiata pine and other species

Transcription factors in secondary wall formation
Of the 64 transcription factor families in poplar genome, 41 families (64.1%) are presented in the radiata pine xylogenesis genomic resource (Table 6). These transcription factor families include many regulatory genes, such as MYBs, MADS-box, LIM domain, zinc finger, Class III HD-Zip, WRKY and AUX/IAA. Several members of NAC, MYB, zinc finger, LIM domain, MADS-box, AUX/IAA and homeodomain (HD) are believed to regulate secondary wall biosynthesis [60][61][62][63][64][65]. MYBs and zinc fingers can recognise AC elements in the promoters of many monolignol biosynthesis genes [65][66][67]. Some members of the NAC and MYB families are the key switches in the transcriptional network for secondary wall development [65].
Three radiata pine ESTs of R2R3-type MYB are most homolgous to the three secondary xylem MYB genes (MYB2, MYB4 and MYB8) in loblolly pine and Picea glauca [68]. In loblolly pine PtMYB1 and PtMYB4 are transcriptional activators of lignin synthesis [66,67]. The Eucalyptus EgMYB2 regulates lignin biosynthesis through binding to AC element [69]. Homeobox domain (HD) transcription factors were also identified in the radiata pine EST resource with 29 ESTs (two contigs). Three radiata pine ESTs were annotated as AUX/IAA genes. Auxin can rapidly induce AUX/IAA gene expression [70] and five AUX/IAA genes from hybrid aspen are up-regulated in xylem [71].
In radiata pine EST resource Class III HD-Zip genes were represented by 15 ESTs (six unigenes). Five Arabidopsis HD-Zip III genes are up-regulated in secondary xylem [63,72] and the hybrid aspen class III HD-Zip gene (PtaHB1) is closely associated with wood formation [73]. Eleven radiata pine ESTs (four unigenes) were annotated as LIM domain and four ESTs (one unigene) as MADS-box. A tobacco LIM protein can bind to the Pal-box sequence in the promoter regions of several phenylpropanoid biosynthesis genes [61]. Some Arabidopsis MADS-box genes have been implicated in the regulation of lignin biosynthesis [62].

Gene expression and wood properties
Wood formation is a complex and continuous process involving a series of developmental stages during which considerable variation in wood properties is observed. Compared to mature wood, juvenile wood shows higher spiral grain, lower density, higher microfibril angle (MFA), more longitudinal shrinkage, lower cellulose content, higher lignin content, more pectin and shorter tracheids [1,3,74]. Compared to latewood, earlywood has higher growth rate, thinner cell walls, larger radial lumen, more lignin, lower hemicellulose, lower pectin, slightly lower cellulose, higher MFA and lower density [74][75][76].
Most of the distinguishing properties of juvenile wood are due to the high proportion of earlywood in the juvenile wood [77]. The SilviScan profiling [78] of mature radiata pine trees revealed the unique morphology and wood properties of juvenile earlywood compared to other wood development stages ( Figure 5 and Additional file 6). Juvenile earlywood forms a larger proportion in the early rings and has lower density, higher microfibril angle (MFA), lower modulus of elasticity (MOE) and thinner cell walls. These unique phenotypic characteristics of juvenile earlywood are likely to be caused by its distinct transcriptome ( Figure 3A, B). Some genes involved in cellulose synthesis and secondary wall formation are more strongly expressed in latewood and mature wood than in earlywood and juvenile wood, respectively (P-value < 0.05) (Additional file 3 and 4), which may be associated with some of the observed differences in wood quality between the different wood tissues, such as cellulose content and cell wall thickness. Many genes related to lignin biosynthesis are more abundant in earlywood and mature wood than in latewood and juvenile wood, respectively; however, only some genes are statistically significant (P-value < 0.05) ( Table 7). These differences in gene abundance may give rise to the higher lignin content found in earlywood and mature wood.
Although Audic and Claverei proposed a statistical method [79] to test the significance of digital gene expression, the preferentially expressed genes (even statistically significant genes) derived from gene abundance analysis still require further validation, such as microarray experiments, real time RT-PCR or other alternative methods. Multiplex ligation-dependent probe amplification (MLPA) [80] was used in this study for validation of selected genes ( Table 8). The normalized ratio of EST abundance and MLPA gene expression ratio is highly comparable for a proline-rich protein (PRP) gene in both earlywood and latewood at transition age, as well as for a dehydrin gene in latewood at transition age. The MLPA gene expressions of AGP4 and peroxidase at both juvenile and transition ages confirmed their high EST abundances in earlywood. However, MLPA expression values of some genes may differ from their EST abundances (Table 8), suggesting EST abundance could not infer the exact gene expression level.

Conclusion
The radiata pine genomic resource was developed from six developing xylem libraries with 6,389 high quality ESTs and 3,304 unigenes. This is the first large scale and publicly available genomic resource for radiata pine. Many genes involved in cell wall biosynthesis and transcription factors were identified in the wood formation of radiata pine. Comparative analysis among different development stages revealed a distinct transcriptome in juvenile earlywood. Genes with relatively more expression in earlywood, latewood, juvenile wood and mature wood were also identified, respectively. The identified genes in this study will be candidates for functional genomics and association studies in radiata pine. This genomic resource will also be valuable for comparative genomics of wood formation in forest trees.

Plant material
In order to broadly represent the xylogenesis transcriptome in radiata pine, developing xylem tissues were sampled at six critical development stages: earlywood (spring) and latewood (autumn) tissues collected at juvenile (7 yrs), transition (11 yrs) and mature (30 yrs) ages. Two 7year-old and two 30-year-old trees were sampled at Yarralumla ACT in August 2004 (spring) and April 2005 (autumn). Three 11-year-old trees were sampled at Bondo NSW in September 2003 (spring) and April 2004 (autumn). Developing xylem tissues were collected by scraping the thin (approximately 1.0 mm) and partially lignified layer on the exposed xylem surface at breast height. All xylem tissues were immediately frozen in liquid nitrogen in the field, and then stored in the laboratory at -80°C for later RNA extraction.

cDNA library construction and colony isolation
Total RNA was extracted using the method of Chang, et al [81] with slight modification. Poly(A) + mRNA was isolated using the polyATtract system III/IV (Promega, CA). Five micrograms of mRNA were used as starting template for cDNA library construction using the ZAP-cDNAGigapack III Gold Cloning Kit (Stratagene, CA). After first and second strand cDNA synthesis, the radioactive cDNA samples were checked on an alkaline agarose gel and exposed to x-ray film. The double strand cDNAs were fractionated using a drip column filled with Sepharose CL-2B gel. The fractions at approximately 600 bp or more were pooled and ligated into the EcoRI/XhoI cloning site in the Uni-ZAP XR vector. One to two microliters of ligated cDNA inserts were packaged using Gigapack III Gold Packaging Extract. To make a large quantity of high titer stock of the lambda phage library, the primary libraries were amplified according to the manufacturer's protocols (Stratagene, CA). The pBluescript phagemids containing cDNA inserts were mass excised in vivo from the Uni-ZAP XR vector using Ex-Assist helper phage and XL1-Blue MRF'. The titer of lambda phage and phagemid libraries was measured using the host cells of XL1-Blue MRF' strain and SOLR strain, respectively. Library quality and the size of cDNA inserts were determined by PCR screening of 96 randomly selected clones. All lambda phage cDNA libraries were stored at -80°C.
A wood core and SilviScan profile of wood density for a 32-year-old radiata pine tree Figure 5 A wood core and SilviScan profile of wood density for a 32-year-old radiata pine tree. The cross section of wood core sampled at 1.3 m height shows the annual change in width and colour of earlywood and latewood within each ring. Wood density is the most important wood trait for the sawmill timber industry. SilviScan profiling for wood density of radiata pine revealed the distinct characteristics of juvenile wood, particularly juvenile earlywood, compared to transition and mature wood. SilviScan analysis was performed in the method of Evans, et al, 2000 [78].

Mature wood (≥ 12 yrs) Earlywood Latewood
The excised phagemids were replicated in SOLR cells and grown on LB-ampicillin agar plates overnight at 37°C for colony isolation. All phagemid colonies were picked out and put into LB broth with ampicillin. Colony picking was done by a Versarray Colony Picking Robot (Bio-Rad, CA) or manually with a tooth-pick. The picked colonies were cultured overnight at 37°C with shaking. These suspension cultures were used for PCR reactions and EST sequencing, or stored at -80°C in glycerol stock for later use.

EST sequencing and assembly
An average of 1,200 cDNA clones from each of the six libraries was used for EST sequencing. Approximately 8,000 sequencing reactions were performed on a total of 7,200 cDNA clones. The cDNA inserts were PCR amplified using the M13 forward and reverse primers followed by clean-up with ExoSap (GE Healthcare, UK). The sequencing primer for the 5' end of cDNA inserts was the SK, T3 or M13 primer. A small proportion of sequences were obtained from the 3' end using the M13 forward primer. The sequencing reaction was performed using BigDye 3.1. After a quick start at 96°C 1 min, the reaction was cycled 25 times at 96°C 1 sec, 50°C 5 sec and 60°C 4 min, followed by holding at 4°C. After sodium acetate-EDTA-ethanol precipitation, the sequencing reaction was run using ABI 3730 capillary sequencers (Applied Biosystems, CA). The trace files were base called by Sequencing Analysis v5.2 which was integrated into the ABI 3730 capillary sequencers.
The raw EST sequences were processed using Sequencher 4.7 (Gene Codes Corporation, MI) to trim vector and ambiguous ends. The poly(A) in the EST sequences was  deleted manually. A total of 6,389 high quality ESTs with at least 100 bp in length were obtained. These ESTs were submitted to dbEST at the National Center for Biotechnology Information with Genbank accession numbers FE518213 to FE524601. The 5' end ESTs from 5,952 different clones were assembled to generate consensus sequences for unigenes, including contigs and singletons. The EST assembly was performed using the CAP3 program [82] integrated in BIO301, an automated EST sequence management and functional annotation system [83]. In order to minimize chimeric contigs in assembly, an overlap of at least 40 bp and identity of sequence of at least 95% were used as thresholds in the assembly. The number of ESTs in each contig was used to estimate the redundancy in the EST sequencing.

Annotation and functional classification
The high quality ESTs and derived unigenes were used to search the NCBI nr (non-redundant) database using blastx and tblastx (E-value ≤ 10 -5 ). They were also used to search the UniProt known protein database [84] using blastx (E-value ≤ 10 -5 ). The sequences with no matches in UniProt were further blasted against the TIGR all tentative consensus sequence gene indices database [85] using blastn (E-value ≤ 10 -15 ). Putative functions for ESTs and unigenes were assigned with GO (Gene Ontology) terms [86]. To identify the known protein families, the ESTs were also searched with the Pfam protein family database [87] using blastx (E-value ≤ 10 -5 ). Initial functional classification was based on the assigned GO terms with two levels of classification. Gene enrichment analysis and further functional classification on GO terms were conducted using the DAVID Bioinformatics Resources 2008 [23,24] Identification and validation of differentially represented genes Comparison of EST abundance was used to identify differentially represented genes in different libraries or combined libraries. Number of ESTs for each gene in different libraries was previously normalized at the presumed same number of ESTs in each library. A normalized ratio of EST abundance was calculated for the comparisons. The significance of differentially represented genes was statistically tested using the method from Audic and Claverie [79].
Preferentially represented genes were validated using the strategy of multiplex ligation-dependent probe amplification (MLPA) [80]. More specifically we used the RT-MLPA protocol for mRNA detection and quantification. Four genes (AGP4, proline-rich protein, peroxidase and dehydrin) preferentially represented in the combined earlywood or latewood libraries were included in the validation. Gene expressions of these four genes were detected in earlywood and latewood tissues from three radiata pine trees at transition age (11 yrs). In addition, expressions of AGP4 and peroxidase were also measured in earlywood and late-wood tissues from eight radiata pine trees at juvenile age (5 yrs). We used four technical replicates in RT-MLPA. Reverse transcription for ~400 ng total RNA after DNase treatment was performed using the ImProm-II Reverse Transcription System (Promega, WI) and oligo (dT) 15 . Synthesized cDNA was hybridized with mixed NPK and LIG probes designed for the validated genes (Additional file 7) at 60°C overnight, followed by ligation and PCR amplification with SALSA D4 primer. Mixture PCR products from multiple genes were separated and analysed using the CEQ™ 8000 Genetic Analysis System (Beckman Coulter, CA).

Authors' contributions
XL carried out cDNA library construction, EST sequencing, assembly, annotation, functional classification, gene identification and manuscript preparation. SS and HW proposed the research project and guided the research. SD participated in colony isolation and PCR. All the authors have read and approved the final manuscript.

Additional material
Additional file 1