In spite of the remaining challenges associated to the assembly of short reads to full-length transcripts  RNA-seq has over the last years evolved into the de-facto standard for transcriptome analysis even in non-model species [11, 42]. In the present study we evaluated the quality of our assembly using a set of different criteria largely relying on comparisons to other plant species. First, we compared basic summary statistics to data from other conifer species, where the transcriptome was obtained through Sanger sequencing. Second, we further characterized the assembly by assessing properties like transcriptome annotation and similarity at the protein level to available plant protein sequences. P. glauca, which has one of the best characterized transcriptome among conifer species, has a transcriptome assembly length of 30.15 Mbp, i.e. 12 Mbp longer than the the P. abies assembly presented here. In addition, the length of individual PUTs is on average shorter in the P. abies assembly than in both P. glauca and P. taeda (Figure 1, Figure 2, http://www.plantgdb.org). This indicates that the P. abies transcriptome assembly only comprises a fraction of the total transcriptome and that a large proportion of the assembled PUTs corresponds to partial transcripts. There are a number of possible reasons for this difference in assembly length despite of the large amount of short reads collected. First, the assembly of short reads from transcriptome sequence is still far from optimal , something that likely is elevated in this data set as the mean fragment length in the sequence library was shorter than the pair-end distance. Second, non-normalized mRNA was used, which means that the read depth of lowly expressed genes is likely insufficient for assembly. Third, the short read libraries used here obviously will capture only a restricted part of the total transcriptome as they are based on a single tissue and only two sampling time points. All this is in contrast to the other mentioned conifer species EST assemblies that are, at least partially, composed of normalized mRNA libraries and also include mRNA extracted from several tissues including root, which in general have divergent expression patterns [21, 22, 43, 44].
In total over 34,000 transcripts showed significant hits when compared to other spruce species, either at the protein level to plant proteins or at the nucleotide level, leaving only 4,167 PUTs without any annotation. This set of PUTs with low or no similarity to other species could represent erroneously assembled PUTs, but since estimates from the P. glauca transcriptome suggest that as much as 11.6 Mbp of the transcriptome is still missing , they could also represent previously undiscovered spruce transcripts. In theory they could also be transcripts specific to P. abies, but this seems less likely as both this data and previous studies have shown that spruce species are very closely related and even share many polymorphic sites [33, 39]. Still, genome size between different spruce species is variable (15.8 - 20.2 pg/1C ), which could imply significant differences in gene content between them. Finally, and perhaps most likely, they may represent assemblies containing mainly non-coding RNA and partial transcripts with mostly UTR sequence that, in general, show lower degree of conservation between species. In summary, even if the P. abies transcriptome contains a large fraction of partial transcripts and is far from complete, the assembly presented here is of a high enough quality and size to be informative on the general properties of the transcriptome.
In fact, ORF predictions identified 6,194 PUTs from P. abies with properties that suggest that they are full-length transcripts. A direct comparison of these PUTs towards full-length transcripts from P. glauca indicate that neither ours nor the P. glauca PUT assembly contains a comprehensive set of full-length transcripts and that despite millions of ESTs from spruce species there is still room for efforts aiming at capturing and sequencing full-length transcripts, such as oligo-capping, something that is of special importance in relation to upcoming genome annotations .
In many of the PUTs with high enough sequence coverage we could extract potentially variable sites, but since sequencing technologies are evolving at a rapid pace, the understanding of the error profiles for a given technique is still sparse making this process nontrivial and prone to false positives. This process is, in the present case, further complicated by the fact that the majority of tools developed to extract SNPs assumes that genome data is available and that sequence depth is even for the underlying genotypes, meaning that one expects around half the reads from each allele at heterozygous sites. In addition, many tools make use of uneven coverage as a tool to detect copy number variants or duplicated regions making standard methods to discover SNPs difficult to use as the uneven coverage and allele specific gene expression will cause deviations from these expectations. By merging population genetics sequence data sets from both unpublished and published projects in P. abies, where the data have been collected with Sanger sequencing [33, 47] we compared the ratio of transitions to transversions to the ratio obtained here under different cut-offs. Adding both quality and coverage of the alternate alleles as criteria yielded almost identical transition to transversion ratio as the Sanger data, suggesting that this criteria might be suitable for this data set (Table 2). Hence, from a single individual study we have identified close to 15 thousand SNPs, but because all of these stem from only one individual a fraction of them will be either singletons or low frequency variants within the species. Even so these results suggests that mRNA sequencing can be of great use for the identification of population genetic markers. However, recent discussions have identified several issues in identification of variable sites from cDNA something that calls for caution and further validation before using SNPs from cDNA as a source for identification of genetic markers [48–50].
The observed mean synonymous divergence of 0.175 between spruce and pine obtained here, is similar to the 0.22 reported as a mean over a phylogenetic tree of four conifer species  and the 0.19 reported from a pairwise comparison of P. sitchensis and P. taeda. In the comparisons to yew we obtained a mean silent divergence around 0.6. These divergence estimates translate into average synonymous substitution rates of 0.6 × 10−09 and 1.1 × 10−09, values that are also in line with previous estimates of 0.7 − 1.31 × 10−09. As previously noted this yearly synonymous substitution rate is much lower than estimates of substitution rate from annual angiosperm species. Data from numerous studies have obtained similar patterns and there is now substantial evidence supporting a lower annual substitution rate in gymnosperms compared to many annual angiosperms [3, 26, 34, 51]. Since lower synonymous substitution rates per year have also been observed in angiosperms trees/shrubs compared to herbs [52, 53] one of the main causes seem simply to be that gymnosperms are trees with a long generation times and if substitution rates are calculated per generation rather than per year, the estimates between perennial gymnosperms presented here are similar to estimates from annual angiosperms. The mechanisms underlying the slower substitution rates in perennials than in annuals remain unclear [52, 54] and we will not speculate on its causes here. Interestingly though, available data also suggest that polymorphism level is lower in trees, and in particular in conifers, than in herbs (Additional file 2). This apparent effect of generation time on nucleotide diversity could reflect the fact that organisms with long generation time and low substitution rate will require longer period of time to recover from past decline in effective population sizes. Also, for a given number of generations they will span much longer periods of time than annuals and thereby will be more likely to experience environmental changes causing variation in population size.
The ratio of non-synonymous to synonymous divergence at full length ORFs between spruce species and either pine or yew reported here, 0.236 (CI = 0.2272, 0.2432) and 0.167 (CI = 0.1614, 0.1717), respectively, are higher than estimates from previous analyses in conifer species , where branch specific estimates were used (with values of 0.12, 0.14 and 0.15 for the internal branch, the branch leading to P. glauca and the branch leading to P. menziesii, respectively), but lower than the pairwise estimate between P. taeda and P. sitchensis (0.314 (95% CI = 0.299, 0.329)) . The deviation from the former study is likely due to the fact that they included only sequences present in all four conifer species used in their study thereby limiting severely the number of sequences (138) and likely enriching the data set for conserved sequences. The difference with the latter study is harder to explain, but stems mainly from differences in the way coding frames and orthology were defined. In Buschiazzo et al. 100 comparisons revealed a dN/dS ratio higher than one, and most of these estimates are based on short alignments where both the number of synonymous sites and changes were small. This is in stark contrast to our results with very few values larger than one and a weaker correlation between dN/dS ratios and alignment length (Figure 5, Additional file 3). Buschiazzo et al. also estimated dN/dS between Arabidopsis thaliana and Populus trichocarpa and obtained a value of 0.0924. They therefore concluded that dN/dS was higher in conifers than in angiosperms. While our results for the conifer data are within the same order of magnitude, it is worth noting that the estimate they obtained between A. thaliana and P. trichocarpa tends to be much lower than previous estimates from angiosperms. For example, in estimates comparing A. thaliana and A. lyrata the mean over more than 5000 genes was 0.203 and within the poplar lineage P. trichocarpa and P. tremula the mean over almost 600 genes was 0.175 [54, 55]. It is also worth noting that the observed synonymous divergence between pine and spruce are more similar to the values obtained within the Arabidopsis and Populus lineages and since synonymous divergence seem to have a strong impact in estimates of dN/dS  comparing ratios over different groups of species should only be done when the observed synonymous divergence is similar. The latter, together with other uncertainty factors discussed above, suggests that caution is warranted before concluding that there are biologically meaningful differences in selective constraints between angiosperms and gymnosperms.
Even when the correct orthologous sequences have been aligned, individual estimates of dN/dS values should not be directly used to infer positive selection as recent work have highlighted several problems in using dN/dS larger than one as indicative of positive selection [56–58]. For example, the vast majority of human genes with dN/dS ratio larger than one is due to low dS values rather than high dN values. Hence, the classic interpretation of high dN/dS ratios is questionable as relatively low dS values will have the same effect on the ratio as high dN values. Furthermore, species pairs that are too closely related will tend to overestimate dN/dS.
In model organisms large scale studies of gene expression have identified numerous properties of genes and proteins that significantly correlate with selective constraints. Here we investigated whether level of gene expression in needles correlated with selective constraints, but found no significant pattern. In animals, fungi and plants a correlation between expression and selective constraint, such that highly expressed genes tend to have lower dN/dS, has been reported [54, 59, 60]. In multicellular organisms the correlation between dN/dS and expression breadth (number of tissues where genes are expressed) is often stronger than with the actual level of gene expression [54, 55, 61]. The lack of correlation in spruce could hence be due to the fact that we only measured gene expression in needles and have no data on expression breadth. Also, since our transcriptome reference sequence is largely created from short reads from needle RNA our reference is biased and contains a specific set of genes that do not fully represent the complete transcriptome; e.g our approach has limited information from genes lowly expressed in needles. The lack of correlation reported here should therefore be taken with a grain of salt as it might simply stem from lack of data rather than from an actual lack of correlation. Analysis of level of gene expression in samples collected in the dark compared to samples collected in the light revealed only a small set of genes showing at least a 2-fold difference in gene expression. Angiosperms studies suggest that around 20% of the transcriptome is differentially expressed between light and dark treatments, even though the number of genes varies depending on species, tissue and actual treatment [62–64]. The pattern observed in our data is different and might suggest that the diurnally expressed genes in gymnosperm trees could be fewer than in angiosperms. This is consistent with earlier studies, reporting a lack of clear diurnal expression pattern at key photosynthesis genes in gymnosperm species [65, 66]. To be noted here is that all expression estimates put forward here stems from a single individual and is hence not suitable for making strong statements at the species level.