Polyploid species sequence assembly
Using a next-generation sequencing approach, leaf transcriptome sequence data were generated for the allotetraploid N. tabacum and its progenitor species N. sylvestris and N. tomentosiformis. These sequences were assembled into species-specific sets of unigenes and then further combined into a consensus set of clusters for the three species. The process of assembly revealed that the default parameters of sequence assemblers were probably not stringent enough when working with sequences originating from polyploid species. Sequencing errors, such as the homopolymer length issues associated with pyrosequencing, can further confound this problem by potentially masking the low polymorphism content between homeologs. Other sequencing technologies, such as Illumina, may not be affected by this homopolymer problem, but read length may be a limiting factor given the requirement that a single read must contain at least one polymorphism per overlapping region. These factors should be taken into consideration in any future assembly attempts on polyploid species, and the methodology applied for the assembly of an allopolyploid transcriptome in this study could be useful for guiding future genome assembly work in polyploids.
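The interplay between assembler stringency, overlap length and homeolog divergence can be illustrated with a short calculation. The sketch below is not part of the assembly procedure used in this study; the function names are hypothetical, and it assumes that polymorphic sites occur independently at a uniform per-base rate, which is a simplification.

```python
def prob_distinguishable(overlap_len: int, divergence: float) -> float:
    """Probability that an overlap contains at least one polymorphism.

    Assumption (not from the study): each base differs between the two
    homeologs independently with probability `divergence`.
    """
    if overlap_len < 0:
        raise ValueError("overlap_len must be non-negative")
    if not 0.0 <= divergence <= 1.0:
        raise ValueError("divergence must lie in [0, 1]")
    return 1.0 - (1.0 - divergence) ** overlap_len


def would_collapse(overlap_len: int, mismatches: int, min_identity: float) -> bool:
    """True if an assembler requiring `min_identity` would merge the two reads.

    Identity is taken as the fraction of matching bases in the overlap.
    """
    if overlap_len <= 0:
        raise ValueError("overlap_len must be positive")
    if not 0 <= mismatches <= overlap_len:
        raise ValueError("mismatches must lie in [0, overlap_len]")
    identity = 1.0 - mismatches / overlap_len
    return identity >= min_identity


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    for length in (50, 100, 400):
        p = prob_distinguishable(length, divergence=0.01)
        print(f"overlap {length} bp: P(at least one polymorphism) = {p:.3f}")
    # One mismatch in a 100 bp overlap still passes a 97% identity threshold,
    # so the two homeologous reads would be merged.
    print(would_collapse(100, 1, 0.97))
```

Under these assumptions, short overlaps between closely related homeologs frequently contain no polymorphism at all, and overlaps with a single difference still pass a permissive identity threshold, which is consistent with the need for more stringent parameters noted above.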
Additionally, the number of collapsed homeologs was estimated in the N. tabacum assembled transcripts (using the 97% identity assembly) based on SNPs shared with N. sylvestris or N. tomentosiformis reads. In this analysis, only 3.4% of N. tabacum transcripts were polymorphic and shared SNPs with the parental transcripts. This methodology cannot be applied to transcripts lacking SNPs in the transcript fragment analyzed (67% of the transcripts). More information could be obtained by deeper transcriptomic sequencing (more mapped sequences and more reliable SNP calling), the use of longer sequences (increasing the chance of finding a parent-relative SNP) or genomic DNA sequencing (where introns, being more divergent regions, could increase the number of parent-relative SNPs).
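The shared-SNP criterion can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in this study: the function names, the category labels and the representation of each site as sets of observed alleles are all hypothetical, and real SNP calling would also need coverage and quality filters.

```python
from typing import Iterable, Set, Tuple

Site = Tuple[Set[str], Set[str], Set[str]]  # (tabacum, sylvestris, tomentosiformis)


def classify_site(tab: Set[str], syl: Set[str], tom: Set[str]) -> str:
    """Classify one position of an N. tabacum transcript.

    Each argument is the set of alleles observed in reads of that species.
    """
    if len(tab) < 2:
        return "monomorphic"          # no SNP among the N. tabacum reads
    syl_specific = (tab & syl) - tom  # alleles shared only with N. sylvestris
    tom_specific = (tab & tom) - syl  # alleles shared only with N. tomentosiformis
    if syl_specific and tom_specific:
        return "both_parents"         # consistent with collapsed homeologs
    if syl_specific or tom_specific:
        return "one_parent"
    return "uninformative"            # polymorphic, but not parent-diagnostic


def is_collapsed(sites: Iterable[Site]) -> bool:
    """Flag a transcript if any site carries one allele from each parent."""
    return any(classify_site(t, s, m) == "both_parents" for t, s, m in sites)


if __name__ == "__main__":
    # Hypothetical transcript with two examined positions.
    transcript = [({"A"}, {"A"}, {"A"}), ({"C", "T"}, {"C"}, {"T"})]
    print(is_collapsed(transcript))
```

A transcript in which every site is monomorphic cannot be evaluated with this criterion, which corresponds to the limitation described above.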
Homeologous gene fate in Nicotiana tabacum
Based on the leaf transcriptome data for the Nicotiana species generated in this study, a pipeline was developed to carry out phylogenetic analysis on a genomic scale. The PhygOmics pipeline works on a single transcriptome set, but can also be applied to transcriptomic data from multiple tissues or organs, or to gene models from genomic sequence data.
The majority of the N. tabacum transcripts (69%) did not show any polymorphisms with the parental sequences, making it impossible to distinguish the homeologous genes and excluding the possibility of neofunctionalization in these genes. Additionally, the expression analysis of clusters with genes expressed above background level (more than 5 reads) revealed that the expression of the majority of these genes was unchanged among the three species (83.6% of genes in clusters; 57.7% of the total transcribed genes). With this level of conserved expression, the possibility of subfunctionalization is low.
A more specific topology analysis with the newly developed PhygOmics pipeline revealed that, in N. tabacum transcripts where homeologous genes can be differentiated, there was evidence for the presence of only a single homeolog (90% of gene clusters, 6% of the total transcribed genes). Given that the data are transcriptomic, it is not possible to distinguish between gene loss and subfunctionalization. Tissue-specific gene silencing provides one possible mechanism of gene subfunctionalization and may partially explain the pattern observed in the Nicotiana topologies. An analysis of a broader set of tissues might resolve the question and increase the chance of detecting expression differences in individual genes. However, studies in other polyploid plants suggest that only a small number of genes display tissue-specific gene silencing. For example, a similarly low level of gene silencing (around 1-5%) was estimated in both synthetic allotetraploid wheat and synthetic cotton, and results from gene expression analysis of Tragopogon miscellus showed a similar trend (3.4%). Even lower estimates of silencing were suggested from experiments with an early allotetraploid formed by the hybridization of Arabidopsis thaliana and Cardaminopsis arenosa (> 0.4%).
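The distinction between clusters retaining one or both homeologs can be made concrete with a small sketch. It does not reproduce the PhygOmics implementation. It assumes, hypothetically, that a tree has already been built for each cluster and that every N. tabacum sequence has been labelled with the parental species it groups with ("syl" or "tom"); all names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List


def homeolog_status(sister_parents: Iterable[str]) -> str:
    """Summarize one cluster from the parental sister of each N. tabacum sequence."""
    parents = set(sister_parents)
    unknown = parents - {"syl", "tom"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if parents == {"syl", "tom"}:
        return "both_homeologs"
    if parents == {"syl"}:
        return "S_only"
    if parents == {"tom"}:
        return "T_only"
    return "unresolved"  # no N. tabacum sequence could be assigned


def summarize(clusters: Dict[str, List[str]]) -> Counter:
    """Count clusters per category."""
    return Counter(homeolog_status(labels) for labels in clusters.values())


if __name__ == "__main__":
    # Hypothetical clusters, for illustration only.
    example = {
        "cluster_1": ["syl"],
        "cluster_2": ["tom", "tom"],
        "cluster_3": ["syl", "tom"],
    }
    print(summarize(example))
```

With transcriptome data alone, the "S_only" and "T_only" categories cannot separate a lost gene copy from one that is simply not expressed in the sampled tissue, as noted above.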
Based on the distribution of topologies and the relative expression levels of homeologous genes, there was little evidence to suggest preferential loss or transcriptional silencing of genes from one or the other progenitor genome in the subset of Nicotiana sequences on which this analysis could be completed. This is in contrast to the apparent preferential loss of repetitive sequences from the T genome in N. tabacum, as shown in a recent study that also used 454 sequencing in these Nicotiana species. Previous studies in other allotetraploids have shown preferential expression of homeologous genes. For example, there is evidence of preferential expression of the D genome in cotton. Differential expression was shown for 22% of homeologous gene pairs in the 40-generation-old allotetraploid T. miscellus, similar to the 27% of N. tabacum genes observed in this study. It should also be noted that genes expressed in leaf tissue at a very low level may have been missed in the transcriptome sets, particularly since clusters with fewer than 5 sequence members were removed from the analysis. As such, increasing the sequencing depth might reveal more differentially expressed homeologous genes, but it is unlikely that this would substantially increase the estimated contribution of subfunctionalization.
With the caveat that this study was based on a subset of genes identified in the leaf transcriptomes of Nicotiana species, the data suggest that the expression of homeologous genes is mostly conserved between N. tabacum and its parental relatives, supporting the hypothesis of gene dosage compensation [15, 45] reported previously in other species. This level of conservation may be overestimated, as the transcriptome was sampled in only one tissue type, which reduces the possibility of observing subfunctionalization. However, based on the levels observed in other species [7, 8, 10, 43], subfunctionalization is unlikely to account for a large proportion of genes.
There is also limited evidence of neofunctionalization having occurred in N. tabacum, based on comparison of the homeologous and homologous gene sequences. Indeed, no genes could be identified as undergoing positive selection in N. tabacum that did not also show the same response between N. sylvestris and N. tomentosiformis. This suggests that these differences may have predated the formation of tobacco. Again, the apparently low level of neofunctionalization may be explained by the fact that only the leaf transcriptome was sampled. Sequencing transcripts from other tissues, perhaps those more specifically involved in secondary metabolite synthesis, may increase the likelihood of identifying genes showing positive selection in tobacco; two such examples are trichomes and roots, where alkaloids, including nicotine, are synthesized.
In addition to increased spatial and temporal coverage of the transcriptome for the Nicotiana species covered in this study, it would be interesting to compare the proportion of subfunctionalization and neofunctionalization in tobacco with that in an older Nicotiana allotetraploid species, such as Nicotiana nesophila (dated approx. 4.5 Myr old) or Nicotiana benthamiana (dated > 10 Myr old). Similarly, a comparative analysis of allele selection between wild and cultivated N. tabacum varieties might provide insight into the role of homologous genes in the species' domestication process. Gene duplication plays an important role in the successful transition of a wild species into its cultivated relatives, as shown for several wheat loci. Indeed, there are also examples of duplicated genes from diploid species playing an important role in domestication, including GRAIN INCOMPLETE FILLING 1 (GIF1) and the cell wall invertase OsCIN1 in rice.