Previous reports of ESTs from non-model, yet ecologically and economically important organisms have been successfully sequenced and annotated using next generation sequencing [[19–21]]. Here, we have used 454 next-generation sequencing to generate the first transcriptome sequence data for big sagebrush, a key ecological species of the western North America. Similar to reports of other efforts, the assembled ESTs of big sagebrush were further analyzed to generate a functional characterization of the transcriptome and discover putative molecular markers (SSRs and SNPs). A surprisingly high level of nucleotide diversity was also found within individual assemblies of ESTs from big sagebrush accessions.
To generate a functional characterization of the big sagebrush transcriptome, we compared the contigs and singletons obtained from the combined assembly to peptides within the non-redundant protein database using BLASTx. The low number of matches (54 contigs) to Artemisia annua sequences is probably due to fewer number of A. annua sequences available in the NR database compared to species such as Vitis vinifera. We expect that the numbers of hits will substantially increase with the eventual publication and annotation of an A. annua and other Artemisia and Asteraceae genome sequences. A majority (69.8%) of the assembled sequences did not align with any peptide in the NR database, possibly indicating the presence of substantial number of novel genes in A. tridentata transcriptome and related taxa. Genes of unknown function are not unexpected, as the discovery of novel genes has been demonstrated in other EST sequencing projects within non-agricultural plant families [2, 22].
Many of the contigs and singleton ESTs identified in this study are expected to have ecological and adaptive relevance. Previous studies relating sagebrush biochemistry to mule deer feeding preference suggest strong correlation between the composition and concentration of secondary metabolites, especially terpenoids, and mule deer preference of sagebrush [23, 24]. We were able to identify many, but not all, of the genes coding enzymes involved in MVA, MEP, and phenylpropenoid pathways. The failure to detect all genes from these pathways could be explained by a lack of transcriptome coverage and/or by a lack of pathway documentation of these specific genes . The detection of major enzymes involved in phenylpropanoid pathway in big sagebrush and variation within these pathways may aid in elucidating herbivore preferences and trade-offs between defense responses.
Polymorphisms in A. tridentata ESTs
A large number of SNP and SSR markers were discovered and different subsets of SNPs were validated using Sanger amplicon sequencing of cDNA and genomic DNA, Illumina cDNA sequencing of ssp. wyomingensis, and sequence capture. We verified (Type 1) six of six tested SNPs using amplicon Sanger sequencing of individually selected PCR fragments. Additional verification (Sanger sequencing of next-generation sequencing results) was deemed unnecessary due to past experience in Arabidopsis , Amaranth [27, 28], and cotton (Udall, personal communication) using this same conservative bioinformatic pipeline. These other studies verified 100% of 5 × more SNPs using Sanger re-sequencing of amplicons and demonstrated that they segregated in mapping populations such that genetic maps were reliably constructed. Similar to these other studies, a small number of genotypes (2) were used for SNP discovery in sagebrush ESTs. It was possible that the two individuals selected for EST sequencing could also represent minor alleles at a number of SNPs. Thus, the SSRs and SNPs that we report here represent DNA differences between individuals and differences between subspecies.
In our efforts to describe SNPs in big sagebrush, we have also quantified the number of SNPs that were due to subspecies differences and those that were due to individual differences. The high numbers of SNPs between individuals, apparent in the individual assemblies (of two individuals), in the validation using ssp. wyomingensis, and in the sequence capture assemblies (of four individuals) suggested significant amounts of nucleotide diversity between individual genomes of Artemisia. This evidence was supported by three findings. 1) When discriminating SNPs between ssp. tridentata and ssp. vaseyana were re-identified at a higher stringency than 90% (at 99%), 13% of the SNPs were not detected because of a single parameter requiring a degree of homogeneity among residues originating from a single DNA source. This suggests that both individuals used for EST sequencing contained a high number of heterozygous loci. 2) Using Illumina sequencing, only 36% and 44% of the SNP positions had both alleles detected in the ssp. wyomingesis samples respectively, where nearly all of the SNP positions were at least represented by one or the other allele. This indicated that both alleles of a significant number of the SNPs exist in a third A. tridentata subspecies, but a true polyploid hybrid of these the two diploid subspecies would contain both alleles of all SNPs. Thus, the ssp. wyomingensis samples used here were likely derived from different diploids and those individuals had significantly different genotypes than those used for EST sequencing. 3) Using sequence capture, only 54% of the 403 SNP positions were validated as discriminatory between ssp. tridentata and ssp. vaseyana, but 67% of the SNP positions had both bases detected. Thus, 13% of the sequence capture validated SNP positions also appeared to be heterogeneous (two nucleotides) within the collected individuals used for sequence capture. Indeed, a significant number of SNPs were found between individual plants within A. tridentata subspecies. Much of this nucleotide diversity at initially identified SNP loci could be at heterozygous loci, though we are careful not to describe it as such until allelism between nucleotide residues is fully established through segregation analysis. Recall that these EST sequences contain both coding and non-coding sequence (particularly the 3' UTR as the poly-A tail was used for priming the cDNA synthesis). A high level of nucleotide diversity in these coding and non-coding sequences is certainly plausible considering the very large effective population size of big sagebrush and wind-pollination strategy .
Given the high level of heterozygosity due to the out-crossing nature of big sagebrush populations , we expect that a large number of inter-subspecific SNPs and intra-subspecific SNPs could be used in conducting subspecies level association genetics studies. To date, little or no sequence of big sagebrush has been made publicly available, thus the SNPs reported here represent a starting point for such future population genetic studies of big sagebrush. While they may form the basis of future molecular studies, caution is needed because informative SNP comparisons will depend on the specific individuals selected for genetic analysis. Alternatively, our study suggests that a sequenced based approach to population genetics such as a population-wide genome reduction strategy  or amplicon analysis should be considered because of the expense required for assay development and their potential use in few, specific A. tridentata individuals. Such an approach would avoid extrapolation of our putative SNPs specific to these individuals to a larger population of individuals (e.g. subspecies' specific SNPs that were likely due to genetic variation between individuals) by generating accession-specific data for each newly sequenced accession. Implementation of such study among spatially distributed big sagebrush populations would 1) enlighten our understanding of natural selection on genes and gene complexes controlling adaptive traits, and the evolution of these trait-linked loci and 2) provide relatedness metrics between natural populations of these subspecies and their hybrid zones. Though we briefly touched on these questions by using independent genotypes for SNP validation, these questions are out of the scope of this particular study that aims to primarily characterize EST sequences of big sagebrush and provide insight regarding the origins of ssp. wyomingensis.
Regarding the discovered SSRs, we were surprised to find that all SSR repeat motif types detected were much more abundant in ssp. tridentata compared to ssp. vaseyana. The reduced levels of SSR in ssp. vaseyana ESTs compared to ssp. tridentata could be due to differential gene expression since different loci were sampled with our non-replicated experimental design. While leaves from both plants were harvested at the same time in common garden, phenological differences between the subspecies might have caused differences in expression levels and thus, changes in the number and types of detected SSRs. While gene expression could explain some of the differences, many such EST-SSRs have been found to be reliable molecular markers in other species [[22, 30–33]] and they represent hypothetical (i.e. testable) genetic divergences between the subspecies.
Ka/Ks and gene evolution in big sagebrush
The ratio of synonymous and non-synonymous mutations between sspp. tridentata and vaseyana suggest possible selection pressure resulting in the maintenance of subspecies divergence, as similar trends have been observed in various organisms [[34–37]]. Since natural selection shapes phenotypes and genotypes in favor of adapted traits, the Ka/Ks ratio of less than 1 for a large number of contigs could be a result of either stabilizing or diversifying selection within both subspecies, depending upon the magnitude of the ratio. Or if divergence times are very recent, it could also be the hallmark of purifying selection on the adapted common ancestor of these two subspecies. For example, Contig_29840 (Ka/Ks = 0.106) was annotated for 'aquaporin' protein. Considering that big sagebrush grows in variety of soils and arid plains, valleys and foothills of mountains, the importance of aquaporin proteins in water balance is critical and the genes coding for aquaporin proteins could have been under stabilizing selection. A formal investigation of molecular evolution within these species (with a proper outgroup) would place selection pressure relative to species divergence.
Exploring the inter-subspecies hybridization hypothesis
Hybridization can be of great importance to the ecological adaptation and subsequent evolution of offspring because of the novel genetic recombination and spatial selection [[38–40]]. Generally, allopolyploid formation is considered to have arisen through hybridization between unreduced gametes [[41–43]]. Several studies have been conducted on hybrid populations formed from A. t. ssp. tridentata and A. t. ssp. vaseyana to investigate hybridization events. Generally, these hybrid populations are formed in a narrow zone between the two ecotypes [[29, 44–47]]. In this study, we did not select a tetraploid ssp. wyomingensis along with diploid representatives of its two neighboring ssp. tridentata and ssp. vaseyana populations. Instead, selected ssp. tridentata and ssp. vaseyana accessions were chosen for EST sequencing based on penetrance of specific, subspecies morphological markers (i.e. trueness to type). Thus, variation at SNP loci for the diploid-tetraploid comparison is a mixture of individual variation, variation within inter-mating populations, and variation between subspecies in this study. Based on the number of Illumina reads that actually did map to discriminating SNPs between sspp. tridentata and vaseyana, the tetraploid ssp. wyomingensis samples appeared to contain both alleles for a large number of loci (251/695 Montana; 458/1,039 Utah). The presence of both alleles at approximately one-third of the loci suggests that ssp. wyomingensis either originated as an allotetraploid from a hybridization event of 2 n gametes between sspp. tridentata and vaseyena or formed as a autopolyploid from both diploid subspecies with subsequent hybridization. Since allopolyploids have been reported between diploids and tetraploids of ssp. tridentata and ssp. vaseyena [[9, 29, 46, 48]], a similar scenario is plausible for the origin of ssp. wyomingensis. A focused genetic study within and between putative hybrid zones of big sagebrush is needed to further elucidate the origins and reproducibility of hybridization processes involved in ssp. wyomingensis formation. If tetraploid recurrence is a common feature of ssp. wyomingensis, perhaps only populations of ssp. tridentata and ssp. vaseyana need active management during environmental conversation of wildlands because a tetraploid hybrid between the two locally adapted accessions could be expected to form and repopulate geographic zones between the diploid subspecies.