Plant materials and RNA preparation
Reference transcriptome
We used three sets of coastal Douglas-fir sequences to construct the reference transcriptome (Table 1). In this section, we describe the plant materials and general sequencing strategies used for each dataset. Detailed laboratory methods are described subsequently.
The goal of the first multi-genotype dataset (MG1
SANG
) was to include existing Sanger sequences from a diverse set of genotypes known to be expressing functionally important genes. The MG1
SANG
dataset was prepared by combining Sanger sequences derived from three ‘cold hardiness’ cDNA libraries (CA, MH, and CD) and one ‘actively growing’ (GR) library [58]. Seedlings used for the cold acclimating (CA) library were collected in September, October, and November; seedlings used for the maximum hardiness (MH) library were collected in December and January; and seedlings used for the cold deacclimating (CD) library were collected in February, March (2 dates), and April in Corvallis, OR. On each date, 10 seedlings were collected from a single orchard seedlot, and total RNA was extracted separately from needles, stems, and buds. The parents of these seedlings originated from a low-elevation population near Toledo, OR. Total RNA was isolated at Oregon State University (OSU) according to Chang et al. [59], except that the RNA was subsequently purified on RNeasy columns (QIAGEN, Valencia, CA, USA). Equal amounts of total RNA were pooled from each tissue prior to sequencing. Sanger sequences from the cold hardiness libraries (CA = 3,949; MH = 3,701; and CD = 3,684 sequences) were combined with 6,760 sequences from the GR library prepared from actively growing seedlings harvested from the greenhouse during their first growing season [55].
The goal of the second multi-genotype dataset was to increase the number genotypes, tissues, and physiological conditions, while also increasing sequence depth and coverage by using 454 pyrosequencing. The resulting MG2
454
dataset consisted of Roche 454 sequences derived from three tissue collection regimes. First, on each of five dates between September and April, we harvested 6 or 12 first-year seedlings and separated them into needles, stems, and buds. These seedlings were grown outdoors in Corvallis, OR, but their seed orchard parents originated from a low-elevation population near Coos Bay (CB), OR [58]. Second, we harvested three seedling tissues on five dates between July and January from a total of 79 seedlots provided by the Cottage Grove Nursery of Plum Creek Timber Company. On all five dates, we harvested buds (i.e., elongating apices or resting buds), shoots (stems plus needles), and roots. On two of the dates when the seedlings were large enough, we also harvested lower stems without needles. These seedlings were grown outdoors in Corvallis, and some were subjected to water stress to induce the expression of genes associated with adaptation to cold and drought. Third, we collected elongating shoots from ramets of two clonal genotypes growing at the Lebanon Forest Regeneration Center managed by Roseburg Forest Products. Because these shoots were collected during June from trees that had been stimulated to produce reproductive buds, they are expected to contain differentiating male and female flowers (strobili). Total RNA was isolated at OSU as described above, or at the University of Georgia (UGA) as described by Lorenz et al. [60]. Individual RNA samples were pooled in an attempt to have mRNAs from buds, stems, needles, and roots equally represented in the resulting cDNA libraries.
The goal of the single-genotype dataset was to expand our representation of genes from mature trees and increase sequence depth and coverage by using 454 pyrosequencing. The resulting SG
454
dataset consisted of Roche 454 sequences derived from five tissues collected on 8 July from two mature ramets of a single clonal genotype growing at the Lebanon Forest Regeneration Center. We collected terminal shoots, stems, and resting buds (all from current year branches), plus cambial tissue and developing seeds from immature cones. Total RNA was isolated at UGA as described by Lorenz et al. [60], and 36 to 130 μg of total RNA was pooled from each tissue prior to cDNA synthesis.
Illumina short-read sequences
The goal of the Illumina sequencing was to enhance SNP detection by increasing the number and genetic diversity of the sequences to be mapped to the reference transcriptome. We used coastal and interior Douglas-fir to produce four sets of Illumina short-read sequences. One set of coastal Douglas-fir sequences (MG2
IL
) was derived from the same pooled RNA sample that was used to construct the MG2
454
dataset described above. The Coos Bay (CB
IL
) dataset was derived from a subset of the Coos Bay RNA samples described above, plus replicate samples harvested on some of the same dates. We used six bud samples and two needle samples for a total of eight Illumina sequencing runs. The Yakima (YK
IL
) dataset was prepared using the same collection protocol and sequencing protocol as for the CB
IL
dataset, but the seedlings were derived from parents growing in a high-elevation inland population near Yakima, Washington that is thought to represent the interior variety of Douglas-fir [61]. Total RNA was isolated at OSU as described above.
The interior Douglas-fir samples (INT
IL
dataset) were collected from mature trees growing in a provenance test near Vernon, B.C., Canada [62] and the Cherrylane Seed Orchard in northern Idaho. Young shoots were collected from the provenance test in early May from two trees from each of 26 seedlots collected from Arizona and New Mexico in the south, to British Columbia and Washington state in the north. Approximately equal amounts of total RNA were pooled from recently flushed buds, stems, young needles, and mature needles. The seed orchard samples were collected in early June from 18 trees originating from northern Idaho. Approximately equal amounts of total RNA were pooled from stems and needles harvested from recently flushed shoots. The two pooled RNA samples were then combined for Illumina sequencing.
DNA sequencing
Reference transcriptome
The MG1
SANG
dataset was produced via Sanger sequencing. For the CA, MH, and CD libraries, pooled samples of total RNA were used by Evrogen JCS (Moscow, Russia) for double-stranded cDNA synthesis using the SMART approach [63], and the cDNAs were normalized using DSN normalization [64]. The resulting cDNAs were directionally inserted into the pAL17.1 vector and transformed into E. coli. SymBio Corporation (Menlo Park, CA) amplified the cDNA clones using rolling circle amplification, and then sequenced about 4,000 cDNA clones per library using a MegaBASE 4000 sequencer (GE Healthcare, Little Chalfont, UK). The non-normalized GR library was prepared and sequenced (Sanger) as described by Krutovsky et al. [55]. Sanger sequences were archived under GenBank accession numbers CN634509-CN641229 and ES417751-ES429084.
The MG2
454
and SG
454
datasets were produced via 454 pyrosequencing. For the MG2
454
dataset, mRNA isolation, cDNA synthesis, and DNA sequencing were performed by the University of Illinois Carver Biotechnology Center using the SuperScript Double-Stranded cDNA Synthesis Kit (Invitrogen, CA) and GS Titanium Library Preparation kit (454 Life Sciences, Branford, CT). The cDNA library was normalized using the Trimmer Direct Kit (Evrogen), and then sequenced using the 454 GS-FLX platform. For the SG
454
dataset, cDNA synthesis was performed by the U.S. Department of Energy Joint Genome Institute (JGI) using the SMART PCR cDNA Synthesis Kit (Clontech, Mountain View, CA). The resulting non-normalized cDNA library was sequenced by JGI using the 454 GS-FLX platform. The raw 454 sequences were deposited in the NCBI Sequence Read Archive (SRA) under accession numbers SRA023776 and SRA051424.
Illumina short-read sequences
The CB
IL
and YK
IL
libraries were constructed at the USDA Forest Service’s Pacific Northwest Research Station using Illumina mRNA-Seq Prep Kits (San Diego, CA) with minor modifications. To obtain strand-oriented reads, we used dUTP for second-strand synthesis, and then selectively destroyed the dUTP-containing strand before the PCR-enrichment step as described by Parkhomchuk et al. [65]. Libraries were constructed using multiplex sequencing adapters [66], and single-end reads of 80 nt were obtained using multiplex sequencing on an Illumina Genome Analyzer IIx at the OSU Center for Genome Research and Biocomputing or Harvard FAS Center for Systems Biology. The MG2
IL
and INT
IL
cDNA libraries were constructed using the standard Illumina mRNA-Seq Sample Prep Kit, and then sequenced on an Illumina Genome Analyzer IIx for 2x101 cycles at the Carver Biotechnology Center. The RNA previously used for the MG2
454
dataset was also used for the MG2
IL
library, and the RNA isolated from interior Douglas-fir was used for INT
IL
. The raw Illumina sequences were deposited in the NCBI SRA under accession number SRA051424.
Pre-assembly sequence processing for the reference transcriptome
Prior to assembly, sequences in the MG1
SANG
, MG2
454
, and SG
454
datasets were cleaned as described below (Figure 1). For the MG2
454
and SG
454
datasets, we used Roche’s sffinfo utility to trim primer and adaptor sequences and produce raw FASTA and quality files from the SFF files. For Sanger sequences (MG1
SANG
), we performed these same functions using phred [67]. We then used the SnoWhite pipeline (http://www.evopipes.net/snowhite.html), which combines Seqclean and TagDust, to remove or mask polyA/T tracts; short, low-quality and low-complexity sequences; and reads matching chloroplast, mitochondrial, rRNA, or retrotransposon sequences. Our filtering database contained (1) vector, adapter, linker, and primer sequences from NCBI’s UniVec database (http://www.ncbi.nlm.nih.gov/VecScreen); (2) chloroplast, mitochondrial, and ribosomal RNA (rRNA) sequences from 21 to 50 species that included conifers, Arabidopsis, and Nicotiana (GenBank; http://www.ncbi.nlm.nih.gov/genbank); (3) a nearly-complete reference of the coastal Douglas-fir chloroplast genome (GenBank JN854170; [68]); and (4) retrotransposon sequences from 27 species, including Arabidopsis and Oryza sequences from the Plant Repeat Database (http://plantrepeats.plantbiology.msu.edu/) and conifer sequences obtained from GenBank as described by Parchman et al. [22]. Our final database of non-redundant filter sequences was prepared by processing all filter sequences through the NCBI BLASTclust program [69] with the identity and coverage parameters set to 90% (i.e., pairwise matches require sequences to be 90% identical over 90% of their lengths). We then used the SnoWhite pipeline to filter reads that had ≥ 96% sequence identity to any sequence in this filtering database (Figure 1, Step 3)
We also filtered bacterial- and fungal-like sequences from the datasets used for transcriptome assembly. These sequences were filtered by first conducting a de novo assembly of each 454 dataset (MG2
454
and SG
454
) using Newbler v2.3 (Figure 1, Step 4; discussed below). The resulting isotigs and singletons (plus the Sanger sequences from the dataset MG1
SANG
) were screened for homology using BLASTN against a dataset of 27,720 white spruce unigenes [16]; http://www.arborea.ca]. Non-matches (E-value > 10-5, bit score < 50, and identity < 96%) were subsequently screened using BLASTN and two local databases: the NCBI nucleotide collection (nr/nt) and NCBI non-human, non-mouse ESTs (est-others). Assembled isotigs and singletons were filtered if the best hit had a bit-score > 50 and an E-value < 10-10, and the corresponding genus name was found in a custom database of 162,679 bacterial and 59,139 fungal names downloaded from the NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy) (Figure 1, Step 5). After we removed the singletons and all reads that assembled into the contaminating isotigs, the original reads were re-assembled as described below (Figure 1, Step 6). We also used the results from these analyses to compare the number of contaminating fungal and bacterial reads between the two 454 datasets.
Assembly of the reference transcriptome
We used Newbler v2.3 (Roche GS De Novo Assembler v2.3; Roche Life Sciences, Inc.) to assemble the reads in the MG1
SANG
, MG2
454
, and SG
454
datasets into a single reference transcriptome consisting of isogroups (unigene models), isotigs (presumed transcript variants), and singletons ≥ 100 nt (Figure 1, Step 6). Prior to the final assembly of all datasets, we first evaluated the impact of alternative assembly parameters. De novo assemblies were run using the transcriptome (-cdna) option, minimum read length (-minlen) of 40 nt, isotig length threshold (-icl) of 40 nt, a large contig threshold (-l) of 100 nt, plus a factorial arrangement of the following parameters: minimum overlap lengths of 35 and 45 nt; alignment difference scores of -2 and -6; and minimum overlap identities of 82% to 98%. The 20 resulting assemblies were evaluated based on the total numbers of isogroups, number of isogroups represented by a single isotig (I1 subset), and the number of isogroups represented by multiple isotigs (IM subset). We also evaluated the assemblies by comparing the assembled isogroups to white spruce unigenes using the approach described below. Based on these evaluations, we performed the final Newbler assembly using a minimum overlap length of 45 nt, alignment difference score of -6, and a minimum overlap identity of 96%. We clustered the resulting isotigs using Vmatch (http://www.vmatch.de; –dbcluster p
small
= 99 and p
large
= 99) to form a non-redundant set of sequences, and then calculated the assembly statistics shown in Table 2 using custom Perl scripts. The reference transcriptome (i.e., 37,177 isotigs ≥ 200 nt) has been deposited at DDBJ/EMBL/GenBank under accession GAEK01000000.
Comparison to white spruce and loblolly pine
We compared the Douglas-fir assembly to a set of white spruce unigenes using SCARF, a sequence assembly tool designed for assembling 454 EST sequences against a reference sequence from a related species (Figure 1, Step 8) [17]. We downloaded 27,720 white spruce unigenes constructed from Sanger sequenced ESTs [16]; http://www.arborea.ca], and then used SCARF to determine where Douglas-fir isotigs matched white spruce unigenes. The combination of MegaBLAST parameters [70] used in the SCARF analysis resulted in matches with a minimum length of 40 nt, minimum identity of 77%, minimum bitscore of 80, and maximum E-value of 2 × 10-13. Using this information, we defined seven types of structural relationships (C1 to C7) between Douglas-fir isotigs and white spruce unigenes, and then assigned the isotigs to these classes based on three criteria (Table 3). First, we classified each isotig according to the number of white spruce matches: no match (C7), one match (C1, C3, C4), or matches to multiple white spruce unigenes (C2, C5, C6). Multiple matches were counted only when the percent identities were within 5% of the best match. Second, we determined whether other isotigs matched the same white spruce unigene, resulting in isotigs that were classified as having no matching partners (C1, C2), and those that did (C3 to C6). Finally, for the isotigs with matching partners, we determined whether the partners overlapped each other (C4, C6) or not (C3, C5). We then assigned relative confidence scores to each isotig assembly based on these relationship classes: C1 = Highest; C2 and C3 = Higher; C4 and C5 = Medium, C6 = Lower; and C7 = Unknown.
We conducted the same SCARF analysis using loblolly pine as the reference, but these analyses were completed after the SNP array was constructed and tested. These analyses were conducted using 35,550 contigs that comprise the first release of the PineDB transcriptome assembly (PineDB v1.0; June 15, 2012; http://bioinfolab.muohio.edu/txid3352v1/interface/download.php).
Annotation
We annotated the isogroups using a local tBLASTX [69] search against the Uniref50 release 2010_09; [71] and TAIR10 (TAIR10_pep_20101214; [72]) databases using an E-value of 10-5 (Figure 1, Step 8). We then summarized the results separately for the I1 isogroups, IM isogroups, and singletons. For the IM set of isogroups, a hit was counted only if all isotigs matched the same protein in the database; otherwise this isogroup was considered unannotated. We also annotated sequences using the Annot8r pipeline (http://www.nematodes.org/bioinformatics/annot8r/index.shtml), which assigns GO terms [73], EC numbers (http://www.chem.qmul.ac.uk/iubmb/), and KEGG pathways [74] to protein or nucleotide sequences from non-model organisms based on sequence similarity to protein sequences in the EMBL UniProt database (http://www.uniprot.org/). We also assigned GO-slim terms to the isogroups and singletons using the results from the TAIR10 tBLASTX search. We extracted GO-slim terms for the matching Arabidopsis accessions from the TAIR10 database, and then compared the distributions of GO-slim terms for the I1 isogroups, IM isogroups, and singletons versus the distribution of GO-slim terms for all 35,386 Arabidopsis accessions in the TAIR10 database (http://ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/). Finally, we assigned taxonomic affiliations to the isogroups and singletons using the results from the UniRef50 tBLASTX search described above. We extracted the taxonomic assignment for each best-hit, and then summarized them according to the categories shown in Table 5.
Processing of Illumina short-read sequences and analysis of sequence orientation
Illumina short read sequences were mapped to the transcriptome reference to identify SNPs. Some of the short-read sequences contained strings of nucleotides with a quality score of 2 (i.e., ‘B’ ascii character), which Illumina uses to indicate that these calls should not be used for downstream analysis. Therefore, we changed these positions to ‘N’s before read mapping and SNP detection (Figure 1, Step 7).
We used the strand-oriented reads from the CB
IL
and YK
IL
libraries to infer the orientation of the isotigs and singletons. We used Bowtie v 0.12.7 (−M 1, -q, –n 2; [75]) and custom R scripts to count the number of unique alignment locations where reads were mapped as direct Illumina output (D) and as their reverse complements (C). For each isotig and singleton, we summed D and C across both strand-specific datasets, and then used a two-tailed binomial test to test whether C was significantly greater or less than D (P < 0.05), which would indicate the corresponding isotig or singleton is in the forward (+) or reverse (−) orientation, respectively.
SNP detection
Flanking variants
The first step toward identifying likely SNPs and designing SNP assays was to identify flanking variants (SNPs and indels) using permissive criteria (Figure 1, Step 9). We combined the Sanger and 454 sequences (MG1
SANG
, MG2
454
, and SG
454
) into a single dataset, and then aligned them to the reference using the BWA-SW program with default parameters [76]. For the Illumina sequences (CB
IL
, YK
IL
, and INT
IL
), we used the Novoalign short-read aligner with default parameters (Novocraft Technologies; http://www.novocraft.com). We used SAMTools [77] to output the alignment results to the BAM format, and then used mpileup, BCFTools, VCFutils, custom Perl scripts, and SAS (Statistical Analysis System, Cary, NC) to extract and summarize sequence variants. For these programs, we used the following parameters: 20,000 = maximum number of reads for calling a SNP, 20 = minimum mapping quality, and 20 = minimum base quality to identify putative SNPs. These SNPs were subsequently filtered using more stringent criteria. Although we recorded indels found within the query dataset or between the query dataset and the reference, we only recorded SNPs found within the queried dataset. That is, if the query dataset differed from the reference, but had no called SNP itself, we treated the variant as a sequencing error. Because our input sequences were derived from pooled samples, we did not filter variants based on probability values from BCFTools (i.e., we used -p = 2.0 for the BCFTools view and VCFutils programs). Instead, we estimated SNP and indel probabilities from the mpileup output using a custom Perl script that implemented the methods described by Wei et al. [78], using a MAF value of 0.01 and sequence error rate of 0.01. We also used this Perl script to remove variants that had a total read depth < 5 or < 2 alternative alleles in the dataset. For each dataset, we compared variants detected using the BCFTools/VCFutils programs versus our custom Perl script, removed indels from the 454 dataset, merged the five datasets, and then removed other variants that did not meet a flanking probability threshold (PF) of 0.10 in any single dataset or pooled across datasets (Figure 1, Step 9). The pooled across-dataset probability was calculated using a chi-square test with 10 degrees of freedom, where X2 equals −2Σln(pi), and P
i
is the SNP probability for each of the five datasets [79]. We then used a Perl script to generate a reference sequence for each dataset that identified all retained indel and SNP positions using IUPAC codes, and these were combined to create a comparable sequence for Douglas-fir.
Target SNPs
We filtered flanking SNPs to obtain sets of ‘target SNPs’ that could serve as a resource for future genotyping assays. In Figure 1, we show three output datasets based on target SNP probabilities (PS) of 10-2, 10-3, and 10-4 (Figure 1, Step 10). To avoid redundant SNPs, this database was developed using only the longest isotig from each isogroup. For these datasets, we retained bi-allelic SNPs that were not near a high-quality indel (i.e., did not receive a BCFTools code of “G” in any dataset), had a mapping quality score > 40 in at least one dataset, and probabilities < 10-2, 10-3, or 10-4 in at least one dataset. Using a SNP probability of 10-4, these criteria resulted in 278,979 potential SNPs for which we obtained Infinium design scores. Design scores were obtained using four different sequence datasets constructed using flanking probabilities (PF) of 10-1, 10-2, 10-3, and 10-4 (Figure 1, Step 11). The dataset of 278,979 potential SNPs constructed using a PS of 10-4 and PF of 10-1 was used as the starting point for constructing a genotyping array. These SNPs have been deposited in the NCBI dbSNP database under submitter handle HOWE_OSU, with ss numbers ranging from 523,746,501 to 524,245,331.
Infinium genotyping array
We used additional criteria to filter the target SNPs to obtain 8769 SNPs for testing on an Infinium II genotyping array (Figure 1, Step 12). During this filtering step, we did not consider SNPs from isotigs having low confidence scores (C5 or C6). First, we selected SNPs in genes that were differentially expressed during cold acclimation [58]; unpublished data] or had annotations suggesting they were associated with growth, phenological traits, stress resistance, or adaption to temperature or drought. For these SNPs, we selected as many as two SNPs per isotig, excluding SNPs within 50 nt of each other. For the remaining SNPs, we removed those not found in at least two datasets, and then retained the most probable SNP in each isotig (i.e., based on the mean probability across all datasets). In the final filtering step, we retained all SNPs in differentially expressed genes (see above), and then filtered the remaining SNPs if they required two probes to assay (i.e., Infinium I assay type = A/T and C/G SNPs), or had a design score < 0.60, fewer than 10 quality reads, or a frequency < 0.05 (i.e., based on mean values across datasets). These criteria, which yielded 8769 SNPs, were specifically chosen to be compatible with an Infinium II array [18] that has a capacity of 9,000 attempted bead types.
We tested the Infinium array by genotyping 260 trees of coastal Douglas-fir. DNA was isolated from ~50 mg of frozen needles using the DNeasy Plant 96 Kits (QIAGEN), genotyping was performed by the UC Davis Genome Center according to protocols from Illumina, and the resulting data were analyzed using Illumina GenomeStudio software v2011.1 [80].
We assessed the quality of the resulting SNP loci based on the Illumina GenTrain scores, GenCall scores, SNP call frequencies, MAFs, and probabilities of deviation from HWE (Figure 1, Step 13) [80]. Each of these measures ranges from 0 to 1. GenomeStudio software uses a custom algorithm to cluster the data for each locus into homozygous and heterozygous classes, and the GenTrain score reflects the quality of these clusters. The calling algorithm then uses the GenTrain model and signal intensities to assign (“call”) a genotype for each locus and tree. The GenCall score reflects the quality of this assignment, and can be used to judge the quality of an individual SNP call, a SNP locus, or DNA sample. For example, the median GenCall score (50% GenCall score) is often used to judge the quality of SNP loci. Another measure of locus quality is the call frequency, or call rate, which is the number of successfully called SNP genotypes divided by the number of DNA samples (260 in our case). Based on the recommendation for the Infinium platform [19], we considered calls with GenCall scores < 0.15 as unsuccessful (“no calls”). In this paper, we report the numbers and characteristics of high-quality SNP loci, which we defined as loci that were polymorphic in our sample of 260 trees with call rates ≥ 85%. We also identified SNPs that deviated from HWE using the exact test described by Wigginton et al. [81] and a probability level of 0.9 × 10-5 (i.e., Bonferroni-corrected P-value of 0.05 based on 5847 SNPs). Finally, we used SAS Proc Logistic and stepwise model selection to determine whether the high-quality SNPs could be predicted from 12 SNP bioinformatic characteristics.