Evolutionary insights from de novo transcriptome assembly and SNP discovery in California white oaks
© Cokus et al. 2015
Received: 10 November 2014
Accepted: 7 July 2015
Published: 28 July 2015
Reference transcriptomes provide valuable resources for understanding evolution within and among species. We de novo assembled and annotated a reference transcriptome for Quercus lobata and Q. garryana and identified single-nucleotide polymorphisms (SNPs) to provide resources for forest genomicists studying this ecologically and economically important genus. We further performed preliminary analyses of genes important in interspecific divergent (positive) selection that might explain ecological differences among species, estimating rates of nonsynonymous to synonymous substitutions (d N/d S) and Fay and Wu’s H. Functional classes of genes were tested for unusually high d N/d S or low H consistent with divergent positive selection.
Our draft transcriptome is among the most complete for oaks, including 83,644 contigs (23,329 ≥ 1 kbp), 14,898 complete and 13,778 partial gene models, and functional annotations for 9,431 Arabidopsis orthologs and 19,365 contigs with Pfam hits. We identified 1.7 million possible sequence variants including 1.1 million high-quality diallelic SNPs — among the largest sets identified in any tree. 11 of 18 functional categories with significantly elevated d N/d S are involved in disease response, including 50+ genes with d N/d S > 1. Other high-d N/d S genes are involved in biotic response, flowering and growth, or regulatory processes. In contrast, median d N/d S was low (0.22), suggesting that purifying selection influences most genes. No functional categories have unusually low H.
These results offer preliminary support for the hypothesis that divergent selection at pathogen resistance are important factors in species divergence in these hybridizing California oaks. Our transcriptome provides a solid foundation for future studies of gene expression, natural selection, and speciation in Quercus.
Transcriptome sequences provide valuable resources for research in comparative genomics, population genetics, and evolutionary biology. Although numerous crop and some tree species have fully sequenced and annotated reference transcriptomes [1–3], there is still a need for more sequences from ecologically important taxa. Oaks (Quercus spp.) are ecologically and economically important species that lack a reference genome or transcriptome and would benefit from such resources . To date, the genomic sequence resources available in oaks are limited primarily to expressed sequence tags (ESTs) in the European white oaks, Q. robur L. and Q. petraea (Matt.) Liebl. [5–8], and in the eastern North American oaks, Q. alba L. and Q. rubra L. , and many of those resources remain under development . Research in population genetics, genomics, evolutionary biology, hybridization, response to the environment, and global change biology of oaks would benefit from an annotated transcriptome assembly and deep panel of nucleotide polymorphisms, especially in species and geographic regions with fewer such resources to date. High-throughput, short-read RNA-Seq data have enabled the rapid development of reference sequences and identification of single-nucleotide polymorphisms (SNPs) among those sequences [10, 11].
male flower/leaf opening bud
male flower opening
male flower opening
male flower/small leaf opening bud
expanding male flower; small/medium leaf
male flower/small leaf opening bud
full-size young leaf
Q. lobata–douglasii hybrid
full-size young leaf
We used our reference transcriptome and SNP data to perform preliminary comparative analyses, testing for signatures of natural selection that can provide insight into the factors that explain phenotypic or ecological differences among species. A number of approaches have been employed in the literature to understand species divergence, including phenotypic selection experiments, quantitative trait locus studies, and genome-wide sequencing [16–18]. Together, they strongly implicate natural selection, not just genetic drift, as a prominent force in speciation and divergence. However, it is more complicated to explain the maintenance of species boundaries and ecological specialization in species that naturally hybridize. For example, even modest hybridization can break down species boundaries and lead to homogenization  or maladaptation , or it could facilitate the transfer of adaptive alleles among species , among other possibilities [22, 23]. One hypothesized explanation for the persistence of morphological and ecological distinctions among species despite hybridization is divergent natural selection at ecologically relevant genes, such as those involved in biotic or abiotic stress responses . With current large-scale single-nucleotide polymorphism (SNP) data sets among closely related species and classical molecular tests for selection, we now have the tools to test this hypothesis by revealing specific genes and functional classes of genes that are under divergent selection among lineages [25, 26].
The genus Quercus (oak) has long been recognized for its propensity to hybridize yet maintain ecological and morphological differentiation , which has led to the proposal of the ecological species concept . Although pre-zygotic isolating mechanisms exist [29, 30], oaks frequently hybridize and species maintenance must be explained by other factors related to divergent selection in many cases [31, 32]. Our preliminary analyses identify genes under divergent selection among several California white oaks to test the hypothesis that selection by abiotic and biotic stresses contribute to ecological divergence and maintenance of oak species boundaries. We used our transcriptome-wide SNPs to estimate the ratio of nonsynonymous to synonymous substitution rates (d N/d S) as evidence for divergent (>1) or purifying (<1) selection , and Fay and Wu’s H as evidence for positive selection (<0). In particular, we tested for functional classes of genes based on Pfam annotations [33, 34] that showed the strongest evidence of divergent or positive selection among these ecologically distinct species.
Results and discussion
Approximately 12–22 million 100-base paired-end reads per individual (n = 24) and 420 million read pairs total (84 Gb) were obtained (NCBI: PRJNA282155). Adapter contamination was minimal (≈1 in 5,000 reads with a possible fragment at any position, and, of these, mostly reads entirely rather than partially adapter). The insert size distribution was short enough that 237 million read pairs overlapped unambiguously and thus were merged, forming 35 Gb of virtual single-end reads of 100–184 bases. The Ray de novo assembler  was used to assemble these and unmerged paired ends as a pool into a draft transcriptome of 88,595 preliminary contigs of sizes 203–16,982 bp totaling 73 Mbp (N50 = 1.2 kbp). Approximately 23,000 contigs of total 41 Mbp were of size ≥ 1 kbp, and a large number of the remaining contigs were quite short and of low coverage (Additional file 1) and may be (fragments of) low-expression genes, intron leakage (as can be common in some RNA-Seq preparations), 5ʹ- and 3ʹ-UTR extremes that tail off to low abundance or are shared by multiple genes, etc.
Thousands of contig pairs were found to share intervals of ≥ 35 bp of exactly identical sequence away from regions masked by Tandem Repeats Finder . Pairs of contigs whose ratio of average coverage differed by no more than 2:1 and that had only one uniquely aligning interval compatible with end-to-end joining were joined. Thus, some 9.6 k contigs of total ~8 Mbp were joined and replaced with ~4.6 k larger contigs of total ~7 Mbp. These joined contigs were found to be much enriched for compatibility with amino acid-coding open reading frames (ORFs), such that often one model missing its 3ʹ-end was joined with one missing its 5ʹ-end. An effort was also made to identify over-merged contigs and split them, although a survey of likely coding regions suggested that over-joining was not prevalent (see “Gene models and UTRs” below).
The final number of contigs after merging and splitting was 83,644 (N50 = 1.2 kbp) (see http://genomes.mcdb.ucla.edu/OakTSA/ for files containing assembled contigs, all annotations, variant calls, and other results discussed below). The resulting 72.5 Mbp draft transcriptome is nearly 10 % of the total estimated oak genome sequence of 750 Mbp  and is comparable to, although somewhat larger than, the Populus trichocarpa  and Arabidopsis [38, 39] transcriptomes.
Our transcriptome assembly is of high quality and purity. First, almost all 100-mers in almost all transcriptome contigs are highly unique (Additional file 2). Thus, even though Quercus has high genetic variation [40–42], it does not appear that many genic loci were multiply assembled (e.g., forming separate contigs from divergent alleles or splice variants). Second, we found that 70–78 % of original read ends per individual mapped back to our assembled reference with mapping confidence ≥ 99 % (or 80–89 % of reads with any mapping quality), comparable to typical RNA-Seq experiments with established references. Third, the C+G content (mean 39 %) is consistent with that of other oaks  and plants in general [38, 43]. The distribution of C+G content across coding sequences, 5ʹ-UTRs, and 3ʹ-UTRs is distinct and descending across these groups, and are similar to Arabidopsis (Additional file 3). The oak transcriptome contigs lacking gene models, which are generally short and of low expression, have low C+G content similar to 3ʹ-UTRs, consistent with being enriched for introns. In Arabidopsis, introns also have low C+G content similar to 3ʹ-UTRs .
Fourth, Quercus is by far the dominant organism represented in the contigs and non-oak contamination is not prominent, both as determined by comparisons to known sequences (see “Gene models and UTRs”), as well as the 18 contigs putatively identified as rRNA used as indicators. One rRNA contig had high-coverage, high-identity best match to the Quercus rubra chloroplast, one to the Gossypium hirsutum (cotton) mitochondrion (note that NCBI does not have a full Quercus mitochondrial sequence but only a small fragment), and the rest to bacterial sequences. However, of the 2.2 million original reads that aligned to these known sequences, 87 % matched the Q. rubra chloroplast, 11 % matched the cotton mitochondrion, and just 2 % matched the bacterial sequences. Furthermore, percent identity tended to be about 99 % with the chloroplast or mitochondrial reads and considerably lower with the bacterial reads.
Additionally, our Californian oak (Q. lobata/garryana/douglasii) transcriptome contained many contigs in common with European Q. robur EST data and the eastern North American Q. alba 454 EST data (Additional file 4), although our assembly consisted of more total contigs with twice the mean and maximum length (Additional file 5). For example, 55 % of our California Quercus contigs were found in Q. robur ESTs, and 19 % were found in Q. alba ESTs. In contrast, the proportions of Q. robur (62 %) and Q. alba (79 %) found in our Quercus transcriptome are higher. Together, these reciprocal comparisons indicate that the Q. alba dataset is largely found within our Quercus or the Q. robur datasets, whereas the Q. robur and our Quercus datasets each contain many unique transcripts.
Further indications of transcriptome quality and purity appear below.
Repeats and transposons
Only 4.4 % of the transcriptome contig base pairs are marked as repetitive by RepeatMasker, breaking down as 1 % long interspersed elements (LINEs, mostly L1/CIN4 and RTE/Bov-B), 0.9 % Ty1-copia and gypsy/DIRS1 long terminal repeats (LTRs), 0.5 % DNA transposons (0.2 % hobo-Activator), 0.5 % unclassified, 0.7 % low complexity, 0.5 % simple repeats, and 0.1 % small RNA. This overall percentage is comparable to a 454-based transcriptome assembly in Pinus contorta, but much higher than that observed in many other EST-based studies in plants . LINE/LTR/DNA-transposon containing contigs tended to be higher than average coverage, thus the large pool of low expression contigs (Additional file 1) does not seem to be mostly from transposons or retroelements.
Gene models and UTRs
About 4.4 k high-confidence, complete, single-exon gene models and 101 intron-containing models were selected from a preliminary GlimmerHMM  calling with Arabidopsis parameters to bootstrap the AUGUSTUS gene caller . After re-training and re-running AUGUSTUS, we explored coverage patterns in contigs containing tentative gene models. A random sampling of contigs as well as most of the 384 contigs that had more than one gene model showed expected coverage patterns. However, 90 contigs with multiple gene models showed sharp coverage discontinuities between models, suggesting such contigs were over-assembled fusions of multiple transcripts. These were split into separate contigs and final AUGUSTUS runs made. The result is 14,898 complete gene models, 6,826 missing the 3′ end, 4,577 missing the 5′ end, and 2,375 internal fragments (missing both 3′ and 5′ ends). All models total 28 M coding nucleotides. Twenty-four percent of all models and nine percent of complete models contain introns, and only one percent of contigs with at least one gene model have multiple gene models. Assuming that the 13,778 partial gene models represent at least 5,000 separate genes, then about 20,000 total genes are represented in our bud/leaf/flower transcriptome — a number comparable to what one would expect from a given tissue at a given life stage in Arabidopsis (e.g., NCBI SRX145413 Col-0 wild type leaf RNA-Seq as a generic representative ).
The amino acid usage and model length distributions compared favorably with A. thaliana TAIR10 genes  (Additional file 6). 5ʹ- and 3ʹ-UTR lengths are reasonable, amounting to about 1/3 of transcript lengths, but are slightly longer than in Arabidopsis (Additional file 6), perhaps due to the tendency for model organism projects to be somewhat conservative in UTR annotation and Quercus having a genome roughly six times the size of Arabidopsis . Our oak draft transcriptome has 88 % of the complete protein count and 94 % of the partial protein count of core eukaryotic genes (CEGMA 2.4 ) found in Populus trichocarpa, with 69–74 % of genes having orthologs. Finally, an analysis based on reciprocal best hits with BLASTp alignments between six-frame contig translations and a large pool of NCBI proteins showed that the most common organisms closest to individual contigs of a draft version of the transcriptome were all plants, especially (in descending order) Vitis vinifera, Populus trichocarpa, Glycine max, Arabidopsis thaliana, Ricinus communis, and Medicago truncatula.
Gene and domain annotation
We identified 9,431 oak–Arabidopsis gene pairs as being one-to-one orthologs (7,543 with a complete oak model). These were generally near-entire on both sides, with a mode of 70 % amino acid identity (Additional file 7). Oak contigs of low coverage and Arabidopsis loci of low expression (based on accession SRX145413  as a generic RNA-Seq data set) are much less likely to have an orthologous pairing, but many of those at higher coverage levels are captured (Additional file 8). Additionally, 50 % of Arabidopsis SRX145413 reads aligning to an Arabidopsis gene are to genes with a called oak ortholog, suggesting the ortholog calls capture many commonly expressed genes. Further, gene expression levels of orthologous pairs are correlated (Additional file 9). In addition, the distribution of Gene Ontology (GO) Plant Slim  terms is comparable between all TAIR10 loci and those having a called Quercus ortholog (Additional file 10), demonstrating that the Quercus transcriptome contains a representative and wide variety of genes. Additional functional annotation is provided by ~50,000 hits to Pfam from 19,365 distinct oak gene models, with ~3,800 of the 14,831 Pfam accessions appearing at least once. These numbers are similar to a run on Arabidopsis TAIR10 with the same parameters, resulting in ~48,000 hits to ~22,000 genes (collapsing over splice variants, giving for each gene and domain the maximum number of hits over the gene’s versions) with ~4,200 distinct Pfam accessions appearing at least once.
Using a GATK-based pipeline , we found ~1.7 million possible sequence variants when considering all 24 samples (Q. lobata, Q garryana, and the Q. lobata–Q. douglasii hybrid). After filtering and restricting only to SNPs, we found over 1.1 million, of which 98.5 % were diallelic, 1.5 % were triallelic, and 0.02 % were tetrallelic, with overall transition:transversion ratio of 1.9. The filtered diallelic SNP loci were used in subsequent analyses below. Among only the 22 Q. lobata samples, we identified about 900 k diallelic SNPs. Our SNP data set is the largest available in any Quercus species  and among the largest for trees [51–53].
The overall SNP locus rate per base pair was 1.5 % among all samples and 1.2 % within Q. lobata. Such rates vary depending on organism, degree of sampling from a population, and depth of sequencing, but those here are generally consistent with studies in other plants [54, 55], including oaks (when considering the rate across all contigs, not just contigs with variant loci) [5, 41], but somewhat lower than Arabidopsis that was sampled more broadly . SNP locus rates varied in patterns similar to those in an Arabidopsis genomic study  depending on whether the loci were within coding, intron, or UTR sequence (Additional file 11). The main difference seen can be explained by a drop in our power to detect variants at contig edges where coverage drops off (whereas transcript edges are not special when sequencing whole genome libraries). For orthologous genes among Quercus and Arabidopsis, nucleotide diversity at synonymous sites (πS = 0.004) within Q. lobata was also consistent with other trees  and plants in general . Nucleotide diversity is an unbiased estimate of the population mutation rate (θ = 4N eμ) and thus independent of sample size [59, 60], although it too depends somewhat on the ability to identify SNPs and accurately genotype, which in turn can depend on sample size and coverage. SNPs in gene model coding regions also led to amino acid change distributions consistent with those observed in Arabidopsis (Additional files 12) and organisms generally (Additional file 13). Further, genotype calls were largely in line with expectations under Hardy-Weinberg equilibrium (Additional file 14). A number of factors can be attributed to the slight deviations seen from Hardy-Weinberg expectations, given the overly simplistic biological assumptions of that model. Notably, where the deviations are largest is where there is by far the least data, thus at most loci the genotype frequency is consistent with expectations. Overall, these comparisons suggest that the final list of called SNP loci and associated genotypes are generally of high quality.
Patterns of molecular evolution among species
Abiotic or biotic stress response and flowering/seed development genes with d N/d S > 1
Pfam or TAIR id(s)
d N/d S
Q. lobata v. Q. garryana
hybrid v. Q. garryana
Biotic or abiotic response
Homeodomain-like superfamily protein
light-harvesting complex of photosystem II 5
co-factor for nitrate, reductase and xanthine dehydrogenase 7
growth-regulating factor 5
Drought-responsive family protein
DNAJ heat shock N-terminal domain-containing protein
cell wall / vacuolar inhibitor of fructosidase 2
hydroxyproline-rich glycoprotein family protein
DNA mismatch repair protein MutS, type 2
Leucine-rich repeat (LRR) family protein
ATP binding microtubule motor family protein
endonuclease V family protein
5ʹ-3ʹ exonuclease family protein
Arabidopsis broad-spectrum mildew resistance protein
chitin elicitor receptor kinase 1
TIR and NB-ARC domains
PF13676; PF01582; PF00931
abscisic acid responsive elements-binding factor 2
Auxin-responsive family protein
Zinc finger C-x8-C-x5-C-x3-H type family protein
K + transporter 1
SUPPRESSOR OF AUXIN RESISTANCE 3
Late embryogenesis abundant (LEA) hydroxyproline-rich glycoprotein family
F-box family protein
cyclic nucleotide-binding transporter 1
ARP protein (REF)
purine permease 10
COP1-interacting protein 7
HhH-GPD base excision DNA repair family protein
5ʹ-3ʹ exonuclease family protein
Flowering and seed development
Male sterility protein; 3-β hydroxysteroid dehydrogenase/isomerase, NAD dependent epimerase/dehydratase families
PF07993; PF01073; PF01370
hydroxyproline-rich glycoprotein family protein
S-locus glycoprotein family; D-mannose binding lectin; PAN-like, protein kinase, protein tyrosine kinase domains
PF08276; PF00954; PF01453; PF00069; PF07714
Glucose-methanol-choline (GMC) oxidoreductase family protein
myosin heavy chain-related; maternal effect embryo arrest 13
Enoyl-CoA hydratase/isomerase family
S-locus glycoprotein family; D-mannose binding lectin
S-locus glycoprotein family; D-mannose binding lectin; Protein kinase, protein tyrosine kinase domains
PF00954; PF01453; PF00069; PF07714
TCP-1/cpn60 chaperonin family protein; embryo defective 3007
Within species, high genetic diversity at disease resistance genes is thought to play a role in conferring resistance to a broad diversity of pathogens and has been widely observed in plants [66, 72, 73] and animals . For example, pathogen pressure on common alleles can lead to positive selection for rare alleles, thus promoting diversity [69, 75]. When biotic conditions change among populations or lineages, this underlying mechanism can continue, leading to diversifying selection among species for different alleles or suites of alleles. Thus, our observation that disease response genes are under divergent selection is consistent with the general observation that diseases, parasites (e.g., gall wasps and mistletoes), and fungal associates (e.g., mycorrhizal fungi) are often specific to particular plant species, including oaks [76–78]. Such pathogen pressure has been implicated as an important factor explaining biodiversity in tropical forests [79, 80].
The remaining Pfam accessions associated with significantly high d N/d S are involved in a variety of functions. UDPGT-containing genes are involved in pigment biosynthesis in plants as well as toxin removal in mammals. PPR (pentatricopeptide repeat) genes are involved in organelle biogenesis and bind chloroplast RNA, suggesting a potential role in RNA editing [81, 82]. AAA (ATPase) genes are involved in a variety of functions, especially signal transduction and gene expression regulation. F-box associated (FBA) genes perform diverse roles in plants, including flowering and growth regulation, self-incompatibility, leaf senescence, and various abiotic and biotic responses . In addition, the S-locus glycoprotein (S_locus_glycop) family related to self-incompatibility had significantly high d N/d S (Q = 0.02), and three specific genes containing copies of it had d N/d S > 1 (Table 2). Although variation at S-loci is often found to predate speciation [83, 84], the role of these loci in mating compatibility and reproduction gives them potential to be involved in divergence in some contexts. Several other genes influencing floral and embryo development also had d N/d S > 1, consistent with the idea that some reproductive elements might differ among species. Finally, a number of abiotic response genes had high d N/d S, including those involved in light response (e.g., COP1-interacting protein 7, light-harvesting complex of photosystem II 5, and photolyase 1), heat shock (DNAJ heat shock N-terminal domain-containing protein), and drought (drought-responsive family protein) (Table 2).
As complementary tests of positive selection in Q. lobata relative to its ancestor with Q. garryana, we calculated for each gene Fay and Wu’s H, which is expected to have negative values under positive selection . We did not find any class of genes defined by the Pfam accessions they contained to have unusually negative or positive H (Q > 0.39) (Additional file 19). However, 1,380 genes have low values (H < −2), and 151 individual genes have very low values (H < −5), suggestive of positive selection (Additional files 15 and 20). Among all these, 37 also had d N/d S > 1, of which 4 likely play a role in abiotic or biotic stress response: light-harvesting complex of photosystem II 5 (m01oak00269cC-t01.1; H = −3.38), K+ transporter (m01oak16383cC-t01.1; H = −2.73), late embryogenesis abundant (LEA) hydroxyproline-rich glycoprotein family (m01oak02432Ct-t01.1; H = −2.92), and an AAA domain-containing protein (m01oak15073cF-t01.1; H = −4.01) (Table 2). The strongest evidence for positive selection is for light-harvesting complex of photosystem II 5, which has both very high d N/d S (Table 2) and low H. However, there was no overall correlation of d N/d S with H (r = 0.03). Given our limited sampling, our primary goal was to identify functional classes of genes under positive selection, and it appears that the signal in H is not strong enough to independently confirm the d N/d S results.
In summary, we found preliminary support for divergent selection on biotic and abiotic stress response genes based on d N/d S, consistent with the hypothesis that ecological forces are important in the divergence and potentially in the maintenance of species boundaries in California oaks. Identifying the classes of genes under divergent selection provides insight into the factors most strongly involved in divergence among these hybridizing taxa, but warrants future direct investigation of the role of ecological factors in speciation and maintenance of species boundaries.
We have constructed a draft transcriptome for several California oaks that includes transcript contig nucleotide sequences, gene models (including CDS, UTR, and intron annotations), and functional associations based on Pfam domain occurrences and gene orthologs with Arabidopsis. Although understandably not at the quality of a model organism, this transcriptome appears to be a large representative fraction of transcribed loci in Quercus and is the most complete for this genus reported to date. Many gene models are of good quality and complete, many transcripts correspond to exactly one gene, and individual genes are generally not multiply assembled, even though Quercus has relatively high genetic variation across individuals. We further identified over 1.1 million high-quality diallelic SNP loci and genotyped 24 individuals, representing the largest SNP resource available in Quercus  and among the highest for any tree [51–53]. This transcriptome provides a solid foundation for studies of genetic variation and gene expression in Quercus, and will enable research in comparative genomics, phylogeography, hybridization, adaptation genomics, and quantitative genetics in oaks.
By investigating patterns of nonsynonymous and synonymous SNP variation among classes of genes among species, we found preliminary support that biotic stress is a major factor in divergent evolution among oak species. By implication, selection for stress response could contribute to the maintenance of species boundaries among these hybridizing species. Future work that includes more species and codon-level analyses, as well as studies of genomic patterns of introgression along hybrid zones, will shed more light on the specific loci that are most important in speciation, species integrity, and adaptation.
Sampling and sequencing
The first week of April 2011, we sampled bud, expanding leaf/flower bud, young leaf and/or young male flower tissue from 22 Quercus lobata individuals spread throughout its entire distribution, 1 Q. garryana individual, and 1 probable Q. lobata–Q. douglasii hybrid individual (Table 1). The Q. garryana was sampled to permit interspecific comparison, while Q. lobata was more intensively sampled to generate a SNP resource for other studies. The hybrid was originally collected under the assumption that it was Q. lobata because of its relatively deeply lobed leaves, but it was later determined that it had intermediate leaf characteristics consistent with other hybrid Q. lobata × Q. douglasii (Gugger PF, unpublished data) . In addition, this sample is identified as extraordinarily different from both Q. lobata and Q. garryana in a principal components analysis on SNPs, and across SNPs it is heterozygous unusually often, suggesting recent hybrid origin (Additional file 21). Q. douglasii and Q. lobata co-occur at the collection site. Samples were immediately frozen on dry ice in the field and stored at −80 °C until total RNA extraction. Preliminary RNA precipitation (http://openwetware.org/wiki/Conifer_RNA_prep) was followed by the Qiagen RNeasy Plant Mini Kit with DNase treatment protocol. RNA-Seq libraries with insert lengths 100–380 bp (mode = 170 bp) as determined by BioAnalyzer (Agilent) were prepared from 4 μg of RNA using an Illumina TruSeq RNA Sample Prep Kit. Each library was uniquely tagged using Illumina TruSeq indexed adapters #1 to #12 to enable equimolar 12-plexing of samples, using two lanes to accommodate the 24 individuals. Sequencing was paired end 100 + 100 bases on Illumina HiSeq 2000 v3 flow cells at the Neuroscience Genomics Core, UCLA.
Transcriptome assembly and quality
Only read pairs with Illumina RTA PF = 1 and perfect matches to an expected 6-mer index tag were retained. Adapters were investigated using lists of known Illumina adapter and library preparation-related sequences (allowing substitutions and indels, as well as fragmentary matches anywhere within reads and these sequences) as well as de novo assemblies of over-represented k-mers from read tails. Paired read ends that unambiguously overlapped by ≥ 16 bases with ≤ 3 mismatches were merged to produce a large population of longer virtual single end reads, taking advantage of double sequencing in the overlap to lower basecall errors. Read pairs not overlapping were retained as ordinary paired ends. Each read was cut to the longest interval with ≤ 1 N that does not start or end with an N (resolving any ties in favor of the 5ʹ-most interval); if < 50 bases remained, the read was discarded, and otherwise any Ns were replaced by independent uniformly random choices from A/C/G/T. Surviving single and paired end sequences were given to Ray 2.2.0 (k = 51) for the primary de novo assembly step . As already described, subsequently some contigs were merged, and then some were split. The resulting contigs contain no IUPAC ambiguous nucleotides or gaps.
Evidence for quality of the transcriptome has already been presented earlier; only methodological details missing in those discussions are given here in Methods. For testing the fraction of original reads mapping to the transcriptome assembly, Bowtie 2.1.0  was used set to ‘--sensitive-local’. Identification of rRNA contigs to investigate contamination was as follows (although not poly-adenylated, rRNA is often sufficiently abundant in total RNA that it survives in RNA-Seq experiments prepared with poly-A purification). From a 39,412-sequence multiple alignment of large-subunit rRNA across the entire tree of life from SILVA 115 , a consensus of a 1 kbp “core” interval of high conservation was identified and used as a BLASTn 2.2.26 (E-value threshold 10−6) target for both strands of transcriptome contigs. The entire, end-to-end sequences of contigs with alignments were given to NCBI web BLASTn to the ‘nr’ database to identify high-coverage, high-identity hits to known sequences, with the entire, end-to-end known sequences then used as new targets to align original read pairs against with Bowtie2 .
The EST-based transcript data for other Quercus species are Q. alba from eastern North America (WO454 Unigene V2, which includes aboveground and belowground tissues)  and Q. robur/Q. petraea from Europe (OakContigV1, which includes leaves, buds, flowers, pollen) . Using USEARCH 7.0  (with thresholds of 92 % nucleotide identity and E-value 10−10; conclusions are insensitive to reasonable modifications), we estimated reciprocal overlap in the number of transcripts among our data and the other oak species.
Repeats and transposons
Tandem Repeats Finder 4.04  parameters were 2 7 7 80 10 50 2000. RepeatMasker 3.3.0 (http://www.repeatmasker.org/) with Repbase 2011–09–20 (using all of Eukaryota)  was run on the final transcriptome, with identified repeats presented in lowercase in the nucleotide sequences of deposited contigs.
Bootstrapping of gene models began with an ab initio run of GlimmerHMM 3.0.2  with stock Arabidopsis parameters . The subset of models containing introns were further filtered before training: only models with 1–3 introns were allowed as candidates, and these were aligned as amino acids with lastal (−m 100 -j 3) of LAST 189  to a large collection (45 M sequences, 7 G amino acids) of NCBI reference proteins (‘nr’ 2011–11–30 04:12 + ‘env_nr’ 2011–11–19 22:13 + ‘pataa’ 2011–11–30 12:35), and only models with at least one alignment covering ≥ 95 % of the model and ≥ 95 % of the NCBI sequence were retained. Throughout this project, all amino acid translations were taken via the standard universal genetic code, NCBI #1. AUGUSTUS 2.5.5  was then trained (on both the intronless and high quality intron models) and used iteratively as already described. Note that, even when calling genes on both strands, AUGUSTUS is surprisingly sensitive to the strand (Watson or Crick) presented to it; thus AUGUSTUS runs involved running AUGUSTUS twice, once for bi-strand calling on Watson and once for bi-strand calling on Crick strand, and an overlap resolution/non-redundant extraction procedure preferring complete models used to merge the results. The coding sequence of non-complete models may start and/or end on other than a codon boundary; such cases are properly represented, e.g., via the frame/phase (column 8) information in the deposited GFF/GTF file that communicates the models.
UTRs were only assigned to AUGUSTUS (CDS + intron) models that had both a start and stop codon, and additionally were the only model on their parent contig. If there were any nucleotides outside the span of the AUGUSTUS model and upstream of it on its strand, they were all taken as 5ʹ-UTR in a single interval. Similarly, if there were any nucleotides outside the model and downstream of it on its strand, they were all taken as 3ʹ-UTR in a single interval. No UTRs were assigned to models on contigs with multiple models, or to models lacking either a start or stop codon. No attempt was made to identify any introns in UTRs.
The process of determining draft orthologs with Arabidopsis began with BLASTp 2.2.26 (E-value threshold 10−6) amino acid alignments in both directions between all 35,386 TAIR10 models (splice variants included) and the final AUGUSTUS Quercus proteins (both complete and partial, with final STOP removed when present and each internal STOP [rare] replaced with an x). For each direction, for each query, only hits with bitscore ≥ 99 % of the top bitscore for that query were retained; then, for each subject, only hits with bitscore ≥ 99 % of the top bitscore for that subject were retained. Form a bipartite digraph with vertices the TAIR10 genes (dropping splice variant distinctions) and oak models, with arcs from queries to subjects, and collapse parallel multiple arcs to single arcs. Remove all vertices except those with exactly one outgoing arc, and declare as putative orthologous pairs those pairs of vertices with arcs pointing reciprocally at each other. These orthologs provide inferred gene annotation information for oak in the form of protein product names and associated GO (accessed 2014-03-25) controlled vocabulary terms via TAIRs extensive curation of the Arabidopsis model organism [38, 49].
Annotational coverage of a larger fraction of oak models (at the expense of generally less specific information) is obtained by finding occurrences of Pfam accessions (domains, etc.) across all the gene models. HMMer 3.1b1’s  hmmsearch (leveraging Pfam’s carefully chosen high specificity, high sensitivity per-accession thresholds with the --cut_tc option) was used to identify occurrences of Pfam 27.0 A [33, 34] accessions via their consensus HMM amino acid profiles. As Pfam to Gene Ontology associations are available (http://geneontology.org/external2go/pfam2go, downloaded 2014-06-23), these provide additional inferred GO associations for oak models.
In each case, the inferential (restricted transitive) closure of Gene Ontology associations was taken, using the conventional inference rules recommended by GO.
Variants were called using a pipeline based on GATK 2.8.1 (2.5.2 for steps before genotyping) , roughly following GATK best practices [94, 95]. GATK is, however, primarily designed for low coverage genomic re-sequencing of mammalian model organisms. Adaptations and parameter choices are needed to apply it to RNA-Seq reads (with their highly variable coverage reaching extraordinary levels for the highest genes) and de novo assembled contigs in a non-model organism such as oak, not only for computational considerations but also quality of output. The pipeline begins with the final draft oak transcriptome contigs as reference and the Illumina RTA PF = 1 paired-end 100 + 100 base reads tagged by individual with per-base RTA Phred-scale quality scores (with Illumina EAMSS quality score adjustments applied) as reads.
Reads were aligned to the reference with Bowtie 2.1.0  as paired ends (−q --very-sensitive-local --n-ceil L,2,0 --dpad 12 --gbar 5 --score-min L,102,0 --minins 75 --maxins 500 --fr --no-dovetail --no-contain) to produce per-individual SAM files that were converted to BAM and sorted by aligned position on the reference. Alignments were filtered to only retain those with MAPQ ≥ 10 and FLAG bitwise AND with 0xF04 being zero (segment not unmapped, and not secondary alignment, and flagged as passing quality controls, and not flagged PCR/optical duplicate, and not a supplementary alignment). For each position on each reference contig, for each individual, if there were multiple reads whose alignments started at the position, then only a single one selected uniformly at random was retained; this capped coverage to a high but computationally manageable level for GATK, while maintaining high diversity of contributing reads. The resulting BAM files were merged into a single BAM (maintaining tags marking individuals).
GATK RealignerTargetCreator (−−downsampling_type NONE --baq OFF --windowSize 10 --mismatchFraction 0.0 --minReadsAtLocus 4 --maxIntervalSize 500) and then IndelRealigner (−−downsampling_type NONE -- baq CALCULATE_AS_NECESSARY --baqGapOpenPenalty 30 --LODThresholdForCleaning 5.0 --consensusDeterminationModel USE_READS --entropyThreshold 0.15 --maxReadsInMemory 150000 --maxIsizeForMovement 3000 --maxPositionalMoveAllowed 200 --maxConsensuses 30 --maxReadsForConsensuses 240 --maxReadsForRealignment 20000) were run (partitioning contigs into 84 piles with approximately equal numbers of aligned reads and running piles in parallel). As no high quality, near complete file of already known variants was available, GATK base score quality recalibration (BSQR) was skipped.
Numerous experimental trial runs of GATK HaplotypeCaller and UnifiedGenotyper were made, evaluating outputs by various statistical measures as well as by detailed hand examination (using the IGV genome browser ) of variant/genotype calls and aligned reads for a small random contig sampling. Considerations of output quality and total computational effort required led to a decision to determine possible alleles at each reference position outside of GATK and then use GATK UnifiedGenotyper with certain parameters (see below) to genotype individuals against these possible alleles. Formation of possible alleles began with the Bowtie2 original SAM outputs pooled across individuals, filtering to retain alignments with MAPQ ≥ 20 and FLAG bitwise AND with 0x704 being zero. Each alignment is broken into maximal runs of four types: insert-to-reference, deletes-from-reference, basepairs-in-1:1-correspondence-and-different, and basepairs-in-1:1-correspondence-and-identical; runs of the last type were dropped. The number of times observed for each distinct quintuple of run type, contig name, position on contig (which is between two reference basepairs for inserts to reference), reference sequence (for + strand), and read sequence (as if + strand) was determined, and only quintuples seen ≥ 5 times (and with read sequence having no IUPAC ambiguous nucleotides) were retained; those that survive constitute the tentative possible alleles. The tentative alleles were converted to VCF file format (by grouping alleles at each position into a locus, and, since VCF requires all alleles to be non-empty, loci that would otherwise involve an allele being a string of zero nucleotides [e.g., indels] were grown to include one more upstream nucleotide on the + strand, in the usual way for VCF). As GATK UnifiedGenotyper does not fully handle multiple VCF loci overlapping on the reference, each chain of overlapping tentative loci was merged (with a custom script, as GATK CombineVariants was found insufficient) into a single variant locus (growing the extent involved on the reference and adjusting alleles appropriately in response). The resulting VCF file defines the possible alleles that were used as the genotyping targets in the next paragraph.
Genotyping against the possible alleles was conducted with GATK UnifiedGenotyper (−−downsampling_type NONE --baq CALCULATE_AS_NECESSARY --baqGapOpenPenalty 30 --defaultBaseQualities 30 --heterozygosity 0.05 --indel_heterozygosity 0.005 --genotyping_mode GENOTYPE_GIVEN_ALLELES --input_prior 0.020408163265306 «…48 copies total; 0.0204… is 1/49» --input_prior 0.020408163265306 --standard_min_confidence_threshold_for_calling 10 --standard_min_confidence_threshold_for_emitting 10 --max_alternate_alleles 24 --contamination_fraction_to_filter 0.0 --pcr_error_rate 0.0001 --computeSLOD --annotateNDA --pair_hmm_implementation LOGLESS_CACHING --min_base_quality_score 20 --max_deletion_fraction 9.99 --allSitePLs --min_indel_count_for_genotyping 5 --min_indel_fraction_per_sample 0.25 --indelGapContinuationPenalty 10 --indelGapOpenPenalty 30 --sample_ploidy 2 --output_mode EMIT_VARIANTS_ONLY), once (the “SNV run”) specifically for single-nucleotide variants (−−genotype_likelihoods_model SNP, where reference and all alternate alleles are single nucleotides) and once (the “MNV” run) to add multi-nucleotide and indel variants (−−genotype_likelihoods_model BOTH, with reference and/or at least one alternate allele being other than a single nucleotide). Runs were merged with GATK CombineVariants (−−downsampling_type NONE --baq OFF --genotypemergeoption PRIORITIZE --rod_priority_list SNV,MNV --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED --printComplexMerges --excludeNonVariants --minimumN 1 --combineAnnotations). These jobs were broken into many parts run in parallel with individual parts run with GATK (−nt/-nct) threading; crashing of some parts was alleviated by re-running them without GATK threading. About 200 loci with 25+ possible alleles could not be fully processed. GATK variant quality score recalibration (VQSR) was skipped as no pre-existing high quality file of variants existed. The output VCF file is the “master” (unfiltered) deposited set of variants (with a reference and one or more variant [non-reference] alleles specified at each locus), genotypes (with each individual at each locus either assigned to two [not necessarily distinct] alleles, or left uncalled and assigned to no alleles), and a large variety of embedded statistics.
Our focus in this work is on diallelic SNPs. By examination of the distributions of the many metrics that GATK includes in its output VCF files and how statistical properties of variants (e.g., transition:transversion ratio) behave in different regimes of these metrics, filtering criteria for SNVs were established. Downstream analyses were restricted to SNP loci that satisfied all of the following: 17.0 ≤ QUAL < 100000.0, total AC is not 0 or 48, −3.5 < BaseQRankSum < 7.0, DP ≥ 5, Dels < 0.1, FS < 90.0, HaplotypeScore < 25.0, MQ ≥ 20.0, −25.0 < MQRankSum < 10.0, −3.5 < ReadPosRankSum < 9.0, and SB < 3.0; and further, individual genotype calls were filtered as follows (changing to uncalled those that fail either condition): DP ≤ 100 and GQ ≥ 20. Experimentation with GQ thresholds of 0, 10, 27, and 44 was also conducted. Downstream diallelic SNP analyses were performed on those surviving loci that had exactly two alleles represented across surviving genotypes, with one of these being the reference allele (and both being single nucleotides); these are the loci referred to in this work as “SNPs” unless otherwise specified.
Filtered diallelic SNP loci were divided into groups based on their position on the reference relative to gene models (i.e., start codons, stop codons, interior codons, introns, 5ʹ-UTR, 3ʹ-UTR, and remaining base pairs). The predicted coding effect of each variant allele landing in a codon was made by a script that considered each locus in isolation and examined the reference codon versus the alternate codon induced by the variant allele as interpreted by the standard universal genetic code, and codons with multiple loci were generally excluded from downstream analyses. A VCF-to-VCF run (in 100 parts) of SnpEff 3.2a  (with the oak transcriptome and gene models added) was also performed (−lof -oicr), although it was found that (although some of its source code seems to exist for the purpose) models not starting/ending on a codon boundary generated errors/warnings so that models not starting on a codon boundary had to be suppressed from the oak reference presented to it and its output does not contain all effects for all models.
For the Hardy-Weinberg analysis, allele frequencies were parameterized as the two-dimensional space of real triples (r, h, v) with r, h, v ≥ 0, r + h + v = 1, and r, h, and v defined as the fractions of individuals that are homozygous reference, heterozygous, or homozygous variant, respectively. Hardy-Weinberg equilibrium is a particular one-dimensional curve in this space. To estimate an empirical observed density from the oak data, only the 76,233 filtered diallelic SNP loci that, on restriction to the 22 Q. lobata individuals, had ≥ 11 called genotypes, ≥ 1 individual called homozygous reference, ≥ 1 individual called homozygous variant, and ≥ 1 individual called heterozygous were used. The empirical density was taken as the average over these loci of the posterior density from the observed Q. lobata genotype calls for that locus starting from a uniform Dirichlet prior, and the figure in Additional file 14 presents the logarithm of the result in equilateral barycentric coordinates (known in this context as a de Finetti diagram). Note that we do not expect (and do not observe) uniform distribution on the one-dimensional Hardy-Weinberg equilibrium curve; there is great enrichment for r to be high (as many alleles determined in this study occur in few individuals). Hence, we examined where the empirical density attains its maximum along each line from h = 1 to h = 0 (each such line crosses the Hardy-Weinberg equilibrium curve exactly once) and find this location to be near the Hardy-Weinberg equilibrium curve for all lines.
Molecular evolutionary tests of divergent selection
We estimated the nonsynonymous versus synonymous substitution rates among species (d N/d S = K a/K s) for the coding sequence of each gene model to identify genes and functional groups of genes under the influence of positive divergent (d N/d S > 1) or purifying (d N/d S < 1) natural selection among species lineages . The d N/d S rates were calculated from the filtered diallelic SNPs with C++ code equivalent to dnds(…, ‘METHOD’,‘PBL’, ‘WINDOW’,inf, ‘ADJUSTSTOPS’,true) revision 126.96.36.199 from MATLAB Bioinformatics Toolbox 4.0 (MathWorks, Inc., Natick, MA, USA), which uses the Pamilo-Bianchi-Li method [98, 99] based on the Kimura two-parameter model  that keeps track of three levels of codon degeneracy and corrects for transition/transversion imbalance. We used this model to estimate d N, d S, and their variances by directly comparing each individual to the outgroup Q. garryana . For each gene, for each individual, two coding sequences (“haplotypes”) were formed. Each sequence has all coding nucleotides (variant or not) from the gene model, padded with ambiguous nucleotides to the nearest codon boundary on each end. Nucleotides not at a filtered diallelic SNPs are from the transcriptome contig (and common to all individuals), while at a filtered diallelic SNPs, the two sequences together represent the called genotype for the individual at that locus (and note that lack of haplotype phasing across such positions does not matter for dnds()’s calculations). Uncalled genotypes are replaced with ambiguous nucleotides. When two individuals are compared, haplotypes are concatenated, so that each coding position in the model contributes four times, once for each pairing of haplotype in one individual to haplotype in the other individual. We also tried alternative methods of estimating d N/d S by inferring an “ancestral” sequence for comparison to each extant individual’s sequence , as well as a variety of other simpler methods [102, 103], but the results were highly similar, and thus only the more conventional approach described first is presented here.
The bulk of d N/d S values were found to generally follow log-normal distributions (e.g., normal distributions after log2-transformation). Hence, when summarizing the 22 Q. lobata d N/d S values for a gene or class into a single value, a geometric mean was taken. To identify unusually extreme summarized gene values on log2 scale, values outside [−5.0, 1.0] were temporarily ignored (removing tails completely) and parameters of a normal distribution truncated to this interval determined by maximum likelihood. Tail probabilities of the untruncated log-normal (after going back to full range on linear scale) then provide a measure of unusualness; the one-tail α = 0.05 levels for high d N/d S values are those ratios above 0.99 for the Q. lobata versus Q. garryana comparison and above 0.85 for the Q. lobata–Q. douglasii hybrid versus Q. garryana comparison. Hence, d N/d S > 1 not only represents potentially divergent selection, but is also significantly higher than the bulk of ratios in both comparisons.
When closely related species are compared, some SNPs might still be segregating and thus not fixed among species. This fact has the potential to render d N/d S less sensitive to selection intensity making it a conservative test for positive selection, especially in the most extreme case of a single population [104, 105]. However, our data set represents a more favorable scenario of substantial divergence despite some shared polymorphism, suggesting that d N/d S is appropriate but underpowered for identifying cases of divergent selection . d N/d S can also be considered a conservative test for positive selection when estimated along entire coding regions as we do because different sites within genes might be under different selection pressures . In the case of low d N/d S, intraspecific and interspecific estimates are highly correlated in practice, regardless of theoretical considerations . Nonetheless, we cautiously used d N/d S primarily for assessing differences among broad functional categories of genes as defined by those that contain specific Pfam accessions, rather than making strong claims about specific loci. Functional classes of genes based on Pfam accessions were tested for unusually high or low values of d N/d S with Wilcoxon rank-sum tests [107, 108], and P-values were adjusted for multiple testing to Q-values using the false discovery rate method of Benjamini and Hochberg [109, 110]. Only rates computed from at least six substitutions of either kind (synonymous or not) were considered. If d N was estimated as zero, we assigned d N/d S = 0; if d S was estimated as zero, we assigned d N/d S = 10. Because of the use of rank-based statistical tests, results are not sensitive to reasonable alternative assignments in these cases.
To complement the d N/d S tests, for each gene-containing contig we computed Fay and Wu’s H as an independent test for positive selection in Q. lobata relative to its ancestor with Q. garryana. H is expected to be negative when there is positive selection leading to high frequency of a derived allele. To calculate H, we used C++ code equivalent to faywu00h_test in the MATLAB PGEToolbox , and as the ancestral sequence, we used the Q. garryana allele, or, when missing, the most common allele across the Q. lobata individuals. The latter choice should bias the H estimates towards more positive values because the putatively derived allele would then be at lower frequency in Q. lobata, thus making the test more conservative for inferring positive selection. We allowed up to 4 of 44 missing alleles (n = 22) within Q. lobata at each site by randomly resampling 40 alleles at each SNP locus. H was calculated for ten such replicates and summarized with the median. Functional classes of genes were tested for unusually high or low H using Wilcoxon rank-sum tests, and the resulting P-values adjusted for multiple testing in the same way as for d N/d S.
Mean heterozygosity per gene was calculated for each gene with at least 6 SNPs genotyped in at least 10 of 22 Q. lobata individuals and the Q. garryana individual, ignoring the hybrid individual.
Availability of supporting data
Illumina sequence reads, the final draft transcriptome contigs, annotations, variant calls, per-individual genotypes, and results of tests for natural selection are publically available through NCBI under project accession PRJNA282155 and/or a dedicated oak transcriptome assembly project page on the UCLA genomics resource website at http://genomes.mcdb.ucla.edu/OakTSA/.
We thank J. Ortego for assistance with sampling and M. Pellegrini for helpful discussions. We also acknowledge the University of California Reserve System and the California Department of Parks and Recreation for access to some sample sites; the Neuroscience Genomics Core at UCLA for its sequencing facilities; and the computational services available through the Hoffman2 Shared Cluster provided by the UCLA Institute for Digital Research and Education’s Research Technology Group. Funding was provided by research seed money from UCLA to VLS.
- Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006;313:1596–604.PubMedGoogle Scholar
- Bao Y, Xu S, Jing X, Meng L, Qin Z. De novo assembly and characterization of Oryza officinalis leaf transcriptome by using RNA-Seq. Biomed Res Int. 2015;2015:7.Google Scholar
- Sierro N, Battey J, Ouadi S, Bovet L, Goepfert S, Bakaher N, et al. Reference genomes and transcriptomes of Nicotiana sylvestris and Nicotiana tomentosiformis. Genome Biol. 2013;14:R60.PubMed CentralPubMedGoogle Scholar
- Kremer A, Abbott AG, Carlson JE, Manos PS, Plomion C, Sisco P, et al. Genomics of Fagaceae. Tree Genet Genomes. 2012;8:583–610.Google Scholar
- Ueno S, Le Provost G, Leger V, Klopp C, Noirot C, Frigerio J-M, et al. Bioinformatic analysis of ESTs collected by Sanger and pyrosequencing methods for a keystone forest tree species: oak. BMC Genomics. 2010;11:650.PubMed CentralPubMedGoogle Scholar
- Durand J, Bodenes C, Chancerel E, Frigerio J-M, Vendramin G, Sebastiani F, et al. A fast and cost-effective approach to develop and map EST-SSR markers: oak as a case study. BMC Genomics. 2010;11:570.PubMed CentralPubMedGoogle Scholar
- Bodénès C, Chancerel E, Gailing O, Vendramin GG, Bagnoli F, Durand J, et al. Comparative mapping in the Fagaceae and beyond with EST-SSRs. BMC Plant Biol. 2012;12:153.PubMed CentralPubMedGoogle Scholar
- Tarkka MT, Herrmann S, Wubet T, Feldhahn L, Recht S, Kurth F, et al. OakContigDF159.1, a reference library for studying differential gene expression in Quercus robur during controlled biotic interactions: use for quantitative transcriptomic profiling of oak roots in ectomycorrhizal symbiosis. New Phytol. 2013;199:529–40.PubMedGoogle Scholar
- Fagaceae Genomics Web. [http://www.fagaceae.org/].
- Cánovas A, Rincon G, Islas-Trejo A, Wickramasinghe S, Medrano J. SNP discovery in the bovine milk transcriptome using RNA-Seq technology. Mamm Genome. 2010;21:592–8.PubMed CentralPubMedGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011. doi:10.1038/nbt.1883.
- Pearse IS, Hipp AL. Phylogenetic and trait similarity to a native species predict herbivory on non-native oaks. Proc Natl Acad Sci U S A. 2009;106:18097–102.PubMed CentralPubMedGoogle Scholar
- Craft KJ, Ashley MV, Koenig WD. Limited hybridization between Quercus lobata and Quercus douglasii (Fagaceae) in a mixed stand in central coastal California. Am J Bot. 2002;89:1792–8.PubMedGoogle Scholar
- Nixon KC, Muller CH. Quercus Linnaeus Sect. Quercus White Oaks. In: Committee FONAE, editor. Flora of North America North of Mexico. New York: Oxford University Press; 1997. p. 436–506.Google Scholar
- Burns RM, Honkala BH. Silvics of North America: Hardwoods. Washington, DC: U.S. Department of Agriculture Forest Service; 1990.Google Scholar
- Rieseberg LH, Widmer A, Arntz AM, Burke JM. Directional selection is the primary cause of phenotypic diversification. Proc Natl Acad Sci. 2002;99:12242–5.PubMed CentralPubMedGoogle Scholar
- Seehausen O, Butlin RK, Keller I, Wagner CE, Boughman JW, Hohenlohe PA, et al. Genomics and the origin of species. Nat Rev Genet. 2014;15:176–92.PubMedGoogle Scholar
- Hoekstra HE, Hoekstra JM, Berrigan D, Vignieri SN, Hoang A, Hill CE, et al. Strength and tempo of directional selection in the wild. Proc Natl Acad Sci. 2001;98:9157–60.PubMed CentralPubMedGoogle Scholar
- Slatkin M. Gene flow in natural populations. Annu Rev Ecol Syst. 1985;16:393–430.Google Scholar
- Muhlfeld CC, Kovach RP, Jones LA, Al-Chokhachy R, Boyer MC, Leary RF, et al. Invasive hybridization in a threatened species is accelerated by climate change. Nat Clim Chang. 2014;4:620–4.Google Scholar
- Fitzpatrick BM, Johnson JR, Kump DK, Smith JJ, Voss SR, Shaffer HB. Rapid spread of invasive genes into a threatened native species. Proc Natl Acad Sci. 2010;107:3606–10.PubMed CentralPubMedGoogle Scholar
- Abbott R, Albach D, Ansell S, Arntzen JW, Baird SJE, Bierne N, et al. Hybridization and speciation. J Evol Biol. 2013;26:229–46.PubMedGoogle Scholar
- Rieseberg LH, Raymond O, Rosenthal DM, Lai Z, Livingstone K, Nakazato T, et al. Major ecological transitions in wild sunflowers facilitated by hybridization. Science. 2003;301:1211–6.PubMedGoogle Scholar
- Lexer C, Fay MF. Adaptation to environmental stress: a rare or frequent driver of speciation? J Evol Biol. 2005;18:893–900.PubMedGoogle Scholar
- Strasburg JL, Sherman NA, Wright KM, Moyle LC, Willis JH, Rieseberg LH. What can patterns of differentiation across plant genomes tell us about adaptation and speciation? Philos Trans R Soc Lond B Biol Sci. 2012;367:364–73.PubMed CentralPubMedGoogle Scholar
- Yang ZH, Bielawski JP. Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 2000;15:496–503.PubMedGoogle Scholar
- Muller CH. Ecological control of hybridization in Quercus: a factor in the mechanism of evolution. Evolution. 1952;6:147–61.Google Scholar
- Van Valen L. Ecological species, multispecies, and oaks. Taxon. 1976;25:233–9.Google Scholar
- Cavender-Bares J, Pahlich A. Molecular, morphological and ecological niche differentiation of sympatric sister oak species, Quercus virginiana and Q. geminata (Fagaceae). Am J Bot. 2009;96:1690–702.PubMedGoogle Scholar
- Gailing O, Curtu AL. Interspecific gene flow and maintenance of species integrity in oaks. Ann For Res. 2014;57:5–18.Google Scholar
- Goicoechea PG, Petit RJ, Kremer A. Detecting the footprints of divergent selection in oaks with linked markers. Heredity. 2012;109:361–71.PubMed CentralPubMedGoogle Scholar
- Whittemore AT, Schaal BA. Interspecific gene flow in sympatric oaks. Proc Natl Acad Sci U S A. 1991;88:2540–4.PubMed CentralPubMedGoogle Scholar
- Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–40.PubMed CentralPubMedGoogle Scholar
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–30.PubMed CentralPubMedGoogle Scholar
- Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17:1519–33.PubMed CentralPubMedGoogle Scholar
- Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–80.PubMed CentralPubMedGoogle Scholar
- Kremer A, Casasoli M, Barreneche T, Bodenes C, Sisco P, Kubisiak T, et al. Fagaceae Trees. In: Kole C, editor. Genome Mapping and Molecular Breeding in Plants, Volume 7, Forest Trees. Volume 7. New York: Springer; 2007. p. 161.Google Scholar
- Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008;36:D1009–14.PubMed CentralPubMedGoogle Scholar
- Initiative TAG. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.Google Scholar
- Gugger PF, Cavender-Bares J. Molecular and morphological support for a Florida origin of the Cuban oak. J Biogeogr. 2013;40:632–45.Google Scholar
- Gugger PF, Ikegami M, Sork VL. Influence of late Quaternary climate change on present patterns of genetic variation in valley oak, Quercus lobata Née. Mol Ecol. 2013;22:3598–612.PubMedGoogle Scholar
- Petit RJ, Csaikl UM, Bordács S, Burg K, Coart E, Cottrell J, et al. Chloroplast DNA variation in European white oaks: phylogeography and patterns of diversity based on data from over 2600 populations. For Ecol Manag. 2002;156:5–26.Google Scholar
- Šmarda P, Bureš P, Šmerda J, Horová L. Measurements of genomic GC content in plant genomes with flow cytometry: a test for reliability. New Phytol. 2012;193:513–21.PubMedGoogle Scholar
- Parchman T, Geist K, Grahnen J, Benkman C, Buerkle CA. Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics. 2010;11:180.PubMed CentralPubMedGoogle Scholar
- Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20:2878–9.PubMedGoogle Scholar
- Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004;32:W309–12.PubMed CentralPubMedGoogle Scholar
- Moissiard G, Cokus SJ, Cary J, Feng S, Billi AC, Stroud H, et al. MORC family ATPases required for heterochromatin condensation and gene silencing. Science. 2012;336:1448–51.PubMed CentralPubMedGoogle Scholar
- Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23:1061–7.PubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.PubMed CentralPubMedGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.PubMed CentralPubMedGoogle Scholar
- Pavy N, Deschênes A, Blais S, Lavigne P, Beaulieu J, Isabel N, et al. The landscape of nucleotide polymorphism among 13,500 genes of the conifer Picea glauca, relationships with functions, and comparison with Medicago truncatula. Genome Biol Evol. 2013;5:1910–25.PubMed CentralPubMedGoogle Scholar
- Muller T, Ensminger I, Schmid K. A catalogue of putative unique transcripts from Douglas-fir (Pseudotsuga menziesii) based on 454 transcriptome sequencing of genetically diverse, drought stressed seedlings. BMC Genomics. 2012;13:673.PubMed CentralPubMedGoogle Scholar
- Geraldes A, Pang J, Thiessen N, Cezard T, Moore R, Zhao Y, et al. SNP discovery in black cottonwood (Populus trichocarpa) by population transcriptome resequencing. Mol Ecol Resour. 2011;11:81–92.PubMedGoogle Scholar
- Subbaiyan GK, Waters DLE, Katiyar SK, Sadananda AR, Vaddadi S, Henry RJ. Genome-wide DNA polymorphisms in elite indica rice inbreds discovered by whole-genome sequencing. Plant Biotechnol J. 2012;10:623–34.PubMedGoogle Scholar
- Gaur R, Azam S, Jeena G, Khan AW, Choudhary S, Jain M, et al. High-throughput SNP discovery and genotyping for constructing a saturated linkage map of chickpea (Cicer arietinum L.). DNA Res. 2012;19:357–73.PubMed CentralPubMedGoogle Scholar
- Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 2011;43:956–63.PubMedGoogle Scholar
- Mosca E, Eckert AJ, Liechty JD, Wegrzyn JL, La Porta N, Vendramin GG, et al. Contrasting patterns of nucleotide diversity for four conifers of Alpine European forests. Evol Appl. 2012;5:762–75.PubMed CentralPubMedGoogle Scholar
- Branca A, Paape TD, Zhou P, Briskine R, Farmer AD, Mudge J, et al. Whole-genome nucleotide diversity, recombination, and linkage disequilibrium in the model legume Medicago truncatula. Proc Natl Acad Sci. 2011;108:E864–70.PubMed CentralPubMedGoogle Scholar
- Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci U S A. 1979;76:5269–73.PubMed CentralPubMedGoogle Scholar
- Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh Y-P, Hahn MW, et al. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 2007;5, e310.PubMed CentralPubMedGoogle Scholar
- Buschiazzo E, Ritland C, Bohlmann J, Ritland K. Slow but not low: genomic comparisons reveal slower evolutionary rate and higher dN/dS in conifers compared to angiosperms. BMC Evol Biol. 2012;12:8.PubMed CentralPubMedGoogle Scholar
- van der Biezen EA, Jones JDG. The NB-ARC domain: a novel signalling motif shared by plant resistance gene products and regulators of cell death in animals. Curr Biol. 1998;8:R226–8.PubMedGoogle Scholar
- Van Der Biezen EA, Jones JDG. Plant disease-resistance proteins and the gene-for-gene concept. Trends Biochem Sci. 1998;23:454–6.PubMedGoogle Scholar
- Jones DA, Jones JDG. The Role of Leucine-Rich Repeat Proteins in Plant Defences. In: J.H. Andrews ICT, Callow JA, editors. Advances in Botanical Research. Volume Volume 24. San Diego: Academic; 1997. p. 89–167.Google Scholar
- Koonin EV, Aravind L. The NACHT family – a new group of predicted NTPases implicated in apoptosis and MHC transcription activation. Trends Biochem Sci. 2000;25:223–4.PubMedGoogle Scholar
- Yang S, Li J, Zhang X, Zhang Q, Huang J, Chen J-Q, et al. Rapidly evolving R genes in diverse grass species confer resistance to rice blast disease. Proc Natl Acad Sci. 2013;110:18572–7.PubMed CentralPubMedGoogle Scholar
- Yang X, Kalluri UC, Jawdy S, Gunter LE, Yin T, Tschaplinski TJ, et al. The F-box gene family is expanded in herbaceous annual plants relative to woody perennial plants. Plant Physiol. 2008;148:1189–0.PubMed CentralPubMedGoogle Scholar
- Xiao S, Ellwood S, Calis O, Patrick E, Li T, Coleman M, et al. Broad-spectrum mildew resistance in Arabidopsis thaliana mediated by RPW8. Science. 2001;291:118–20.PubMedGoogle Scholar
- Bergelson J, Kreitman M, Stahl EA, Tian D. Evolutionary dynamics of plant R-genes. Science. 2001;292:2281–5.PubMedGoogle Scholar
- Wang G-L, Ruan D-L, Song W-Y, Sideris S, Chen L, Pi L-Y, et al. Xa21D encodes a receptor-like molecule with a leucine-rich repeat domain that determines race-specific recognition and is subject to adaptive evolution. Plant Cell Online. 1998;10:765–79.Google Scholar
- Meyers BC, Shen KA, Rohani P, Gaut BS, Michelmore RW. Receptor-like genes in the major resistance locus of lettuce are subject to divergent selection. Plant Cell Online. 1998;10:1833–46.Google Scholar
- Wan H, Yuan W, Bo K, Shen J, Pang X, Chen J. Genome-wide analysis of NBS-encoding disease resistance genes in Cucumis sativus and phylogenetic study of NBS-encoding genes in Cucurbitaceae crops. BMC Genomics. 2013;14:109.PubMed CentralPubMedGoogle Scholar
- Bakker EG, Toomajian C, Kreitman M, Bergelson J. A genome-wide survey of R gene polymorphisms in Arabidopsis. Plant Cell Online. 2006;18:1803–18.Google Scholar
- The MHCsc. Complete sequence and gene map of a human major histocompatibility complex. Nature. 1999;401:921–3.Google Scholar
- Tiffin P, Moeller DA. Molecular evolution of plant immune system genes. Trends Genet. 2006;22:662–70.PubMedGoogle Scholar
- Richard F, Millot S, Gardes M, Selosse MA. Diversity and specificity of ectomycorrhizal fungi retrieved from an old-growth Mediterranean forest dominated by Quercus ilex. New Phytol. 2005;166:1011–23.PubMedGoogle Scholar
- Roslin T, Laine A-L, Gripenberg S. Spatial population structure in an obligate plant pathogen colonizing oak Quercus robur. Funct Ecol. 2007;21:1168–77.Google Scholar
- Abrahamson WG, Hunter MD, Melika G, Price PW. Cynipid gall-wasp communities correlate with oak chemistry. J Chem Ecol. 2003;29:209–23.PubMedGoogle Scholar
- Gilbert G, Hubbell SP. Plant diseases and the conservation of tropical forests. Bioscience. 1996;46:98–106.Google Scholar
- Wills C, Condit R, Foster RB, Hubbell SP. Strong density- and diversity-related effects help to maintain tree species diversity in a neotropical forest. Proc Natl Acad Sci. 1997;94:1252–7.PubMed CentralPubMedGoogle Scholar
- Kotera E, Tasaka M, Shikanai T. A pentatricopeptide repeat protein is essential for RNA editing in chloroplasts. Nature. 2005;433:326–30.PubMedGoogle Scholar
- Lurin C, Andrés C, Aubourg S, Bellaoui M, Bitton F, Bruyère C, et al. Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis. Plant Cell Online. 2004;16:2089–103.Google Scholar
- Ioerger TR, Clark AG, Kao TH. Polymorphism at the self-incompatibility locus in Solanaceae predates speciation. Proc Natl Acad Sci. 1990;87:9732–5.PubMed CentralPubMedGoogle Scholar
- Dwyer K, Balent M, Nasrallah J, Nasrallah M. DNA sequences of self-incompatibility genes from Brassica campestris and B. oleracea: polymorphism predating speciation. Plant Mol Biol. 1991;16:481–6.PubMedGoogle Scholar
- Blanc G, Wolfe KH. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell Online. 2004;16:1679–91.Google Scholar
- Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics. 2000;155:1405–13.PubMed CentralPubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.PubMed CentralPubMedGoogle Scholar
- Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41:D590–6.PubMed CentralPubMedGoogle Scholar
- Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–1.PubMedGoogle Scholar
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetetive elements. Cytogenet Genome Res. 2005;110:462–7.PubMedGoogle Scholar
- St Laurent G, Shtokalo D, Tackett M, Yang Z, Eremina T, Wahlestedt C, et al. Intronic RNAs constitute the major fraction of the non-coding RNA in mammalian cells. BMC Genomics. 2012;13:504.PubMed CentralPubMedGoogle Scholar
- Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21:487–93.PubMed CentralPubMedGoogle Scholar
- Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39:W29–37.PubMed CentralPubMedGoogle Scholar
- Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. In: Current Protocols in Bioinformatics. Hoboken: Wiley; 2002.Google Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8.PubMed CentralPubMedGoogle Scholar
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotech. 2011;29:24–6.Google Scholar
- Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly. 2012;6:80–92.PubMed CentralPubMedGoogle Scholar
- Pamilo P, Bianchi NO. Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol Biol Evol. 1993;10:271–81.PubMedGoogle Scholar
- Li W-H. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993;36:96–9.PubMedGoogle Scholar
- Kimura M. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980;16:111–20.PubMedGoogle Scholar
- Kosakovsky Pond SL, Frost SDW. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol. 2005;22:1208–22.PubMedGoogle Scholar
- Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986;3:418–26.PubMedGoogle Scholar
- Li WH, Wu CI, Luo CC. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 1985;2:150–74.PubMedGoogle Scholar
- Kryazhimskiy S, Plotkin JB. The population genetics of d N/d S. PLoS Genet. 2008;4, e1000304.PubMed CentralPubMedGoogle Scholar
- Mugal CF, Wolf JBW, Kaj I. Why time matters: codon evolution and the temporal dynamics of d N/d S. Mol Biol Evol. 2014;31:212–31.PubMed CentralPubMedGoogle Scholar
- Liu J, Zhang Y, Lei X, Zhang Z. Natural selection of protein structural and functional properties: a single nucleotide polymorphism perspective. Genome Biol. 2008;9:R69.PubMed CentralPubMedGoogle Scholar
- Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945;1:80–3.Google Scholar
- Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18:50–60.Google Scholar
- Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100:9440–5.PubMed CentralPubMedGoogle Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Meth. 1995;57:289–300.Google Scholar
- Cai JJ. PGEToolbox: a MATLAB toolbox for population genetics and evolution. J Hered. 2008;99:438–40.PubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.