Comparative transcriptomics of a complex of four European pine species
© Wachowiak et al. 2015
Received: 21 October 2014
Accepted: 24 February 2015
Published: 25 March 2015
Pinus sylvestris, P. mugo, P. uliginosa and P. uncinata are closely related but phenotypically and ecologically very distinct European pine species providing an excellent study system for analysis of the genetic basis of adaptive variation and speciation. For comparative genomic analysis of the species, transcriptome sequence was generated for 17 samples collected across the European distribution range using Illumina paired-end sequencing technology.
De novo transcriptome assembly of a reference sample of P. sylvestris contained 40968 unigenes, of which fewer than 0.5% were identified as putative retrotransposon sequences. Based on gene annotation approaches, 19659 contigs were identified and assigned to unique genes covering a broad range of gene ontology categories. About 80% of the reads from each sample were successfully mapped to the reference transcriptome of P. sylvestris. Single nucleotide polymorphisms were identified in 22041-24096 of the unigenes providing a set of ~220-262 k SNPs identified for each species. Very similar levels of nucleotide polymorphism were observed across species (π=0.0044-0.0053) and highest pairwise nucleotide divergence (0.006) was found between P. mugo and P. sylvestris at a common set of unigenes.
The study provides whole transcriptome sequence and a large set of SNPs to advance population and association genetic studies in pines. Our study demonstrates that transcriptome sequencing can be a very useful approach for development of novel genomic resources in species with large and complex genomes.
Forest trees constitute over 80% of terrestrial biomass and harbour more than 50% of terrestrial biodiversity providing wood material and fundamental ecosystem services for humans including preservation of biodiversity, carbon cycling, climate regulation and preservation of water quality and soils [1,2]. Understanding the genomic basis of adaptation and architecture of complex phenotypic traits is needed for development of diagnostic tools for the conservation, restoration and management of natural populations and for genetic improvement programmes . Understanding plant adaptation is also one of the main interests of evolutionary biology. So far however, knowledge of the mutations, genes and biochemical pathways involved in species evolution and underlying phenotypic and adaptive variation remain scarce mostly due to a lack of efficient methods for accessing the polymorphisms at the whole genome scale. Recent advances in cost-effective, high-throughput sequencing technologies provide new tools for development of genomic resources with huge potential for downstream applications in virtually any species. In particular, these Next-Generation Sequencing (NGS) methods provide a unique opportunity to advance studies of non-model plants, including economically important trees with complex genomes such as conifers [3-5].
Here, we focus on a group of four closely related European pines: Scots pine (Pinus sylvestris L.) and the three taxa comprising the P. mugo complex including P. mugo Turra (dwarf mountain pine), P. uncinata Ramond (mountain pine) and P. uliginosa Neumann (peat-bog pine). These species differ from each other in phenotype, total population size, geographical distribution and ecology, in particular for traits related to dehydrative stress and temperature [6-8]. Pinus sylvestris is one of the most ecologically and economically important forest tree species in the world and has the largest distribution of all pines, being found from western Scotland to eastern Siberia and from Turkey and Spain north to the Arctic Circle. It is locally adapted to environmental conditions related to photoperiod and temperature and shows clinal latitudinal variation in timing of bud set and cold hardiness . Pinus mugo is a high-altitude polycormic European pine of up to a few meters in height, which forms shrub populations above the tree line in the mountainous regions of central and southeastern Europe. Pinus uncinata and P. uliginosa are trees of up to 20 m height; the former is a forest forming component in the high mountains of Western Europe, the latter is adapted to peatbogs in lowland areas of Central Europe.
Despite clear morphological and ecological differentiation, analysis of nuclear genes showed that the species share a similar genetic background, indicating recent divergence . However, despite significant inter- and intra-specific gene flow during historical range shifts, local adaptation to highly contrasting environments has occurred [10,11]. The species are not completely reproductively isolated, can occur and hybridize in sympatry and have the same number of chromosomes (2n = 24). Considering their genetic similarity, but distinctive phenotypes (tree/shrub), geographical ranges (widespread/restricted) and ecology (generalist/specialist) the species comprise a promising system for study of the genomic basis of adaptation and the genetic architecture of phenotypic traits. Taking advantage of the system for comparative studies requires development of a comprehensive array of genomic resources and methods addressing variation at the whole genome scale.
For large and complex genomes, transcriptome sequencing is an attractive alternative to whole genome sequencing, and yields a comparatively high content of functional information from coding regions. By constructing a comparative analysis within a phylogenetic framework we aimed to develop genomic resources relevant to molecular evolution in the genes and gene complexes underlying inter- and intra-specific variation in this important group of tree species.
Results and discussion
Characteristics of the transcript sequence
Plant material used for transcriptome sequencing
Scotland, Glen Tanar
Romania, Southern Carpathians, Busteni
Bosnia and Herzegovina, Bjelasnica Mts
Italy, Abruzzi, La Maiella
Austria, Karwendel Mts., Scharnitz
Poland, Sudety Mts, Śląskie Kamienie
France, Col de la Croix de Morand
Spain, Pyrenees, La Trapa
Andorra, Eastern Pyrenees, Vall de Ransol
Spain, Sierra de Gudar
Poland, Low Silesian Pinewood, Węgliniec
Poland, Wielkie Torfowisko Batorowskie reserve
Statistics for de novo transcriptome assembly of the reference sample (2_GT_31)
Raw assembly generated from Trinity
Max contig length
Num contigs >100
Total bases in contigs >100
N50 for contigs >100
Contigs >100 in N50
GC contigs >100
nonATGC in contigs >100
Mean length for contigs >100
Mapping statistics of the samples to the reference transcriptome sequence (2_GT_31)
% Mapped reads
% Duplicate reads
% Mapped reads as proper pairs
Number of SNPs
Nucleotide polymorphism and genetic relationships between species
Common and unique SNPs in pair-wise comparisons between species
UNIQUE SNPs (in reference to the species in each column)
Nucleotide variation at 676 merged nuclear ( n DNA) contigs in the pine species
Pairwise Fst between species at 27929 SNPs identified at 676 merged nuclear ( n DNA) contigs
We provide a reference transcriptome sequence for Scots pine, a conifer tree species of great ecological and economic importance in the world. We annotated the transcriptome in reference to many genes and metabolic pathways described in open access databases.
Putting our study in a phylogenetic framework we provide novel genomic resources comprising a publicly-available database of SNP markers for a set of four closely related pine species. Information about nucleotide polymorphism in coding regions will facilitate genotyping, population genetic and association studies to better understand the genetic basis of plant adaptation and speciation.
Our study shows the largest genetic divergence between P. mugo and P. sylvestris. Despite large differences in distribution range and total population size, all species showed very similar patterns of nucleotide polymorphism.
Our results demonstrate the high relevance of Illumina technology for de novo assembly, transcriptome characterization and marker discovery in a species with large and complex genomes, which lack draft genome sequence information.
Plant material and RNA extraction
Needles of the four pine species were collected from two year old seedlings grown in a glasshouse at the Centre for Ecology and Hydrology, Edinburgh, UK. The seedlings were obtained from seeds collected in seventeen populations of the species (five for each of P. sylvestris and P. mugo, four for P. uncinata and three for P. uliginosa) from across the species distribution range and environmental gradients in Europe (Table 1, Figure 1). After sampling, the needles were immediately frozen in liquid nitrogen and homogenized with a pestle and mortar. Total RNA for generation of transcript sequence was extracted from 100 mg of the needle powder using Spectrum™ Plant Total RNA Kit (Sigma) following the manufacturer’s protocol. RNA concentration and quality was assessed with the use of a Qubit® Fluorometer (Life Technologies). A total of 10 μg of input RNA for each sample was used for normalized cDNA library preparation.
cDNA library construction and sequencing
Template cDNA libraries for each sample were prepared using TruSeq™ RNA Sample Preparation Kits (Illumina). The poly‐A containing mRNA molecules were purified in two steps from 10 μg of total RNA using poly‐T oligo‐attached magnetic beads. During the second elution of the poly‐A RNA, the RNA was fragmented to 120-210 bp inserts (by incubation of the samples at 94°C for 8 minutes) and primed for cDNA synthesis. The cleaved RNA fragments primed with random hexamers were reverse transcribed into first strand cDNA followed by DNA Polymerase I second strand cDNA synthesis and RNase H treatment. Ampure XP beads were used to separate the double strand cDNA from the 2nd strand reaction mix. The synthesized cDNA was subjected to end-repair to convert the overhangs resulting from fragmentation into blunt ends. These repaired cDNA fragments were adenylated at 3′ ends to prevent them from ligating to one another during the adapter ligation reaction. Paired-end adapters were ligated to the ends of these double strand cDNA preparing them for hybridization onto a flow cell. DNA fragments that had adapter molecules on both ends were enriched by PCR to amplify the amount of DNA in the final cDNA library. Normalization of cDNA was conducted to increase the chance of discovering genes of low expression level. Quality control of the sample libraries and quantification of the DNA templates was conducted using Agilent Technologies 2100 Bioanalyzer using Agilent DNA1000 chip. The cDNA libraries were sequenced using Illumina HiSeq 2000 platform at Edinburgh Genomics, the University of Edinburgh, Scotland according to the manufacturer’s instructions (Illumina, San Diego, CA). Sequencing was conducted to generate 100 base paired-end reads for all samples including the Scots pine sample (2_GT_31, Scotland, Glen Tanar) used as a reference. Raw data for all samples were deposited in European Nucleotide Archive [ENA accession number: PRJEB6877].
Reference transcriptome assembly and gene annotation
Prior to assembly, filtering of the raw reads for the reference sample 2_GT_31 was carried out to increase the quality of data and eliminate any sequencing errors. Reads with adapter contamination, potential contaminant, and poor-quality reads with ambiguous sequences “N” were discarded. Reads were de novo assembled into contigs using Trinity (version r2012-06-08) . We got 151932 potential transcripts as an output. In order to reduce the redundancy in this dataset, only transcripts with ORFs were retained, and highly similar sequences were clustered (similarity level of >95%) using CD-HIT . A final set of 40968 clustered transcripts was BLASTx scanned for the presence of known retrotransposons and repetitive elements known to be present in conifer genome. Several search approaches were used including queries of known retrotransposon sequences in plants (IFG7, GYMNY, PtIFG7, Ta1-3, PpRT1) and searches for terms associated with retroelements such as copia, gypsy, gag, retrotransposon, integrase, retroelement, reverse transcriptase [23-25]. Using the above approaches 170 contigs were identified that may represent transcriptionally active retroelements. They were excluded from final reference transcriptome sequences of 40798 contigs. Annotation of the clustered transcripts based on the functional category was conducted using Annot8r based on BLAST similarity searches against annotated subsets of EMBL UniProt protein sequence and functional information database using an E-value threshold of 10−5 . BLASTx search was conducted against the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway with an E-value cutoff of <10−5 to annotate the genes to known proteins and to look at the networks of molecular functions and interactions of the unigenes. Gene Ontology (GO) classification of the unigenes based on BLAST matches to known proteins was conducted based on biological processes, molecular function and cellular component.
Alignment, SNP calling and filtering
The set of 40798 transcripts (2_GT_31) was used as reference for mapping reads for the 16 other samples. Alignment was performed using BWA (version 0.6.1) . Duplicates were marked using Picard . To eliminate errors due to indel misalignment, local realignment was conducted using GATK . SNPs were called for each sample and species as compared to the reference using Samtools (version samtools-0.1.18) . A set of SNPs identified across all samples was filtered to look for those suitable for genotyping platforms such as Illumina with a minimum spacing between SNPs of 50 base-pairs (bp) flanking nucleotides on either side of a SNP.
Nucleotide polymorphisms and divergence
Polymorphism and divergence were quantified within and among species to provide information about the overall pattern of nucleotide variation in the samples. The number of shared and unique SNPs was calculated based on calls from pairwise comparisons between each species. A subset of contigs were selected that were common to, and polymorphic in, all samples relative to the reference. Fasta files for each contig were produced using vcf-tools  and concatenated into a single sequence for each sample. In total 1,364,676 bp of DNA was aligned across 676 contigs. Basic statistics including number of polymorphic sites, nucleotide diversity (measured as the average number of nucleotide differences per site (π) between two sequences ) and divergence between species were estimated using DnaSP v.5 . Relationships between samples were assessed using Principal Coordinate Analysis (PCoA) based on a pairwise genetic distance matrix (number of base differences per sequence) between samples, and using the UPGMA method based on the number of substitutions per site from averaging over all sequence pairs between groups using the Tamura-Nei model . Polymorphism at the common set of 676 merged nuclear contigs was used to evaluate the genetic differentiation in pairwise comparisons between species. Significance was estimated by 1000 permutations of the samples between species using Arlequin v.3.5 . The outlier Pinus uncinata sample (UN2), defined based on PCoA analysis, was excluded from divergence estimates in UPGMA and the species genetic differentiation analysis.
The sequence of 40798 transcripts of the reference Scots pine sample (2_GT_31):
Filename: Reference_PS2_trinity.fasta; URL: http://doi.org/10.5285/b6900166-ded6-4f7a-8734-484b6f77b2f1
SNP files for each sample with reference to Scots pine transcriptome sequence (2_GT_31):
Filenames: PS1_SNPs.vcf; PS2_SNPs.vcf; PS3_SNPs.vcf; PS4_SNPs.vcf; PS5_SNPs.vcf; M1_SNPs.vcf;
M2_SNPs.vcf; M3_SNPs.vcf; M4_SNPs.vcf; M5_SNPs.vcf; UN1_SNPs.vcf; UN2_SNPs.vcf;
UN3_SNPs.vcf; UN4_SNPs.vcf; UG1_SNPs.vcf; UG2_SNPs.vcf; UG3_SNPs.vcf;
The cDNA library preparation and Illumina sequencing was carried out at Edinburgh Genomics, the University of Edinburgh, Scotland. The research was financially supported by NERC (grant nr. NE/H003959/1). WW acknowledges financial support from Polish National Science Centre (DEC-2012/05/E/NZ9/03476).
- Bonan GB. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science. 2008;320(5882):1444–9.PubMedView ArticleGoogle Scholar
- Neale DB, Kremer A. Forest tree genomics: growing resources and applications. Nat Rev Genet. 2011;12(2):111–22.PubMedView ArticleGoogle Scholar
- Neale D, Wegrzyn J, Stevens K, Zimin A, Puiu D, Crepeau M, et al. Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biol. 2014;15(3):R59.PubMed CentralPubMedView ArticleGoogle Scholar
- Birol I, Raymond A, Jackman S, Pleasance S, Coope R, Taylor G, et al. Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics. 2013;29:1492–7.PubMed CentralPubMedView ArticleGoogle Scholar
- Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, et al. The Norway spruce genome sequence and conifer genome evolution. Nature. 2013;497(7451):579–84.PubMedView ArticleGoogle Scholar
- Wachowiak W, Boratyńska K, Cavers S. Geographical patterns of nucleotide diversity and population differentiation in three closely related European pine species in the Pinus mugo complex. Bot J Linn Soc. 2013;172(2):225–38.View ArticleGoogle Scholar
- Joosen R, Lammers M, Balk P, Bronnum P, Konings M, Perks M, et al. Correlating gene expression to physiological parameters and environmental conditions during cold acclimation of Pinus sylvestris, identification of molecular markers using cDNA microarrays. Tree Physiol. 2006;26:1297–313.PubMedView ArticleGoogle Scholar
- Boratyńska K, Boratyński A. Taxonomic differences among closely related pines Pinus sylvestris, P. mugo, P. uncinata, P. rotundata and P. uliginosa as revealed in needle sclerenchyma cells. Flora. 2007;202(7):555–69.View ArticleGoogle Scholar
- Karhu A, Hurme P, Karjalainen M, Karvonen P, Kärkkäinen K, Neale D, et al. Do molecular markers reflect patterns of differentiation in adaptive traits of conifers? Theor Appl Genet. 1996;93(1–2):215–21.PubMedView ArticleGoogle Scholar
- Wachowiak W, Palme AE, Savolainen O. Speciation history of three closely related pines Pinus mugo (T.), P. uliginosa (N.) and P. sylvestris (L.). Mol Ecol. 2011;20(8):1729–43.PubMedView ArticleGoogle Scholar
- Jasińska AK, Wachowiak W, Muchewicz E, Boratyńska K, Montserrat JM, Boratyński A. Cryptic hybrids between Pinus uncinata and P. sylvestris. Bot J Linn Soc. 2010;163(4):473–85.View ArticleGoogle Scholar
- Parchman T, Geist K, Grahnen J, Benkman C, Buerkle CA. Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics. 2010;11(1):180.PubMed CentralPubMedView ArticleGoogle Scholar
- Grotkopp E, Rejmanek M, Sanderson MJ, Rost TL. Evolution of genome size in pines (Pinus) and its life-history correlates: Supertree analyses. Evolution. 2004;58(8):1705–29.PubMedView ArticleGoogle Scholar
- Hurme P, Sillanpaa MJ, Arjas E, Repo T, Savolainen O. Genetic basis of climatic adaptation in Scots pine by Bayesian quantitative trait locus analysis. Genetics. 2000;156(3):1309–22.PubMed CentralPubMedGoogle Scholar
- Pyhäjärvi T, Garcia-Gil MR, Knürr T, Mikkonen M, Wachowiak W, Savolainen O. Demographic history has influenced nucleotide diversity in European Pinus sylvestris populations. Genetics. 2007;177(3):1713–24.PubMed CentralPubMedView ArticleGoogle Scholar
- Wachowiak W, Balk P, Savolainen O. Search for nucleotide diversity patterns of local adaptation in dehydrins and other cold-related candidate genes in Scots pine (Pinus sylvestris L.). Tree Genetics & Genomes. 2009;5(1):117–32.View ArticleGoogle Scholar
- Kujala S, Savolainen O. Sequence variation patterns along a latitudinal cline in Scots pine (Pinus sylvestris): signs of clinal adaptation? Tree Genetics & Genomes. 2012;8(6):1451–67.View ArticleGoogle Scholar
- Wachowiak W, Salmela MJ, Ennos RA, Iason G, Cavers S. High genetic diversity at the extreme range edge: nucleotide variation at nuclear loci in Scots pine (Pinus sylvestris L.) in Scotland. Heredity. 2010;106:775–87.PubMed CentralPubMedView ArticleGoogle Scholar
- Mosca E, Eckert AJ, Di Pierro EA, Rocchini D, La Porta N, Belletti P, et al. The geographical and environmental determinants of genetic diversity for four alpine conifers of the European Alps. Mol Ecol. 2012;21(22):5530–45.PubMedView ArticleGoogle Scholar
- Zapata-Valenzuela J, Isik F, Maltecca C, Wegrzyn J, Neale D, McKeand S, et al. SNP markers trace familial linkages in a cloned population of Pinus taeda-prospects for genomic selection. Tree Genetics & Genomes. 2012;8:1307–18.View ArticleGoogle Scholar
- Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protocols. 2013;8(8):1494–512.View ArticleGoogle Scholar
- Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.PubMed CentralPubMedView ArticleGoogle Scholar
- Morse AM, Peterson DG, Islam-Faridi MN, Smith KE, Magbanua Z, Garcia SA, Kubisiak TL, Amerson HV, Carlson JE, Nelson CD et al.: Evolution of Genome Size and Complexity in Pinus. PLoS ONE 2009, 4(2). doi:10.1371/journal.pone.0004332.Google Scholar
- Rocheta M, Cordeiro J, Oliveira M, Miguel C. PpRT1: the first complete gypsy-like retrotransposon isolated in Pinus pinaster. Planta. 2007;225(3):551–62.PubMedView ArticleGoogle Scholar
- Kovach A, Wegrzyn JL, Parra G, Holt C, Bruening GE, Loopstra CA, et al. The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences. BMC Genomics. 2010;11(1):420.PubMed CentralPubMedView ArticleGoogle Scholar
- Schmid R, Blaxter M. annot8r: GO, EC and KEGG annotation of EST datasets. Bmc Bioinformatics. 2008;9(1):180.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.PubMed CentralPubMedView ArticleGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.PubMed CentralPubMedView ArticleGoogle Scholar
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.PubMed CentralPubMedView ArticleGoogle Scholar
- Nei M. Molecular evolutionary genetics. New York: Columbia University Press; 1987.Google Scholar
- Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585–95.PubMed CentralPubMedGoogle Scholar
- Librado P, Rozas J. DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics. 2009;25(11):1451–2.PubMedView ArticleGoogle Scholar
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011;28:2731–9.PubMed CentralPubMedView ArticleGoogle Scholar
- Excoffier L, Lischer HEL. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour. 2010;10(3):564–7.PubMedView ArticleGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.