Combined de novo and genome guided assembly and annotation of the Pinus patula juvenile shoot transcriptome

Visser, Erik A.; Wegrzyn, Jill L.; Steenkmap, Emma T.; Myburg, Alexander A.; Naidoo, Sanushka

doi:10.1186/s12864-015-2277-7

Research article
Open access
Published: 12 December 2015

Combined de novo and genome guided assembly and annotation of the Pinus patula juvenile shoot transcriptome

Erik A. Visser¹,
Jill L. Wegrzyn²,
Emma T. Steenkmap³,
Alexander A. Myburg¹ &
…
Sanushka Naidoo¹

BMC Genomics volume 16, Article number: 1057 (2015) Cite this article

8554 Accesses
37 Citations
1 Altmetric
Metrics details

Abstract

Background

Pines are the most important tree species to the international forestry industry, covering 42 % of the global industrial forest plantation area. One of the most pressing threats to cultivation of some pine species is the pitch canker fungus, Fusarium circinatum, which can have devastating effects in both the field and nursery. Investigation of the Pinus-F. circinatum host-pathogen interaction is crucial for development of effective disease management strategies. As with many non-model organisms, investigation of host-pathogen interactions in pine species is hampered by limited genomic resources. This was partially alleviated through release of the 22 Gbp Pinus taeda v1.01 genome sequence (http://pinegenome.org/pinerefseq/) in 2014. Despite the fact that the fragmented state of the genome may hamper comprehensive transcriptome analysis, it is possible to leverage the inherent redundancy resulting from deep RNA sequencing with Illumina short reads to assemble transcripts in the absence of a completed reference sequence. These data can then be integrated with available genomic data to produce a comprehensive transcriptome resource. The aim of this study was to provide a foundation for gene expression analysis of disease response mechanisms in Pinus patula through transcriptome assembly.

Results

Eighteen de novo and two reference based assemblies were produced for P. patula shoot tissue. For this purpose three transcriptome assemblers, Trinity, Velvet/OASES and SOAPdenovo-Trans, were used to maximise diversity and completeness of assembled transcripts. Redundancy in the assembly was reduced using the EvidentialGene pipeline. The resulting 52 Mb P. patula v1.0 shoot transcriptome consists of 52 112 unigenes, 60 % of which could be functionally annotated.

Conclusions

The assembled transcriptome will serve as a major genomic resource for future investigation of P. patula and represents the largest gene catalogue produced to date for this species. Furthermore, this assembly can help detect gene-based genetic markers for P. patula and the comparative assembly workflow could be applied to generate similar resources for other non-model species.

Background

Pinus species play keystone ecological roles, representing the major component of many forests across all continents [1]. These species are also the predominantly planted trees in the global commercial forestry sector [2]. One of the largest threats to global pine forestry is the pitch canker fungus Fusarium circinatum, especially where susceptible Pinus species are cultivated [3]. Consequent losses caused by this fungus have large economic impacts on commercial forestry [3, 4]. Resistance to F. circinatum varies among Pinus species [5]. Species such as Pinus patula and P. radiata, both of which are important plantation species in southern Africa, are highly susceptible, while species such as P. tecunumanii are more resistant [5]. Very little is known regarding the interaction between F. circinatum and its Pinus hosts at the molecular level. Investigation of the different responses employed by susceptible and resistant hosts, such as P. patula and P. tecunumanii [5], will improve our knowledge of responses necessary for effective defence against F. circinatum.

RNA sequencing (RNA-seq) approaches have opened the way for transcriptome-wide analysis of gene expression [6]. Accurate quantification of gene expression using RNA-seq, however, requires a high quality reference sequence for read mapping. For organisms with well characterised reference genomes, such as Arabidopsis, this requirement is easily met, while organisms lacking a well characterized reference sequence present numerous challenges. Although the P. taeda v1.01 draft genome assembly is available [7], the size and fragmented state of the assembly can limit comprehensive transcriptome analysis [8, 9]. De novo transcriptome assembly can be used to provide a reference sequence for RNA-seq analysis while circumventing potential issues arising from problems in a genome assembly [10]. De novo transcriptome assemblies are available through GenBank and the TreeGenes database [11, 12] for at least ten Pinus species (P. banksiana [13], P. contorta [13], P. lambertiana, P. massoniana, P. monticola [14], P. palustris, P. pinaster, P. radiata, P. sylvestris [15], and P. taeda), at various levels of completion. Of these, the P. taeda transcriptome is the most comprehensive, consisting of data obtained from many different tissues and developmental stages (Mockaitis et al. unpublished).

The aim of this study was to generate a resource for transcriptome profiling in P. patula by assembling the shoot transcriptome of this economically important species. We report a P. patula shoot transcriptome containing 52 112 transcripts, of which 30 844 (60 %) are annotated. This is the largest gene catalogue for P. patula to date and a major genomic resource, which will facilitate functional genomics research in this tropical pine species.

Methods

Plant material

Six month old P. patula seedlings, from a single open pollinated family, were sourced from Top Crop Nursery, South Africa. Seedlings were transferred to, and maintained for the duration of the trial in an environmentally controlled glasshouse at 25–28 °C without supplemental lighting and allowed to acclimatize for two weeks. F. circinatum isolate FCC3579 was cultured on ½ strength potato dextrose agar (½ PDA; Merck) after which spores were harvested by washing with 15 % (v/v) sterile glycerol. Spore concentration was quantified using a haemocytometer and diluted to 5×10⁴ spores/mL by addition of 15 % (v/v) sterile glycerol. Seedlings were inoculated by clipping the apical bud and pipetting 10 μL of diluted spore solution onto the wound [16]. Seedlings inoculated with 10 μL sterile 15 % glycerol served as mock-inoculated control. Shoot tissue was harvested one day post inoculation (dpi) for three biological replicates per group (inoculated and mock-inoculated). Each biological replicate consisted of the top 4 cm of shoot tissue, measured from the wounded apical bud, harvested from 16 seedlings and pooled prior to being frozen using liquid nitrogen. Frozen tissue was stored at −80 °C until use. Disease development was monitored for six weeks post inoculation by measuring lesion and stem length from the wounded apical bud on 52 plants per group. F. circinatum infection was confirmed based on culture morphology on ½ PDA by re-isolation using tissue harvested from inoculated plants 14 dpi.

RNA isolation and sequencing

Frozen samples were homogenised using a high speed grinder (IKA-Werke, Staufen, Germany) and total RNA extracted using a modified version of Lewinsohn’s protocol [17]. Modifications were as follows: All solutions were prepared using diethylpyrolecarbonate (DEPC) treated water. Approximately 5 g homogenised shoot tissue was placed in a 50 mL conical tube containing 150 mg PVP-360 and 300 mg PVPP before adding 15 mL chilled extraction buffer. Tubes were mixed by vortexing, snap frozen in liquid nitrogen and allowed to thaw on ice. Polysaccharides were precipitated by addition of 1/9^th volume 3.3 M sodium acetate and 10 % (v/v) absolute ethanol. Nucleic acids were precipitated at −20 °C for 4 h. The pellet produced from ultracentrifugation was re-suspended in 100 μL DEPC treated water and stored at −80 °C until use. Total RNA samples were treated with RNase-free DNaseI enzyme (Qiagen Inc, Valencia, CA) to digest genomic DNA and purified using the RNeasy® MinElute kit (Qiagen) according to the manufacturer’s instructions. Concentration and integrity of purified RNA samples were evaluated using a Bio-Rad Experion™ automated electrophoresis system (Bio-Rad Laboratories, Hercules, CA, USA).

High quality RNA samples (RNA Integrity Number > 7.5) for both groups were sequenced using Illumina HighSeq 2000 instruments (200 bp insert size, 101PE sequencing, 40 million reads per sample; BGI, Hong Kong). Sequence quality of raw RNA-seq data was assessed using FastQC v0.10.1 [18]. Quality trimming and filtering of data was performed using Sickle v1.210 [19] and all unpaired reads were discarded. Short reads (<40 bp) were removed from the filtered RNA-seq reads using SolexaQA LengthSort [20]. The trimmed and filtered read data for all six samples were combined, resulting in Dataset 1. FastUniq v1.1 [21] was used to reduce PCR artefacts from Dataset 1 by removing duplicate reads, resulting in Dataset 2.

Transcriptome assembly

Multiple k-mer de novo transcriptome assembly

De novo transcriptome assembly was performed using three assemblers; Trinity r2013-11-10 [22], SOAPdenovo-Trans v1.04 [23] and Velvet v1.2.10/ Oases v0.2.08 [24, 25]. Assembly with Trinity was performed on both datasets using default parameters [26], except min_contig_length = 350, and repeated on Dataset 1 with the CuffFly parameter included. Trinity was applied to both Dataset 1 and 2 as Trinity allows for duplicate reads, however, SOAPdenovo-Trans and Velvet/Oases assemblers were used on Dataset 2 only, where duplicates were removed. Assembly with SOAPdenovo-Trans was performed on Dataset 2, for eight different k-mer lengths (23, 25, 31, 39, 47, 55, 63 and 71), with the parameters as follows: max_rd_len = 100, rd_len_cutoff = 100, avg_ins = 200, reverse_seq = 0, asm_flags = 3, pair_num_cutoff = 3, map_len = 35, −f and -F. Assembly with Velvet/Oases was performed on Dataset 2, for seven different k-mer lengths (23, 25, 31, 39, 47, 55, and 63), with the parameters as follows: default parameters for velveth; cov_cutoff = 10, ins_length = 200 and read_trkg = yes for velvetg; cov_cutoff = 10, min_pair_count = 5 and min_trans_lgth = 350 for Oases.

P. taeda v1.01 genome guided transcriptome assembly

Trinity genome guided transcriptome assembly was performed on Datasets 1 and 2 using the P. taeda v1.01 draft genome assembly (ca. 14.4 million scaffolds) with a minimum contig length of 350 bp. GSNAP 2014-02-28 (Genomic Short-read Nucleotide Alignment Program) [27] was used to align reads to the reference genome for transcriptome assembly using the following paramters: −-nofails, −-novelsplicing = 1, −-localsplicedist = 250000, −-npaths = 20. Transrate v0.3.1 [27] was used to calculate assembly quality metrics.

Decreasing redundancy across assemblies

The de novo and genome guided transcriptome assemblies were combined to form a redundant over-assembly. The tr2aacds pipeline, from the EvidentialGene package [28], was used to reduce redundancy in the over-assembly by selecting for the ‘optimal’ set of assembled transcripts based on coding potential. The pipeline consists of five steps: (1) prediction of coding DNA sequence (CDS) and amino acid sequences for each transcript, (2) removal of redundant sequences based on coding potential among identical sequences, (3) substring de-replication to remove sequence fragments, (4) clustering of highly similar sequences into loci and (5) classification of transcripts as ‘primary’ or ‘alternate’ and discarding of low scoring ‘drop’ transcripts. The primary assembled transcripts were used for further assessments.

Annotation

Local alignments to the National Centre for Biotechnology Information (NCBI) non-redundant (nr) and plant protein databases were generated for the primary assembled transcripts from the tr2aacds pipeline using uBLAST (Edgar RC, unpublished) [29]. Parameters used for local alignments were: −evalue 1e-10, −weak_evalue 1e-4, −id 0.9, −weak_id 0.8. Local alignment sequence descriptions were used to remove non-pine origin sequences, sequences with significant alignments to prevalent fungal, bacterial, viral and insect sequences, from the assembly to produce the P. patula v1.0 draft transcriptome assembly. Blast2GO® v2.7.2 [30] was used to predict protein domains through InterProScan 5 [31] as well as to perform Gene Ontology (GO) and Enzyme Code (EC) mapping. The P. patula transcriptome GO distribution was compared to the P. taeda v1.01 draft genome annotation using CateGOrizer [32]. Gene family memberships among species were visualized using custom scripts and Venn diagrams (http://bioinformatics.psb.ugent.be/webtools/Venn/).

Identification of orthologous protein groups

Annotated protein sequences for ten different species were retrieved from version 2.5 of the PLAZA protein database [33] and four external proteins sets, from conifer species, were also included (Table 1). The complete set of predicted P. patula v1.0 proteins from the assembled transcriptome were included. Each of the 15 protein sets were clustered to 90 % identity within species and combined. Gene families were identified and annotated for the 442 372 sequences using the approach described in [8]. Pfam domains [34] were assigned to the P. patula sequences using InterProScan 5.7 [31]. Identified gene families unique to P. patula with fewer than 5 members were discarded as these could result from assembly artefacts.

Table 1 Protein sets used for analysis of orthologous genes

Full size table

Assembly validation

The Core Eukaryotic Genes Mapping Approach (CEGMA) pipeline [35] as well as the Benchmarking Universal Single-Copy Orthologs (BUSCO) v1.1b1 tool [36] were used to identify putative core eukaryotic genes (CEGs) and universal single copy orthologs (USCOs) in the assembly as a measure of the completeness and contiguity. BUSCO analysis was performed using the early access plant dataset. In addition, conditional reciprocal best BLAST (CRBB) analysis of the P. patula draft transcriptome assembly, the P. taeda v1.01 gene models and the P. taeda draft transcriptome assembly was implemented with two different sets of reference sequences using Transrate [27]. Reference sets used were as follows: the P. taeda v1.01 predicted gene models available through the TreeGenes Database [11, 12] and the 87 P. patula protein sequences available through the NCBI and TrEMBL databases.

Sequence alignments against the P. taeda v1.01 draft genome assembly were generated to compare transcript to genome mapping of the P. patula v1.0 transcriptome assembly to that of other Pinus transcriptomes. Comparative alignments were produced using transcriptome data for seven other Pinus spp. available from the TreeGenes database [11, 12]: P. taeda (83 285 sequences), P. banksiana (21 675), P. contorta (14 375), P. pinaster (14 130), P. palustris (14 228), P. lambertiana (48 891), and P. radiata (4 742). Transcript sequences were aligned to the reference genome using GMAP 2014–02-28 (Genomic Mapping and Alignment Program) [37] with the following parameters: −-intronlength = 350000, −-no-chimeras, −-canonical-mode = 1, −-cross-species. The ‘--cross-species’ parameter was excluded for alignment of the P. taeda transcriptome. Sequence alignments were examined at two different cut-offs, the first (95 % identity, 95 % coverage) to compare mapping between species and the second (95 % identity, 50 % coverage) to account for possible effects due to genome fragmentation. The P. patula v1.0 transcriptome assembly was further validated by alignment to full-length Sanger sequenced P. taeda cDNA reference sequences present in NCBI and obtained through the TreeGenes database [11]. The 188 cDNA sequences were clustered to 90 % identity. CRBB analysis to the P. patula v1.0 transcriptome was performed using Transrate [27].

Differential expression analysis

RNA-seq read mapping to the P. patula v1.0 transcriptome and expression quantification was performed through RSEM v1.2.23 (RNA-Seq by Expectation-Maximum) [38] using Bowtie2 v2.2.5 [39]. Differential expression testing was performed with EBSeq v1.10.0 [40] using median normalization (FDR < 0.05).

Results and discussion

Data generation and pre-processing

Due to the expected size of the P. patula genome (ca. 22 Gb) [41], sequencing and assembly of the genome would be a costly and challenging endeavour. Therefore, transcriptome assembly was employed to generate a P. patula reference sequence. RNA-seq of shoot tissue harvested 1 dpi for inoculated and mock-inoculated samples yielded between 21 and 43 million read pairs per sample and a total of ca. 440 million reads (Table 2). Quality filtering removed ca. 13 % of reads and duplicate filtering removed a further 35 % of reads. Thus Dataset 1 consisted of ca. 36 Gb of sequence data and Dataset 2 consisted of ca. 23 Gb that passed through quality filtering and were subsequently used for transcriptome assembly.

Table 2 Quality statistics for RNA sequencing data

Full size table

Comparison of assembler output

The completeness and quality of an assembled transcriptome is affected by the assembly program used as well as the assembly parameters used [42–48]. Comparative studies have also shown that the effectiveness of assembly programs can vary by input data set, with no assembler consistently outperforming any other [42, 43]. Due to this variability among assembler outputs, each variant assembly is likely to contain more accurate and complete assemblies at different loci. Therefore, in an effort to maximise diversity of assembled transcripts, we produced 18 de novo and two genome guided transcriptome assemblies using; Trinity, SOAPdenovo-Trans and Velvet/Oases. As expected from previous studies, large variation in the number, length and redundancy of contigs assembled was observed within and between assemblers (Fig. 1).

Trinity exhibited the most uniformity among assemblies compared to the variation among assemblies from Velvet/Oases and SOAPdenovo-Trans. An inverse relationship has been shown to exist between the number of contigs assembled and k-mer length used for assembly [43]. Therefore, the greater uniformity in assembled contig number between Trinity assemblies can likely be attributed to the program’s implementation of a fixed k-mer length for all assemblies. For each value of k used in assembly, SOAPdenovo-Trans resulted in the highest number of assembled contigs followed by Velvet/Oases and lastly, Trinity. In a comparative study, Trinity consistently assembled more contigs than Velvet/Oases and Trinity assemblies consistently had a higher N50 statistic [45]. Although the present study used newer versions of Velvet/Oases and Trinity, the difference in trends obtained illustrates the difference in performance of assemblers under different conditions, supporting the need for use of multiple assemblers during transcriptome reconstruction.

Trinity genome guided assemblies displayed lower GC ratios, as well as fewer predicted open reading frames (ORFs) compared to other assemblies (Fig. 1). This was ascribed to fragmentation of the P. taeda v1.01 genome. Nevertheless, the two genome guided Trinity assemblies were included in downstream analysis. In total, 3 447 807 assembled transcripts were used as input for the EvidentialGene tr2aacds pipeline.

The EvidentialGene pipeline selects the ‘best’ transcripts based on coding potential, thus selecting for the best ORFs assembled. Open reading frames were successfully predicted for ca. 2.7 million (77 %) of the input transcripts. Of these, 49 % were classified as redundant and 51 % were classified as differing in CDS (non-redundant). A further 55 % of non-redundant sequences were classified as perfect fragments of other longer CDS, leaving 23 % of the predicted 2.7 million CDS as informative. Of the informative CDS, 60 % were assigned to the ‘drop’ category and discarded. Overall, this brought about a 14-fold reduction in assembled transcript number, with only 7 % of the original input sequences maintained. The resulting merged assembly contained 247 035 transcripts grouped into 66 377 predicted loci (Additional file 1: Table S1). This assembly was compared to the average assembly statistics across assemblies for each assembly program respectively (Fig. 2; Additional file 2: Table S2). Despite the decrease in transcript number, the proportion of transcripts containing a predicted ORF in the merged assembly was 10–40 % higher compared to the average ORFs per assembly for all three assemblers. This indicates that a higher proportion of transcripts in the merged set have been accurately assembled to near-full or full length. The average length among the 1 000 longest predicted proteins in the merged assembly was 1 425 amino acids. Due to the high resource expenditure required to produce these long proteins they are often well-conserved and a biological maximum has been observed for their average length [28]. For plants this maximum observed average is ca. 1 500 amino acids (based on the average between Arabidopsis, banana, cacao and poplar) [28]. This indicates that these proteins are well assembled in the merged assembly, although there is room for improvement.