Re-annotation of the woodland strawberry (Fragaria vesca) genome

Background Fragaria vesca is a low-growing, small-fruited diploid strawberry species commonly called woodland strawberry. It is native to temperate regions of Eurasia and North America and while it produces edible fruits, it is most highly useful as an experimental perennial plant system that can serve as a model for the agriculturally important Rosaceae family. A draft of the F. vesca genome sequence was published in 2011 [Nat Genet 43:223,2011]. The first generation annotation (version 1.1) were developed using GeneMark-ES+[Nuc Acids Res 33:6494,2005]which is a self-training gene prediction tool that relies primarily on the combination of ab initio predictions with mapping high confidence ESTs in addition to mapping gene deserts from transposable elements. Based on over 25 different tissue transcriptomes, we have revised the F. vesca genome annotation, thereby providing several improvements over version 1.1. Results The new annotation, which was achieved using Maker, describes many more predicted protein coding genes compared to the GeneMark generated annotation that is currently hosted at the Genome Database for Rosaceae (http://www.rosaceae.org/). Our new annotation also results in an increase in the overall total coding length, and the number of coding regions found. The total number of gene predictions that do not overlap with the previous annotations is 2286, most of which were found to be homologous to other plant genes. We have experimentally verified one of the new gene model predictions to validate our results. Conclusions Using the RNA-Seq transcriptome sequences from 25 diverse tissue types, the re-annotation pipeline improved existing annotations by increasing the annotation accuracy based on extensive transcriptome data. It uncovered new genes, added exons to current genes, and extended or merged exons. This complete genome re-annotation will significantly benefit functional genomic studies of the strawberry and other members of the Rosaceae.


Background
The diploid strawberry Fragaria vesca is native to temperate regions of Eurasia and North America and is commonly known as the alpine or woodland strawberry. Due to its small size and small genome it is a versatile experimental perennial plant system and an emerging model for the Rosaceae family. The extant genome exhibits synteny with other commercially important members of the Rosaceae family such as apple (Malus domestica) and peach (Prunus persica) [1] and an ancestral F. vesca genome contributed to the genome of the octoploid dessert strawberry (F. × ananassa). Information obtained from studies of all aspects of plant growth, biochemistry, and physiology of F. vesca should be applicable to or inform studies of other Rosaceae species [1].
The time, cost, and difficulty of generating transcriptome sequences has been greatly reduced due to recent advances in sequencing technology, and RNA-Seq is now dominant over microarrays for in-depth transcriptome studies. The Illumina HiSeq 2000 platform was previously used to sequence 50 RNA-Seq libraries of 25 different F. vesca tissue types from early developing fruit at various stages, young leaves, and seedlings [2] of the 7 th generation inbred line Yellow Wonder 5AF7 (YW5AF7) [3]. The 50 libraries represent two biological replicates of 25 tissue types, and each library yielded between 12 and 40 million 51 bp, single end reads, for a total of~70 Giga bytes of sequence data [2,4].
The genome sequence of the F. vesca inbred line Hawaii4×4 was published in 2011 [5] and the first version of the gene predictions is hosted at the Genome Database for Rosaceae (GDR) http://www.rosaceae.org/ projects/strawberry_genome/v1.1/assembly [4]. In 2013, The National Center for Biotechnology Information (NCBI) published a new F. vesca annotation using the NCBI eukaryotic gene prediction tool Gnomon. Both annotations, from the GDR and the NCBI, are based on ab initio gene predictions and alignment of high confidence ESTs.
Using Bowtie2 [6] with default parameters, an average of 80.32% of the transcriptome reads from each library aligned to the genome (version 1.1), while only an average of 60.58% of these sequence reads aligned to the current gene predictions at GDR. Visualization of the mapped reads using GBrowse [7] uncovered incidences of genes of incorrect size, mis-annotated intron/exon junctions, and reads mapping to the genome that could represent non-coding transcripts, indicating that current annotation would be improved by incorporating the RNA-Seq data.
For the new annotation we used the MAKER2 annotation pipeline [8,9] to combine the following data sources: 1) de novo and reference based assemblies of the 50 RNA-Seq transcriptomes, 2) RefSeq alignments of publicly available plant transcripts, 3) current annotations from GDR, and 4) ab initio gene predictions based on analysis by SNAP, Augustus and GeneMark [10][11][12][13].
The resulting F. vesca genome re-annotation increases the number of coding regions and the total coding length across all seven linkage groups (LG1 -LG7) and the non-anchored scaffolds (LG0). This increase is due to the addition of exons to existing genes, the extension or merging of current exons, correction of intron/exon junctions, and the discovery of additional genes. Overall, this new annotation, named TowU_Fve, provides an improved annotation file and facilitates future gene isolation and identification in strawberry and other Rosaceae species.

Results and discussion
De novo transcriptome generation and assembly The de novo assembly pipeline shown in ( Figure 1A) was used to assemble the reads from 50 stage and tissue libraries, resulting in 754,400 transcripts. The average number of assembled transcripts across different samples   (Table 1), with the minimum number of assembled transcripts found in the early stage embryo (Embryo3) and the maximum number found in the leaf (Leaf1). All de novo assembled transcripts were aligned to the F. vesca genome using GMAP [14] within PASA, with the aims of eliminating sequences not aligning to the genome and merging de novo assembled sequences to remove redundancy. An average of 88.32% of the de novo assembled transcripts from each sample aligned to the genome.

Reference based assembly
Next, we carried out reference-guided assembly using TopHat (http://ccb.jhu.edu/software/tophat/) and Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/) ( Figure 1B). TopHat aligned RNA-Seq reads to the reference genome and identified exon-exon splice junctions. Cufflinks then used the alignment generated by TopHat and GeneMark gene models to assemble a total of 1,302,739 reference based transcripts, with the average being 52,110 and the minimum and maximum number found in the same tissues as for the de novo assembly (Table 1).

F. vesca genome re-annotation pipeline
We then used the MAKER annotation pipeline (Figure 2) to generate the revised F. vesca annotation. MAKER is able to generate ab initio gene prediction using several tools within its pipeline; it identifies repeats, aligns proteins and ESTs to a genome, and automatically combines all classes of evidence data into gene annotations. Data analyzed in the MAKER pipeline included: 1) 754,400 de novo assembled transcripts from 25 samples (Table 1), each with two biological replicates; 2) trained ab initio predictions from the SNAP gene prediction tool; 3) Augustus trained datasets of Arabidopsis thaliana and Solanum lycopersicum (tomato) transcriptomes; 4) first generation F. vesca gene predictions obtained from the Comparison of TowU_Fve annotation with the prior annotation (version 1.1) The TowU_Fve annotation increased the number of coding regions by 9,139 compared to the version 1.1 annotation. This translates into over two million base pairs of extra coding DNA sequence (CDS) ( Table 2). As summarized in Table 3, there are 2,286 newly predicted gene models (genes models that do not overlap with any of the genes from version1.1 annotation) in TowU_Fve. The number of newly identified coding exons is 6,006, and the average length of each of these exons is 183bp. The total coding length in all 7 linkage groups was found to be over 1.1 Mb.
The increased numbers of coding regions were discovered based on the RNA-Seq reads from different tissue libraries, as illustrated in Figure 3. For example, Figure 3A illustrates that the transcriptome data uncovered potential splice variation for gene21088, a putative receptor protein kinase, and re-annotation resulted in retention of a previously annotated intron. Figure 3B shows the addition of an exon to gene31621, a bZIP transcription factor. Figure 3C illustrates the discovery of a new gene, a putative hydrolase, absent from previous annotations.
We PCR amplified, cloned and sequenced the cDNA of gene11268 encoding a MADS box protein to confirm experimentally the TowU_Fve annotation. TowU_Fve predicts additional exons in the second intron based on the RNA-seq data ( Figure 4A,B). Figure 4C illustrates the amplified coding region found by sequencing two independent cDNA clones, which contained the additional TowU_Fve predicted exons.
Because the RNA-seq data was obtained from the inbred line YW5AF7, which is different from Hawaii4×4 on which the prior genome assembly was based, there remains some possibility that TowU_Fve predictions differ from the prior annotations because of genome sequence differences between the two lines. Nevertheless, TowU_Fve represents a substantial improvement over previous annotations and annotation differences due to sequence differences between YW5AF7 and Hawaii4×4 may potentially underlie interesting phenotypic differences between these two lines.

Functional annotation of new gene models
Orthologous relationships between the new gene models predicted by the TowU_Fve annotation and other plants Figure 2 Summary of overall bioinformatics pipeline for F. vesca genome re-annotation. Evidence data including de novo assembled transcripts from all 50 samples, reference-guided assembled transcripts from all 50 samples, gene models generated using ab initio algorithm based tools (Augustus, SNAP and GeneMark), the first generation F. vesca gene predictions and plant reference proteins were passed to the MAKER pipeline to generate TowU_Fve annotation.
were evaluated by Blast2GO against the NCBI nonredundant protein database [15][16][17]. The goals of the GO analysis were to obtain support for the plant origin of the newly identified genes and to acquire information as to the function of these genes. The Blast2GO analysis showed that about 70% of the new gene models were found to have sequence homologies to plant proteins in GenBank (at e = 10 −5 ) and could be assigned GO functions, or were found to contain a known protein domain using InterProScan. About 30% of the new gene models did not have any significant Blast hits or InterProScan identified domains.

Conclusions
The F. vesca genome was sequenced and subsequently released in 2010, along with a first generation annotation (version 1.0) [5], that was subsequently replaced by version 1.1. Recently published deep transcriptome sequencing has shown that these previous versions of annotations were not completely accurate, as might be expected given that they were mainly derived from ab initio predictions combined with mapping high confidence ESTs and mapping of gene deserts from transposable elements. Accurate and detailed genome annotation for diploid strawberry would be a valuable resource for Fragaria and the entire Rosaceae family. Seventy percent of the 2,286 new gene models identified by TowU_Fve have homologs in other plant species and/or have known GO ontologies. The remaining 30% potentially encode proteins of special interest to the Fragaria research community. The revised annotation, based on transcriptome sequences from a large number of different tissue samples, represents an important milestone in improving the accuracy of the diploid strawberry genome annotation. This improved genome annotation, TowU_Fve, provides a valuable resource for comparative and functional studies in flowering plants.

RNA-Seq Data
Tissue collection from the YW5AF7 cultivar of Fragaria vesca [3], RNA extraction, and sequencing were previously   The first generation annotation (peach color) shows an absence of a gene between 5613k and 5616k, while the aligned reads from leaf tissue revealed the existence of an expressed gene at that site. The TowU_Fve annotation (red) shows a newly predicted gene (an alpha/beta-hydrolase domaincontaining protein) between gene35181 and gene12565. described in detail [2]. Briefly, plants were grown in growth chambers with 12 hours light at 25°C and 12 hours dark at 20°C. Samples were manually dissected from 25 different tissues (listed in Table 1) with two biological replicates for each tissue. cDNAs resulting from reverse transcription of RNA extracted from each of the 50 samples were sequenced on the Illumina HiSeq2000 platform using single-end chemistry with read lengths of 51 bp [2].

De novo assembly and PASA alignments
More than 600 million single end reads were assembled using a two-step de novo assembly pipeline, Figure 1A. Sequence reads of the replicates were first merged into one library, and then merged libraries were assembled using Trinity [18]. This step generated redundant transcripts and transcripts that do not align to the genome.
In the second step of the de novo assembly pipeline these issues were resolved using the Genomic Mapping and Alignment Program for mRNA and EST Sequences (GMAP) [14] within Program to Assemble Spliced Alignments (PASA) [19]. All de novo assembled transcripts from the first step were aligned to the F. vesca genome version1.1. The resulting alignments were then used as one of the inputs of the re-annotation pipeline. All transcripts not aligning to the genome were thereby discarded.

Reference based assembly
The bioinformatics pipeline for the reference based assemblies of the transcriptome data is shown in Figure 1B. The first step was to align all reads from the 50 RNA-Seq libraries to the F. vesca genome version1.1 by passing each of the RNA-Seq samples through TopHat [20], resulting in 50 BAM files. TopHat aligns RNA-Seq reads to the reference genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie [20,21]. The BAM files from sample replicates were then merged using the "merge" command within SAMtools [22] reducing the number of BAM files to 25. All 25 BAM files were then sorted using the SAMtools "sort" command. The final step was to pass each of the 25 sorted BAMs to Cufflinks [23][24][25] to generate assemblies. Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples [23][24][25], and uses the alignment generated by TopHat and GeneMark gene models to assemble the reference based transcripts.
Training ab initio gene finding tools (GeneMark, Augustus, SNAP) GeneMark models were built by training the GeneMark tool using the F. vesca genome version1.1. Augustus pre-trained datasets of A. thaliana and S. lycopersicum (tomato) were used along with the following: GeneMark models, de novo assembled transcripts, reference-guided assemblies, GDR annotation, and reference sequence proteins to run the first round of the MAKER annotation pipeline. The gene models, generated from the first round of the MAKER annotation pipeline were then used to train the SNAP tool.

MAKER annotation pipeline
All de novo assembly steps were executed on the Data Intensive Academic Grid (DIAG), a shared computational cloud that is available for academic and non-profit institutions for performing bioinformatics analyses http://diagcomputing.org/about/investigators.php. All MAKER runs were executed on the iPlantCollaborative (http://www.iplantcollaborative.org/) cloud infrastructure service platform. We used the virtual machine instance emi-490420DC size c1.xlarge (16 CPUs, 16 GB memory and 50 GB disk) to run the MAKER annotation pipeline.
The MAKER annotation pipeline was utilized to automatically synthesize the following input data for a final run into gene annotations with evidence-based quality values. The data used as input into the MAKER pipeline shown in (Figure 2) are: de novo assemblies, referenceguided assemblies, reference sequence proteins, Augustus trained datasets, SNAP trained models, GeneMark models and the GDR annotation.

Experimental verification of gene11268
Stage 12 anthers were dissected from YW5AF7 flowers and total RNA was extracted using the RNeasy Plant Mini Kit (Qiagen, www.qiagen.com) in conjunction with RNase-free DNase (Qiagen). PolyA selection and cDNA synthesis were conducted using the iScript cDNA Synthesis Kit (BioRad).