Sequence diversity is usually described in comparison to a reference genome [38, 39]. Given the high degree of genetic diversity among plant cultivars, this approach might fail to recognize highly polymorphic regions and will not detect the presence or absence of genes residing in private (cultivar-specific) regions of the genome. Whole-genome sequencing and re-annotation is therefore recommended for each variety, but in predominantly heterozygous species such as grapevine the sequence diversity would make contig assembly a daunting and resource-intensive task. When a reference genome is available, genes and transcript isoforms are built de novo by mapping RNA-seq reads, but this does not solve the problem of hypervariable sequences and private genes [34, 40]. However, the de novo assembly strategy does not depend on the genome and has been applied successfully to reconstruct the transcriptomes of non-model species lacking reference genomes.
We have demonstrated the feasibility of cDNA sequencing by RNA-seq for the analysis of varietal diversity between a local grapevine cultivar (Corvina) and the PN40024 reference genome without genomic data. The availability of a reference genome allows the reconstruction procedure to be validated and highlights the diversity between the two genomes.
Improved annotation of the reference genome
The latest grapevine genome annotation (v1 produced by CRIBI; http://genomes.cribi.unipd.it/) comprises 29,971 genes identified by a combination of ab initio prediction and cDNA mapping. By comparing this annotation to the transcripts we identified, we found our method had detected 51% of the annotated genes, the remainder probably representing tissue/condition-specific transcripts that were not present in our pooled samples. The genes overlapping our sample and the v1 annotation have a higher expression level than the v1-specific genes (mean = 35.67 vs 14.31 FPKM, median = 13.03 vs 1 FPKM). These data indicate that many of the v1 annotations undetected using our method were missed because of the paucity of sequencing reads generated from the corresponding loci. A large number (2249) of potential protein-coding genes were detected in the non-annotated parts of the genome. A recent comparison of the 8x, 12x v0 and 12x v1 annotations showed that 6089 genes present in either the 8x or 12x v0 assemblies were not present in the v1 annotation. Interestingly, 1171 of our 2353 potential protein-coding genes (72 of which are only present in raw reads) were represented in the 8x or 12x v0 annotations. Current annotation is therefore incomplete and insufficient to describe the full gene space of a cultivar other than the reference Pinot Noir clone. Our method provided experimental support for 72 protein-coding genes missing from the final assembly because they were excluded from the 12x consensus, and for 2249 additional genes that appear to have been missed in the v1 annotation. Novel genes excluded from the v1 annotation appear to have meaningful biological roles, including those modulated during berry ripening and/or withering, e.g. eight disease-resistance genes (Novel_1755, Novel_2241, Novel_0853, Novel_2382, Novel_2375, Novel_1428, Novel_2207, Novel_1998), two stress-inducible genes (Novel_4520 and Novel_4511), a heat shock protein 70 gene (Novel_4478) and a senescence-associated gene (Novel_1324). The expression of the disease-resistance genes generally declined during berry development and withering (clusters 1 and 2), suggesting their role is to protect the berry from pathogens and pests during early development. In contrast, the stress-inducible genes and heat shock protein gene were induced during ripening and withering, supporting a protective role against abiotic stress during the accumulation of sugars and secondary metabolites as previously reported [25, 41, 42]. The RNA-seq data therefore provide a comprehensive insight into the biologically-relevant landscape of gene expression during berry development and ripening.
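The comparison between annotated genes that were detected and those found only in the v1 annotation can be illustrated with a minimal Python sketch. It assumes that an FPKM value is available for every annotated gene (for example from reference-based quantification) together with the set of gene identifiers recovered by the de novo reconstruction. The function names, gene identifiers and FPKM values below are hypothetical and serve only to show the arithmetic; they are not the data or the pipeline used in this study.

```python
from statistics import mean, median


def summarize(values):
    """Return (mean, median) of a list of FPKM values, or (None, None) if empty."""
    if not values:
        return (None, None)
    return (mean(values), median(values))


def compare_detection(annotated_fpkm, detected_ids):
    """Split annotated genes into detected and annotation-only groups.

    annotated_fpkm: dict mapping annotated gene ID -> FPKM (assumed input).
    detected_ids:   set of annotated gene IDs recovered by the assembly.
    """
    if not annotated_fpkm:
        raise ValueError("annotated_fpkm must contain at least one gene")
    detected = [v for g, v in annotated_fpkm.items() if g in detected_ids]
    missed = [v for g, v in annotated_fpkm.items() if g not in detected_ids]
    return {
        "fraction_detected": len(detected) / len(annotated_fpkm),
        "detected": summarize(detected),          # (mean, median)
        "annotation_only": summarize(missed),     # (mean, median)
    }


# Toy input, invented for illustration only.
toy_fpkm = {"gene_a": 40.0, "gene_b": 20.0, "gene_c": 1.0, "gene_d": 3.0}
toy_detected = {"gene_a", "gene_b"}
report = compare_detection(toy_fpkm, toy_detected)
```

Reporting the median alongside the mean is useful because FPKM distributions are typically skewed by a few highly expressed genes.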
Our method not only offers a way to annotate previously uncharacterized genes but also improves the annotation of known genes by helping to define their boundaries more robustly and to identify splice variants. Our data indicate that up to 11% of the genes in the v1 annotation are split incorrectly, similar to the error rate in other annotated plant genomes. A previous in silico analysis identified 1429 instances of erroneously split genes in the v1 annotation. We also detected 462 of these genes and our analysis suggested that 75% of them were split incorrectly in the v1 annotation. Furthermore, our data resulted in the 3′ and 5′ extension of nearly 90% of the genes we detected compared to the boundaries in the v1 annotation, indicating that the untranslated regions were longer than previously reported using in silico prediction methods. Our approach may therefore provide a useful complement to ab initio gene prediction methods to establish gene boundaries and define UTRs. Finally, our de novo transcriptome assembly method detected an average of 1.75 transcripts per locus, in line with previous reports using a reference-guided assembly of grapevine transcripts (1.25 transcripts per locus). Although beyond the scope of our investigation, the de novo reconstruction indicated alternative splice variants for 9463 loci, providing a much more exhaustive description of the grapevine transcriptome compared to in silico predictions. Studies describing alternative splicing events in plants are still scarce, but many recent studies point to the widespread occurrence of the phenomenon and to its important role in modulating gene expression and stress responses [46–48]. Our results indicate that the transcriptional landscape in Vitis is more complex than previously thought and therefore warrants further investigation.
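The logic behind flagging split genes and extended boundaries can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study: a reconstructed transcript that overlaps two or more annotated genes on the same chromosome and strand is treated as evidence that those genes may belong to a single locus, and boundary extension is measured in genomic coordinates (left and right) rather than as 5′ and 3′, which would additionally depend on strand. The `Interval` type, the function names and the coordinates are hypothetical, and a real analysis would normally also require splice-junction support.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    chrom: str
    strand: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b):
    """True if two features share chromosome and strand and overlap by >= 1 bp."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def split_gene_candidates(transcripts, genes):
    """Map each transcript name to the annotated genes it spans (two or more)."""
    candidates = {}
    for t in transcripts:
        hits = sorted(g.name for g in genes if overlaps(t, g))
        if len(hits) >= 2:
            candidates[t.name] = hits
    return candidates


def boundary_extension(transcript, gene):
    """Return (left, right) extension in bp of the transcript beyond the gene."""
    left = max(0, gene.start - transcript.start)
    right = max(0, transcript.end - gene.end)
    return (left, right)


# Toy coordinates, invented for illustration only.
genes = [
    Interval("gene_A", "chr1", "+", 100, 500),
    Interval("gene_B", "chr1", "+", 600, 900),
    Interval("gene_C", "chr1", "-", 1000, 1500),
]
transcripts = [
    Interval("tx_1", "chr1", "+", 80, 950),     # spans gene_A and gene_B
    Interval("tx_2", "chr1", "-", 1000, 1400),  # lies within gene_C
]
candidates = split_gene_candidates(transcripts, genes)
extension = boundary_extension(transcripts[0], genes[0])
```

In this toy case `tx_1` links `gene_A` and `gene_B`, which would make them candidates for merging, whereas `tx_2` supports `gene_C` without extending it.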
Expression of Corvina private genes during berry development
Recent data from the deep sequencing of human individuals and Arabidopsis ecotypes revealed portions of the genome that are not shared among all genotypes and the reference genome [14, 15]. Interestingly, the novel genomic sequences included a set of protein-coding genes (private genes) potentially contributing to the intra-species variability. Similarly, we detected 180 putative protein-coding genes with a high coding potential or matches to plant ESTs that represent potential Corvina private genes.
We identified 146 private genes expressed in at least one berry-sampling phase, 50 of which were differentially expressed between samples, and these could represent a group of genes that directly contribute to the specific characteristics of the Corvina berry. Some of these private genes could have been selected by ancient breeders looking for particular berry quality traits, such as the ability to withstand the lengthy drying phase (rasinate) required to make passito wines (straw wines) such as Amarone and Recioto. For example, we identified a heat shock protein gene (Private_087) and a stress-inducible gene (Private_101) induced during ripening, consistent with the ability of Corvina berries to undergo dehydration for up to 100 days [26, 27]. Furthermore, we detected the induction of genes involved in translation and protein metabolism during withering, including three ribosomal proteins (Private_068, Private_108 and Private_116), three elongation factors (Private_166, Private_164 and Private_152), ubiquitin (Private_122), a 5-methyltetrahydropteroyltriglutamate-homocysteine methyltransferase (Private_094) and a DNA-binding protein (Private_171). This supports cDNA-AFLP data indicating the induction of genes with similar functions during withering.
Thirty-three of the Corvina private genes matched homologs in other grape varieties but not the reference genome. This is expected because the dispensable part of the genome may be partly shared among different cultivars and only a few genes may be truly unique to a particular accession. For example, we found two Flowering Locus T (FT) genes (Private_100 and Private_113), the first corresponding to the previously-described VvFT gene found in the cultivars Cabernet Sauvignon and Tempranillo. At least six members of the FT/TFL1 gene family were identified in the Tempranillo genome, including VvFT, which appears to be the ortholog of Arabidopsis FT and therefore induces precocious flowering when expressed in Arabidopsis, consistent with reported expression patterns associated with seasonal floral induction in latent buds and with the development of inflorescences, flowers and fruits. There is no evidence for the presence of classical floral regulatory pathways in grapevine, and the expression profile of VvFT suggests that it only partially corresponds to the florigen role of Arabidopsis FT. We also observed the expression of VvFT during berry formation, suggesting an additional and uncharacterized role of this gene during early berry formation.