Genomic analysis of the native European Solanum species, S. dulcamara

Background Solanum dulcamara (bittersweet, climbing nightshade) is one of the few species of the Solanaceae family native to Europe. As a common weed it is adapted to a wide range of ecological niches and it has long been recognized as one of the alternative hosts for pathogens and pests responsible for many important diseases in potato, such as Phytophthora. At the same time, it may represent an alternative source of resistance genes against these diseases. Despite its unique ecology and potential as a genetic resource, genomic research tools are lacking for S. dulcamara. We have taken advantage of next-generation sequencing to speed up research on and use of this non-model species. Results In this work, we present the first large-scale characterization of the S. dulcamara transcriptome. Through comparison of RNAseq reads from two different accessions, we were able to predict transcript-based SNP and SSR markers. Using the SNP markers in combination with genomic AFLP and CAPS markers, the first genome-wide genetic linkage map of bittersweet was generated. Based on gene orthology, the markers were anchored to the genome of related Solanum species (tomato, potato and eggplant), revealing both conserved and novel chromosomal rearrangements. This allowed a better estimation of the evolutionary moment of rearrangements in a number of cases and showed that chromosomal breakpoints are regularly re-used. Conclusion Knowledge and tools developed as part of this study pave the way for future genomic research and exploitation of this wild Solanum species. The transcriptome assembly represents a resource for functional analysis of genes underlying interesting biological and agronomical traits and, in the absence of the full genome, provides a reference for RNAseq gene expression profiling aimed at understanding the unique biology of S. dulcamara. Cross-species orthology-based marker selection is shown to be a powerful tool to quickly generate a comparative genetic map, which may speed up gene mapping and contribute to the understanding of genome evolution within the Solanaceae family.


Background
Solanum dulcamara (bittersweet, climbing nightshade) is an allogamous diploid (2n = 2x = 24) species with a genome size of~780 Mb [1]. It is one of the few Solanaceae species native to Europe, although it has been widely naturalised around the world (e.g. North America, Asia). S. dulcamara is placed in the clade Dulcamaroid, one of the 13 well-supported monophyletic clades in the Solanum section. The Dulcamaroid clade is closely related to the Morelloid clade, which includes S. nigrum, also native to Eurasia and considered as a weed of arable fields, and next to the Potato clade which includes species of economic importance such as S. tuberosum (potato) and S. lycopersicum (tomato) [2].
Bittersweet is adapted to grow in a wide range of ecological niches, from wet habitats such as riverbanks, lake shores and irrigation ditches to dry plains and dunes [3,4] (Figure 1). It is able to produce different phenotypes under different environmental conditions and in this respect it could represent a good model to study and explore the molecular bases of acclimation and adaptation. It has been demonstrated that S. dulcamara serves as an alternative host for many agriculturally important pathogens, such as Ralstonia solanacearum [5,6] and Phytophthora infestans [7] and pests, such as the Colorado Potato Beetle (Leptinotarsa decemlineata; T. Golas, personal observation). For this reason attempts were made to eradicate it from potato cultivation areas and beyond [8]. On the other hand, as a host it can represent a European source of resistance genes for Solanum breeding, as alternative to the currently used American sources. We have recently reported on the identification of two resistance loci to Phytophthora infestans, Rpi-dlc1 and Rpi-dlc2, which are located on chromosome 9 and 10, respectively [9,10].
In spite of its important ecological role and potential to provide genetic resources for plant breeding, genomic research tools are lacking for S. dulcamara and only a very small number of nucleotide sequences (123) are currently available in GenBank [11]. Availability of a near-complete transcriptome, especially in combination with comparative genomics approaches and information transfer from related species with more genomics data can have a remarkable impact on the in-depth characterization of a species. Combining data and knowledge from the potato and tomato genome sequencing projects [12,13] with a de-novo RNAseq-based S. dulcamara transcriptome would thus be a powerful and valuable approach to speed up research on and exploitation of S. dulcamara.
We here present a deep sampling of the S. dulcamara transcriptome and first assessment of its complexity. The transcriptome enabled development of SSR and SNP markers, of which the latter were used to generate the first genetic map of S. dulcamara. This map was compared to the maps of tomato, potato and eggplant in order to elucidate chromosomal evolution within the genus and to contribute to future gene mapping efforts.

Results and discussion
De novo transcriptome assembly Short reads from seventeen different S. dulcamara cDNA libraries that were sequenced using either Roche GS-FLX or Illumina HiSeq2000 sequencing technologies (Table 1; Additional file 1: Table S1) were combined to build de novo a consensus transcriptome using the Trinity package [14]. This resulted in an assembly of 32,157 contigs of more than 500 nts in size, with an average length of 1,346 nts (Table 2; Additional file 1: Table S2). The dataset encompasses 24,193 unigenes, of which 3,885 are clusters with multiple variants (Table 2; Figure 2). These variants are expected to comprise allelic variants, splice variants, nearly-identical paralogs or mis-assemblies. The sequences of all

Functional annotation BLAST annotation
To attach biological information to each contig, a multistep annotation workflow was designed (see Additional file 1; Table S2). First, sequence similarity search with BLASTx was performed against all tomato, potato and Arabidopsis predicted proteins as well as the UniProtKB/ Swiss-Prot sequence set [15]. According to this analysis, 85% of the contigs presented at least one match at an E-value of e -10 . No more than 47 contigs (0.15%) were found to have matches only to the UniProtKB/Swiss-Prot database, of which 30 were similar to sequences from viruses. Of these, 24 represented RNA replication and coat proteins from the potato virus M (PVM). This is in agreement with earlier findings of PVM in S. dulcamara [16,17], confirming it may serve as a reservoir for the virus from which it could move into potato. The remaining 17 contigs (0.05%) had significant matches to proteins from a wide spectrum of source organisms (from bacteria to human), and should be considered contaminations of the samples. Second, all the contigs that did not match any protein (4,829) were searched against the GenBank nucleotide non-redundant database with BLASTn. 1,913 contigs had correspondence to entries in the database at an e-value of e -10 . Most of the first hits (88%) were sequences coming from Solanaceae species, with tomato the most represented. These sequences most likely represent UTRs or as yet un-annotated protein coding loci. The remaining sequences were similar to nuclear genes in GenBank (5%), mitochondrial DNA (4%), plastid DNA (1%) or ncRNAs, repetitive elements and sequences annotated as genomic markers (2%). Finally, 2,916 contigs, equal to~9% of the assembled transcriptome, had no significant match in protein and nucleotide databases. The contigs in this dataset may encode for novel proteins, represent non-conserved UTR regions or are mis-assemblies.

Gene ontology and KEGG ortholog annotation
In order to describe gene functions in a standard and controlled vocabulary, we used the Blast2GO suite [18]. InterProScan searches were used to identify conserved protein domains in the S. dulcamara transcriptome and showed that 16,483 contigs (51.26%) had matches to conserved protein domains (Additional file 1: Table S2).
Mapping of the InterPro entries to gene ontology (GO) terms resulted in the assignment of 33,008 GO terms to 12,637 contigs (Additional file 1: Table S2). The 32,157 S. dulcamara contigs were also analysed with the KEGG Automatic Annotation Server (KAAS) [19]  All data combine to a high quality, thoroughly annotated draft of the S. dulcamara transcriptome.
Comparison of protein family structure between S. dulcamara and other plant species Multi-species transcriptome comparison may be used in order to identify orthologous gene groups, measure changes in the size of protein-coding gene families, study gene family evolution and detect taxonomically restricted sequences (i.e. species-specific or genus-specific sequences).

ORF/protein prediction
To be able to compare protein family structure between S. dulcamara and other plant species we first predicted the ORFs and protein sequences encoded by the S. dulcamara contigs. ESTScan [20] of the 32,157 contigs indicated that 26,696 contigs (~83%) contain putative coding sequences that could be translated into proteins. This is very similar to the percentage of contigs predicted to be protein coding by BLASTx (see above), with the slightly higher percentage of the latter probably explained by the fact that BLASTx better tolerates sequencing errors that result in frame shifts and premature stop codons than ESTScan. In total, 11,760 full-length proteins (i.e. containing a putative start and stop codon) and 14,936 truncated proteins (from partial coding sequences) were identified (Table 3). To confirm the reliability of the ESTScan prediction we carried out BLASTp searches of the predicted proteins against the tomato, potato and Arabidopsis protein complement. About 95% (25,418) of the S. dulcamara proteins had a significant match in at least one of these protein databases. Comparison of the BLASTp results with the BLASTx results of the same contigs revealed that in 99.9% of the cases, the best hit was identical. As a measure of the quality of our assembly, we also compared the size distribution of the subset of S. dulcamara fulllength proteins to the length distribution of the proteins encoded in the genomes of tomato (34,727) and potato (39,031), the two Solanum species for which a full genome sequence was published recently [12,13]. Although the number of S. dulcamara full-length proteins is three to four times smaller than the number of proteins in the tomato and potato genome, protein size in the three datasets shows a similar log-normal distribution ( Figure 3). Together, these results support the reliability of the assembly and the predicted protein data set.

OrthoMCL clustering
Orthologous gene groups were identified using orthoMCL [21]. The analysis included protein datasets from S. dulcamara, from the related Solanum species tomato and potato, as well as from the two model plant species Arabidopsis and rice. As the input for S. dulcamara we used the partial and full-length proteins predicted by ESTScan. To ensure that each locus was represented only once in the orthologous gene group analysis, only the longest predicted protein from each variant cluster was used. Similarly for the other species, only the longest protein variant encoded by a locus was used. A total of 164,689 protein sequences from the five species were clustered into 23,370 ortholog groups (see Additional file 1: Table S3). A consensus annotation was automatically assigned to each group based on the frequency (greater or equal to 0.33) of the most prevalent InterPro entry list. In case the threshold criterion was not met, the combination of the two most frequent InterPro entry lists was used.
In Figure 4, the number of orthologous and putative species-unique gene groups is shown. Of the 19,713 proteins from S. dulcamara, 15,073 were placed in a total of 13,518 gene groups with multiple members and 4,640 were not grouped and defined as species-specific singletons. As expected, a large part of the S. dulcamara gene groups (7,737) contained orthologs from all other species, thus representing genes that are highly conserved in flowering plants. High sequence conservation and high gene expression have been suggested to correlate [22][23][24], which may explain why the RNAseq-based S. dulcamara transcriptome has a slight bias towards highly conserved gene groups (43%), compared to the transcriptomes of tomato (31%) and potato (32%), which were derived from whole-genome sequencing. In S. dulcamara, as in the other species, many genes were species specific: 17 gene groups and 4,640 singletons.

Enrichment analysis
In order to understand which molecular functions were over-represented in the S. dulcamara specific set, we performed a GO enrichment analysis compared to all S. dulcamara proteins used for the OrthoMCL clustering. The analysis showed that genes associated with the molecular function terms "kinase activity" and "transporter activity" were most significantly overrepresented ( Table 4), suggesting that these type of genes have evolved relatively fast in S. dulcamara. When looking at the S. dulcamara genes common only to either or both of the other two Solanum species (i.e. not to Arabidopsis and rice), molecular functions associated to regulation of gene expression and regulation of protein activity were enriched (Table 4). Again, this suggests that S. dulcamara genes involved in regulatory processes evolved relatively rapidly. Because the S. dulcamara transcriptome is not complete and nearly-identical paralogs are likely to collapse into variant clusters in the assembly

SSR analysis
In-silico SSR prediction As a first effort aimed at the development and characterization of EST-based SSR markers in bittersweet, the S. dulcamara contigs were examined for the presence of SSR motifs using MISA (http://pgrc.ipk-gatersleben.de/ misa/; Additional file 1: Table S4). A total of 6,029 SSRs were identified ( Table 5). The frequency and the number of repeat units for each of the SSR motifs (except the mono-type repeats) are shown in Additional file 1: Table  S5. With the exception of mono-type and compound repeats, the average length of the SSRs was~16 nucleotides (range from 12 to 96 nucleotides). By exploiting the ORF/protein predictions from ESTScan, we also investigated the position of the SSRs along the transcripts ( Figure 5). Tri-nucleotide as well as hexa-nucleotide repeats are preferentially located in the coding regions, while the other types are more frequent in UTRs. In particular, di-, tetra-and penta-nucleotide are preferentially located in 5' UTRs, whereas mono-nucleotide SSRs are equally distributed between 5' and 3' UTRs. These results are in agreement with observations in [25]. Using the Pri-mer3 software [26], primer pairs to amplify each SSR were successfully designed for 4,233 transcripts and failed for 1,537 sequences (see Additional file 1: Table S4).

In-vitro SSR analysis
To verify the usefulness of the SSRs as genetic markers and assess genetic diversity within a panel of seven S. dulcamara accessions, the list of putative SSRs for which primers were designed was first reduced to 344 by filtering out all mono-nucleotide SSRs and considering only those contigs that have a single orthologous counterpart in the tomato genome (see methods for criteria). Subsequently, 12 SSRs that seemed to be polymorphic in our short read libraries were amplified (see Additional file 1: Table S6). Capillary gel separation showed that the primer pairs for all but one SSR amplified fragments of around the expected size. All these microsatellites were polymorphic, having 2 to 4 alleles ( Table 6). In all cases, heterozygous individuals were identified in our panel of seven plants, which is in agreement with the outcrossing biology of the species.  Penta-nucleotide (p5) 9 Hexa-nucleotide (p6) 22 Taken together, this shows that the set of SSRs and accompanying primer pairs is reliable and has potential use in the identification of marker-trait association as well as in the assessment of genetic diversity within S. dulcamara and between related species [27,28].

Genetic map construction
S. dulcamara contains a number of traits, such as several pathogen resistances, which may be valuable for agricultural purposes. To speed up gene mapping efforts in this species we generated a genetic map that is anchored to the high-quality genetic/physical maps of tomato. In addition to providing practically useful information on synteny and co-linearity between these two species, such a map also offers insight into genome evolution in the genus Solanum.

SNP identification
For map construction, we first identified sequence polymorphisms between accessions A54750069-1 and 944750001-2 for which an F 1 population was available [10] (Additional file 1: Table S1). By independently mapping the reads of the two parent genotypes to a subset of the contigs, 20,162 putative SNPs were identified (see methods for criteria). The observed transition:transversion ratio of 1.52 is very similar to what has been reported for SNP types in other plant species [29,30] (Additional file 1: Table S7).

Segregation analysis and map construction
A subset of SNPs was identified that was heterozygous in the female parent and homozygous in the male parent (see methods for criteria) and 96 of these were selected in such a way that the tomato orthologs of the corresponding contigs were distributed evenly over the tomato genome (Additional file 1: Tables S2 and S8). Segregation of 90 out of 96 selected SNP could successfully be determined in 94 F 1 individuals, using KASPar assays (http://www.lgcgenomics.com/). In addition, segregation of 108 AFLP and 27 CAPS markers was analysed in the same individuals. Linkage analysis using JoinMap [31] revealed the existence of 12 linkage groups, in agreement with the haploid chromosome number of the species. The resulting linkage groups ranged from 76 to 121 cM in size and harboured 10 to 30 markers each ( Figure 6; Additional file 1: Table S9). Synteny and co-linearity with tomato was studied using marker orthology (Figure 7; Additional file 1: Table S8).
To better understand chromosome evolution in the Solanaceae, the genetic map was subsequently compared to the integrative maps of tomato (S. lycopersicum), potato (S. tuberosum), eggplant (S. melongena), pepper (Capsicum annuum), tobacco (Nicotiana) and their deduced common ancestors, as presented by Wu and Tanksley [32] (Additional file 2: Figure S1).

Chromosomes co-linear with tomato
S. dulcamara (Sd) chromosomes 1, 3, 6, 8 and 9 are syntenic and fully co-linear with their respective tomato (T) counterparts. Co-linearity of Sd and T chromosomes suggests that their arrangement represents the ancestral tomato/bittersweet arrangement. In case of Sd1/T1, Sd3/ T3 and Sd8/T8 this is in agreement with the suggestion that also the more ancient common ancestor of tomato/ bittersweet and eggplant (the ancestral Solanum) is likely to have had these arrangements [32]. In case of chromosome 6, tomato and potato contain two small inversions, located at the top and bottom ends of the chromosome, when compared to eggplant and their common Solanum ancestor [32]. Whether these occurred before or after the deviation of bittersweet could not be evaluated, because only a single and no markers are present in those regions, respectively.
The finding that Sd9 is fully co-linear to T9 is surprising, because Wu and Tanksley [32] suggested that tomato harbours an inversion that arose after the split from potato. More detailed examination of this region in potato, however, suggests this inversion is also present, but was followed-up by a second inversion, unique to potato, largely restoring the eggplant like arrangement [33]. Thus, tomato and bittersweet have retained the arrangement ancestral to potato, in which there is a unique inversion. Figure 6 A genetic map of S. dulcamara. The map is based on 225 markers: 12 linkage groups (chromosomes) were identified. The genetic position (in cM) with respect to the first marker in the linkage group is indicated. Marker codes SD1 to SD96 refer to SNP markers analysed with KASPar assays, marker codes SD103 to SD122 refer to SNP markers analysed with CAPS assays, marker codes starting with E/S##M## refer to AFLP markers (see http://wheat.pw.usda.gov/ggpages/keygeneAFLPs.html for standard nomenclature). All other marker codes refer to published markers from potato.

Chromosomes with inversions compared to tomato
S. dulcamara chromosomes 2, 5, 7 and 10 are syntenic with their corresponding tomato counterparts, but show intra-chromosomal rearrangements (inversions). The difference between Sd2 and T2 is most easily explained by two subsequent inversions in the same region. The inversion between SD14 and SD16 (taking tomato as the reference; Figure 7) overlaps with an inversion seen in potato and eggplant and may thus represent the ancestral Solanum arrangement [32]. The second inversion, between SD15 and SD16 (Figure 7) is similar to one seen in eggplant. Wu and Tanksley [32] argued that this inversion is not present in the more distant relative pepper, and is thus eggplant-specific. Taking into consideration our data, however, the inversion might have been present in the common Solanum ancestor and reversed in the tomato/potato lineage, or alternatively, occurred independently in bittersweet and eggplant. Either scenario would imply repeated use of the breakpoint. However, it should be noted that detailed examination of this chromosome segment has shown that it has undergone complex rearrangements, making conclusive interpretations difficult [34].
Sd5 shows a double inversion with respect to T5, while having the same gene content. The large inversion from SD33 to SD36 (taking tomato as the reference; Figure 7) is tomato-specific, as the bittersweet structure is similar to that in potato and the ancestor of Solanum. The second, nested inversion between SD35 and SD36 (Figure 7) is shared with the more distant relative pepper. Wu and Tanksley [32] argued that this inversion was not easily recognisable in eggplant, which has complex rearrangements in this chromosomal region, and was thus already absent in the ancient Solanum, implying a reversal in bittersweet. A more parsimonious explanation would be that the bittersweet/pepper marker order is ancestral so Solanum, and that the SD35/SD36 region diverged independently in eggplant and tomato/potato lineages. The translocation of the bottom half of T5 and potato (Pt) chromosome 5 with respect to eggplant and their last common ancestor must have occurred early after the split from eggplant, as it is already present in bittersweet.
Sd7 has the same gene content as T7, but shows two inversions relative to tomato, potato and eggplant, which thus occurred in the bittersweet lineage, after its separation from the tomato/potato lineage.
Sd10 contains the same genes as T10, with two inversions. The large long-arm inversion is tomato-specific; the arrangement in bittersweet being ancestral to the genus Solanum. The origin of the smaller inversion at the top half is hard to deduce, because a similar rearrangement can be seen in eggplant and its more distant relative tobacco, but not in the intermediate relative pepper. Irrespective of the evolutionary order of the rearrangements, this means inversions with similar breakpoints have occurred multiple times in this region.

Chromosomes showing translocations relative to tomato
The gene content of tomato chromosomes 4, 11 and 12 is represented by S. dulcamara chromosomes 4, 11 and 12, but with a different inter-chromosomal arrangement. Sd4 consists of the upper part of T4 and the upper part of T12. Roughly the same upper part of T4 associates with parts of T11 in eggplant and the more distant relatives pepper and tobacco [32], suggesting that both, the arrangement in tomato-potato and in bittersweet are derived within their respective lineages. Likewise, the same upper part of T12 associates with parts of T5, T3/T9 and T6 in eggplant, pepper and tobacco, respectively. Because of this complexity, it is impossible to deduce ancestral chromosomal arrangements and the origin of the inversions between bittersweet and tomato. Repeated usage of the translocation breakpoints shows that these chromosomes are unstable over evolutionary time.
The top half of Sd11 has the same gene content as T11, but is inverted. The same inversion is not only seen in potato, but also in eggplant, pepper and tobacco, although the latter was not indicated by Wu and Tanksley [32]. This orientation thus is ancestral to Solanum. However, while the top of T11 is associated with the bottom part of T12 in bittersweet, the more ancestral association is with the upper part of T4. The bottom part of T12, in turn, has been associated to parts of T5, T4/ T3/T9 and T8 in eggplant, pepper and tobacco, respectively, again indicating repeated usage of a region as a translocation breakpoint.
Sd12 consists of the lower half of T4 and the bottom of half of T11, the latter being inverted in orientation. This inversion can also be seen in eggplant and pepper [32], and is therefore ancestral to Solanum. Again, these chromosomal segments have been associated to various different fragments during the evolution of the Solanaceae.

Conclusion
We present a variety of genomics resources for the nonmodel species S. dulcamara and demonstrate their use for functional, genetic and comparative analyses. The large-scale characterization of the bittersweet transcriptome provides a first catalogue of the S. dulcamara gene repertoire and allowed SNPs and SSRs to be identified and successfully used as genetic markers for the generation of a linkage map and the analysis of genetic diversity, respectively. We show that molecular markers derived from transcribed regions can be anchored to the genomes of related species for map comparison. Such information is very useful for gene mapping efforts, as we recently showed for mapping of the Rpi-dlc2 locus, which is located near the inversion breakpoint on chromosome 10, in comparison to tomato [10]. The observed chromosome inversions as deduced from the genetic map concur well with previously published data from other Solanaceae and support the position of S. dulcamara in the tomato/potato clade ("Clade I") [2]. Furthermore, the data sustain the notion that certain chromosomal regions are more likely to serve as inversion and translocation breakpoints. For chromosome Sd4, -11 and -12 we report a new chromosome composition of segments that in other species are also associated with translocations. For future research, the S. dulcamara transcriptome will serve as a reference for RNAseq gene expression profiling and be used to facilitate functional genomics studies. This is crucial to the identification of key regulators of important biological phenomena, such as adaptation to different environmental conditions and responses to biotic stressors. Together, this will allow us not only to target genes underlying important agronomic traits, but also help us understand and exploit the unique biology of this species.

Plant material
S. dulcamara material used to generate mRNA samples for RNAseq is described in Additional file 1: Table S1. Material used to test SSRs was provided by Dr Janny Peters (Radboud University Nijmegen, The Netherlands). The segregating population used for map construction (code 05-150) was derived from a cross between accession A54750069-1 and 944750001-2 [10]. All plants were cultivated in standard greenhouse conditions as described in [10], unless indicated otherwise.

RNA extraction and sequencing
Total RNA was isolated using Trizol (for the "Mixed" and "Leaves" libraries, see Table 1) or the Plant RNeasy kit (Invitrogen; for the "Stem & primordia" library, see Table 1) and treated with DNase. In case of the "Mixed" libraries, mRNA was purified and duplex-specific-nuclease-normalized cDNA samples were prepared and sequenced by Eurofins MWG Operon (Ebersberg, Germany) on the Roche GS-FLX platform. In case of the "Leaves" library, mRNA was purified and duplex-specific-nucleasenormalized cDNA samples were prepared and sequenced by Fasteris SA (Geneva, Switzerland). For the "Stem & primordia" library, mRNA was purified and cDNA samples were prepared and sequenced by Fasteris SA without prior normalisation.

De novo transcriptome assembly
Raw read filtering based on quality values and length was performed with the 'Trim sequences' algorithm in CLC Genomics Workbench v4.7.1 (CLCBio, Aarhus, Denmark). Default settings were used and low quality sequences (limit=0.05) and sequences no longer than 50 nts were removed. Although the assembler algorithm discarded low coverage k-mers, the raw reads were error corrected in order to speed up the assembly process. Therefore all sequence data except the Roche GS-FLX data was base-error-corrected with decGPU version 1.06 [35]. DecGPU was run with default settings. The decGPU algorithm output consisted of error free reads, fixed reads and discarded reads. For the assembly both error free and fixed reads were used. The decGPU process discarded 66M sequences (11% of total Illumina input sequences). All samples where pooled, both Roche GS-FLX and Illumina sets, and assembled using the de novo transcriptome assembler Trinity version 2011-10-29 (http://trinityrnaseq.sourceforge.net/) [14]. The Trinity assembly was run with a default fixed k-mer length of 25, minimal contig length of 500 bp, minimal k-mer coverage of 2 and a butterfly heap space size of 50GB.  [18] was used to identify InterPro entries that were mapped to GO terms. KAAS [19] was used to assign KO (KEGG orthologs) terms (representative gene data set for eukaryotes + plants) to S. dulcamara transcripts. The BBH (bi-directional best hit) option was used to map KO terms onto KEGG pathways, using the same program.

Identification and annotation of orthologous gene groups
ESTScan [20] was used to predict ORFs in the S. dulcamara transcriptome using the default Arabidopsis thaliana training matrix for peptide prediction. OrthoMCL (version 2.0.2) [21] was used to identify gene family groups among S. dulcamara (19,713), S. lycopersicum (34,727), S. tuberosum (39,031), A. thaliana (27,416), O. sativa (43,802). Enclosed within brackets, is reported the number of proteins used as input data, after removing all but the longest protein sequence in case of splice variants. All the resulting sequences were merged into a single FASTA file and all-versus-all comparisons were performed using BLASTp (e-value < e -5 ). For the MCL clustering algorithm we used an inflation value (-I) of 1.5 (OrthoMCL default). Consensus annotation of each gene group was automatically assigned based on of the most frequent InterPro entry list (frequency ≥ 0.33). In case the threshold criterion was not satisfied, the combination of the two most frequent InterPro entry lists was used. In case of Arabidopsis, rice and tomato we exploited the already available nterPro annotations (ftp://ftp. arabidopsis.org/home/tair/Genes/TAIR10_genome_release/ TAIR10_functional_descriptions; http://rapdb.dna.affrc.go. jp/download/archive/irgsp1/IRGSP-1.0_representative_2012-05-28.tar.gz; ftp://ftp.solgenomics.net/tomato_genome/ annotation/ITAG2.3_release/ITAG2.3_desc_and_GO.csv). In contrast, because no InterPro annotation is available at http://potatogenomics.plantbiology.msu.edu/index.html, we identified the InterPro protein domains within the potato sequence collection using the Blast2GO suite. The GO term enrichment analysis was performed using TopGO package (v2.8) from Bioconductor (http://www.bioconductor.org/ packages/2.12/bioc/html/topGO.html). Fisher's exact test (p-value < 0.01) was used to identify the over-represented GO terms.
Genomic DNA was isolated from seven individuals using a standard CTAB method [36,37]. PCR amplifications were performed in 20 μL volume and the reaction mixture contained 16 ng of genomic DNA, 100 μM of each primer, 400 μM of dNTPs each, 0.5 units of DreamTaq Polymerase (Fermentas), 1× DreamTaq DNA polymerase buffer containing 20 mM MgCl 2 . PCR conditions consisted of an initial denaturation step at 95°C for 2 min, followed by 40 cycles of 95°C for 20 sec, 55°C for 30 sec, and 72°C for 20 sec and a final step at 72°C for 30 min. Forward primers were labelled with WellRED D2-PA fluorescent dye (Sigma-Aldrich). 1 μL of PCR product and 0.3 μL of GenomeLab DNA Size Standard 400 (Beckman Coulter) were diluted with 30 μL sample loading solution (Beckman Coulter) and electrophoresis was performed on the CEQ 8800XL Genetic Analysis System (Beckman Coulter). SSR locus allele sizes were determined using the Beckman CEQ fragment analysis software.

SNP discovery and KASPar marker design
For SNP discovery, RNAseq sequence reads were first trimmed and mapped to the S. dulcamara transcriptome assembly using CLC Genomics Workbench v4.7.1 (CLCBio, Aarhus, Denmark). For trimming, low quality sequence (limit 0.05) and ambiguous nucleotides (if longer than 2 nucleotides) were removed and reads shorter than 50 nucleotides were discarded. As a template, we selected contigs with an average coverage of more than 20 and less than 10,000 reads. For read mapping, the minimal read coverage was set at 90% and the minimal alignment identity at 90%. Reads that could be mapped to multiple locations with the same score (repeats) were assigned randomly to one of these locations. SNP calling was done using an upgraded version of QualitySNP [38] (H. Nijveen, Wageningen University, The Netherlands, unpublished). To determine valid SNPs, the minimum similarity score per polymorphic site was set at 0.75 and the minimum similarity score of all polymorphic sites at 0.8, INDEL SNPs were marked as low quality and removal of low quality at sequence ends was disabled. For SNP selection, SNPs were taken as heterozygous in A54750069-1 when coverage was at least 10 reads and frequency of each allele was more than 30%. SNPs were taken as homozygous in 944750001-2 when coverage was at least 20 reads and the number of alternative allele reads was 0. Furthermore, the GC content of the 50bp flanking the SNP on either side was between 30 and 70% to aid in primer design. Primers for KASPar assays (ftp:// ftp.solgenomics.net/tomato_genome/annotation/ ITAG2.3_release/ITAG2.3_sgn_data.gff3) were designed by KBioscience (Hoddesdon, UK) and assays were performed according to the manufacturer's protocol on a Fluidigm EP1 system (Fluidigm Corporation, San Francisco, CA). A BLASTp-based analysis was performed to identify probable orthologous pairs as reciprocal best hits between bittersweet and tomato. BLASTp hits with bit scores lower than 100 were filtered out. In addition, BLASTp searches were performed to identify near identical paralogs within the reference species tomato. In case the bit score of a paralogous pair was higher than the score of the corresponding orthologous pair, the corresponding tomato transcript was regarded as having in-paralogs (i.e. paralogs arising from duplication after speciation) and, unless paralogs were located not more than 10 loci apart, discarded from the analysis. The position of tomato genes on the genetic map was estimated through the identification of the closest 3' and 5' marker from the tomato Kazusa F2-2000 linkage map available at the SGN ftp site [39]. Physical positions were retrieved from a gff3 file that contains alignments of marker sequences to the tomato pseudo-molecules (http://solgenomics.net/itag/release/ 2.3/list_files/ITAG2.3_sgn_data.gff3).

Genetic map construction and comparison
An S. dulcamara F 1 population containing 94 individuals was screened with 328 SNP, CAPS and AFLP markers (Additional file 1: Tables S8 and S9). Markers with >75% missing data points, completely linked markers (except for SD2/SD58, used for anchoring to the tomato map) and markers showing a segregation ratio significantly different from 1:1 (χ 2 test, p<0.005) (except for SD2/SD58, used for anchoring to the tomato map) were excluded prior to analysis. Regarding the KASPar assays, in one case no SNP was detected and in four cases the assay did not work properly. Matrix data was analysed using JoinMap v4.1 [31] with CP (cross-pollination) population type settings. For linkage analyses the regression and maximum-likelihood mapping algorithms were used, with Haldane's mapping function. Markers with different relative positions on the two generated maps and markers that could not be assigned to a linkage group were rejected. The positions of 225 markers, as obtained by regression mapping, are presented here. Rooting of groups was done with LOD>4. For comparison of the S. dulcamara genetic map with those of other Solanaceae as produced by Wu and Tanksley [32], all markers were mapped to the tomato Kazusa F2-2000 linkage map. Three markers were excluded from the genetic map because they suggested singleton translocations that were not corroborated by neighbouring markers: Agl (S. dulcamara chr1, 65.0 cM, S. lycopersicum chr4, 36.5 cM), SD27 (S. dulcamara chr11, 57.0 cM, S. lycopersicum chr4, 37.4 cM) and SD122 (S. dulcamara chr6, 63.0 cM, S. lycopersicum chr2, 35.2 cM).

Accession numbers
Raw sequence reads obtained from 454 and Illumina sequencing were submitted to the NCBI Short Read