Tomato breeding in the genomics era: insights from a SNP array
© Víquez-Zamora et al.; licensee BioMed Central Ltd. 2013
Received: 28 February 2013
Accepted: 20 May 2013
Published: 27 May 2013
The major bottleneck in genetic and linkage studies in tomato has been the lack of a sufficient number of molecular markers. This has radically changed with the application of next generation sequencing and high throughput genotyping. A set of 6000 SNPs was identified and 5528 of them were used to evaluate tomato germplasm at the level of species, varieties and segregating populations.
Of the 5528 SNPs, 1980 originated from 454-sequencing, 3495 from Illumina Solexa sequencing and 53 were additional known markers. Genotyping different tomato samples allowed the evaluation of the level of heterozygosity and introgressions among commercial varieties. Cherry tomatoes were especially different from round/beefs in chromosomes 4, 5 and 12. We were able to identify a set of 750 unique markers distinguishing S. lycopersicum ‘Moneymaker’ from all its distantly related wild relatives. Clustering and neighbour joining analysis among varieties and species showed expected grouping patterns, with S. pimpinellifolium as the most closely related to commercial tomatoes, in agreement with earlier results.
Our results show that a SNP search in only a few breeding lines already provides generally applicable markers in tomato and its wild relatives. They also show that the data generated with the Illumina bead array are highly reproducible. Our SNPs can roughly be divided into two categories: SNPs of which both forms are present in the wild relatives and in domesticated tomatoes (originating from common ancestors) and SNPs unique for the domesticated tomato (originating from after the domestication event). The SNPs can be used for genotyping, identification of varieties, comparison of genetic and physical linkage maps and to confirm (phylogenetic) relations. The SNPs used for the array show hardly any overlap with the SolCAP array, and it is strongly recommended to combine both SNP sets and to select a core collection of robust SNPs completely covering the entire tomato genome.
Landraces and wild relatives constitute a vast genetic resource that can be tapped to introduce novel traits into tomato breeding programmes. During the last decades, the focus has mainly been on the introduction of disease resistance genes. Within these breeding efforts, however, the lack of sufficient molecular markers in tomato has been a bottleneck in genetic and linkage studies. Although all known marker systems have been applied in tomato, most of them fall short in the genomics era, mostly because they are too laborious and too low throughput. These shortcomings are now being overcome by next generation sequencing projects and the identification of Single Nucleotide Polymorphisms (SNPs). The importance of SNPs as bi-allelic molecular markers is now widely recognized and their use is rapidly increasing [4, 5], since they have the advantage of being locus specific markers that can be scored co-dominantly in a flexible way. Technology has been developed for scoring single SNPs in thousands of different samples, all the way up to scoring millions of SNPs in a single sample.
Currently, the most widely used systems for high throughput SNP genotyping are the Illumina GoldenGate™ and Infinium™ arrays and the KBioscience Competitive Allele‒Specific PCR genotyping system (KASPar: http://www.kbioscience.co.uk) [7–9]. The evolution of genotyping technologies has resulted in unprecedented possibilities for evaluating germplasm collections, characterizing populations, and finding markers linked to specific alleles of important genes. SNPs are also markers of choice for studying evolutionary processes. Characterization of a large set of tomato varieties with a large number of markers can show the impact of breeding on the molecular level and the extent to which these markers are useful for variety identification [11, 12].
A custom-made whole genome tomato genotyping array using the Illumina® Infinium Beadarray technology (http://www.illumina.com) was constructed to generate a multiplexing platform [7, 14] to analyse tomato germplasm. A set of 5528 SNPs was used to evaluate more than a thousand tomato samples. This enabled us to compare data at the level of species, varieties and segregating populations. Within the Solanaceae Coordinated Agricultural Project (SolCAP: http://solcap.msu.edu/), Sim et al. also developed a genotyping array in 2012. However, they focused on different applications and, as we found out, used almost entirely different markers. We were interested in the extent to which our SNP collection, which is based on a limited number of genotypes, can be applied for variety identification, phylogenetic analysis, genetic mapping, evaluation of introgressions and germplasm identification.
Table: Validated SNPs and their distribution over the chromosomes. Rows: 454-seq on cDNA from breeding lines; Illumina Solexa on gDNA from breeding lines; Illumina Solexa on gDNA from introgression free varieties; markers from previous analysis; total SNPs per chromosome.
However, approximately 10% of the markers could not be reliably scored, mainly because of wrong automatic clustering by the GenomeStudio software. Closely linked markers in segregating populations can be used to find the correct score and the reasons for the mistakes in the automatic clustering (Additional file 1). Six percent of the SNPs resulted in no-calls (NCs). These markers were removed, resulting in 4072 SNPs for further analysis. Forty-eight percent of the markers that were monomorphic within S. lycopersicum were still useful because they were polymorphic within tomato wild relatives or between wild relatives and cultivated tomato.
Among the cherry varieties, Gardeners Delight did not have the cherry specific Chromosome 5, and this variety has somewhat larger fruits than what we considered as cherry. In total four round tomatoes clearly had the cherry specific Chromosome 5 (including R100), but after close inspection these were catalogued as deviating from round and more plum types.
Accessions from S. cheesmaniae and S. galapagense clustered together (Figure 8). In our study only approximately 30 SNPs (from 5528) were found between the accessions of S. cheesmaniae and S. galapagense in spite of clear phenotypical differences in leaf structure and trichomes.
The high reproducibility of the results for the 12 Heinz samples shows the robustness of the data obtained with the Infinium array. This was also evident from the comparison between cv Moneymaker and cv Moneyberg, where the only differences were a few NCs. Although the data were of high quality, individual SNP calls can be wrong. Wrong calls can be recognized in dense genetic linkage maps of a species for which the sequence is known. We observed errors in 10% of the SNPs when using the standard settings of the Illumina GenomeStudio software. Such errors can be corrected manually, or the affected 10% of the SNPs can be deleted. Since the amount of data is vast, enough data remained after deleting 10% of the SNPs. Reasons for errors can be DNA quality, presence of outliers (Additional file 1) within the germplasm and, in a few cases, double signalling due to duplications in the genome.
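The idea that wrong calls stand out in a dense linkage map can be illustrated with a minimal sketch. It is not the procedure used in this study: it assumes the calls of one individual are already ordered by map position, and it flags a call as suspect when its two neighbours agree with each other but not with it (an apparent double recombinant). The function name `flag_singletons` and the example data are hypothetical.

```python
def flag_singletons(calls):
    """Return indices of calls that differ from two agreeing neighbours.

    `calls` is a list of genotype codes ('A', 'B', 'H' or 'NC') for one
    individual, ordered by map position. No-calls are never flagged and
    never used as evidence.
    """
    suspects = []
    for i in range(1, len(calls) - 1):
        left, mid, right = calls[i - 1], calls[i], calls[i + 1]
        if "NC" in (left, mid, right):
            continue
        # Neighbours agree with each other but not with the middle call.
        if left == right and mid != left:
            suspects.append(i)
    return suspects


# Hypothetical individual: one isolated 'B' inside a run of 'A' calls.
example = ["A", "A", "B", "A", "A", "H", "H", "H"]
print(flag_singletons(example))
```

Flagged calls would still need manual inspection of the cluster plots, since a true double recombinant cannot be excluded by this check alone.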
Even though the SNPs were looked for in a limited number (4) of breeding lines of S. lycopersicum in combination with four introgression free varieties, they were polymorphic enough in the Solanum sect. Lycopersicon germplasm to discriminate varieties and species as well as to confirm phenetic relations. This implies that many of the SNPs originated from the time before domestication [21–23].
The S. lycopersicum specific markers must have evolved after this species separated from the others. These markers will be polymorphic in any interspecific cross (Figure 7). A relatively cheap SNP array with a limited number (as few as 20 per chromosome) of well distributed markers will be an excellent tool for a first fast characterization of any new interspecific mapping population involving S. lycopersicum. Based on our results, such an array can be easily developed.
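One possible way to choose a small set of well distributed markers per chromosome is sketched below. This is only an illustration under simple assumptions: markers are described by their physical position in base pairs, and "well distributed" is taken to mean closest to evenly spaced target positions. The function name `pick_spaced_markers` is hypothetical.

```python
def pick_spaced_markers(positions, n):
    """Pick up to n marker positions closest to n evenly spaced targets.

    `positions` are physical positions (bp) of candidate markers on one
    chromosome. If there are n or fewer candidates, all are returned.
    """
    pos = sorted(set(positions))
    if len(pos) <= n:
        return pos
    if n == 1:
        return [pos[len(pos) // 2]]
    lo, hi = pos[0], pos[-1]
    step = (hi - lo) / (n - 1)
    chosen = []
    for k in range(n):
        target = lo + k * step
        # Nearest candidate that has not been selected yet.
        best = min((p for p in pos if p not in chosen),
                   key=lambda p: abs(p - target))
        chosen.append(best)
    return sorted(chosen)


# Hypothetical chromosome with a candidate marker every 10 bp.
print(pick_spaced_markers(range(0, 101, 10), 3))
```

In practice the spacing might better be based on genetic rather than physical distance, given the suppressed recombination in pericentromeric regions.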
Furthermore, it is interesting to note that so many SNPs were found among the four breeding lines. This result was quite unexpected as S. lycopersicum is considered as a species with little genetic variation.
One application of the SNP array was to compare genetic with physical positions when working with mapping populations. On the genetic linkage maps most of the markers were in the expected order, identical to the order in the assembled tomato sequence. This confirmed the accuracy of the assembled tomato genome.
Some unassigned markers could be mapped to specific chromosomal positions in one or more of the linkage maps that we produced (results not shown). Comparison of genetic linkage maps and the physical linkage map also pointed out a misassembly on the long arm of chromosome 12 (between 48.8 Mbp and 61.7 Mbp; Additional file 6) in version 2.4 of the tomato genome. The data published by Sim et al. also suggest a disruption of marker order in the same region (see their Figure 3), but the conclusion that this might be due to a misassembly was not drawn. Markers should be used to genetically validate and further improve de novo genome assemblies.
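A comparison of marker order between a genetic and a physical map can be sketched as follows. This is a simplified illustration, not the analysis performed here: it assumes both maps are oriented in the same direction and only reports adjacent marker pairs (in physical order) whose genetic positions are reversed. The function name `order_conflicts` and the marker data are hypothetical.

```python
def order_conflicts(markers):
    """Return adjacent marker pairs whose genetic order contradicts the assembly.

    `markers` is a list of (name, physical_bp, genetic_cM) tuples for one
    chromosome. Markers are sorted by physical position, and a pair is
    reported when the genetic position decreases from one marker to the next.
    """
    by_phys = sorted(markers, key=lambda m: m[1])
    conflicts = []
    for a, b in zip(by_phys, by_phys[1:]):
        if b[2] < a[2]:
            conflicts.append((a[0], b[0]))
    return conflicts


# Hypothetical markers: m4 maps genetically before m3.
markers = [
    ("m1", 1_000_000, 0.0),
    ("m2", 2_000_000, 5.0),
    ("m3", 3_000_000, 12.0),
    ("m4", 4_000_000, 9.0),
    ("m5", 5_000_000, 20.0),
]
print(order_conflicts(markers))
```

A cluster of such conflicts confined to one region, seen consistently across several populations, would be more indicative of a misassembly than isolated pairs, which may simply reflect scoring errors.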
Several DNA profiling techniques have been used for variety identification. For tomato, one of the most extensive studies was done by Bredemeijer et al. in 2002 using simple sequence repeats (SSRs). They showed that 90% of the more than 500 varieties that were genotyped had a unique SSR profile using 20 markers (on average this is less than 2 markers per chromosome and one chromosome was even without markers). The SNP array covered between 150 and 900 markers per chromosome and all varieties could be distinguished, except the varieties Moneyberg and Moneymaker. That these two showed identical profiles means that they are highly related, if not identical. Both have been registered by the International Union for the Protection of New Varieties of Plants in the National Listing in Great Britain (UPOV: http://www.upov.int), so phenotypic differences must have been seen. Under the UPOV act of 1991 such varieties would likely be considered as essentially derived varieties. The SNP markers developed in our study will be very useful for establishing whether varieties are essentially derived from other varieties using the protocol developed for lettuce.
The trend to exploit genes from tomato wild relatives for specific traits enlarges the variation in cultivated tomato and the differences among varieties. Such introgressions can easily be detected using the SNP array, as we have shown for the Mi1.2 and TMV genes. When gene-specific (or closely linked) SNP markers are used, genotyping may substitute phenotypic assays even in variety registration, as was demonstrated by Arens et al. in 2010. The markers also allowed us to determine the level of heterozygous markers in present day varieties, which varied between zero and almost 45%. It is interesting to see that the highest numbers are found for some of the plum/cherry tomatoes. This is most likely because they are hybrids between round and cherry tomatoes, and the 955 cherry specific SNPs will contribute to a large number of heterozygous markers (Figures 4 and 5). High throughput SNP marker determination can be carried out at relatively low cost and is less laborious than other methods used. Therefore it is likely that SNP markers will be the markers of choice for variety identification and registration in future. However, it may be anticipated that SNP arrays will soon be replaced by complete sequencing of varieties.
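The level of heterozygous markers in a variety can be computed from the array calls along the following lines. This is a minimal sketch; excluding no-calls from the denominator is an assumption and may differ from how the percentages above were derived. The function name `heterozygosity` and the example calls are hypothetical.

```python
def heterozygosity(calls):
    """Fraction of heterozygous ('AB') calls among all called markers.

    `calls` is a list of 'AA', 'AB', 'BB' or 'NC' for one variety.
    No-calls are excluded from the denominator (an assumption).
    Returns None when no marker was called.
    """
    called = [c for c in calls if c != "NC"]
    if not called:
        return None
    return sum(c == "AB" for c in called) / len(called)


# Hypothetical variety: 2 heterozygous calls out of 5 called markers.
variety = ["AA", "AB", "BB", "AB", "NC", "AA"]
print(heterozygosity(variety))
```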
Many of the polymorphisms located on chromosomes 4, 5 and 12 were between round/beef and cherry tomatoes. This suggests that regions on these chromosomes are essential to get the full cherry tomato phenotype and that there is selection for these regions in breeding programs for cherry tomatoes. The fact that whole chromosomes (4, 5 and 12) appear to be involved is possibly due to suppression of recombination in the large pericentromeric regions [29, 30]. This is not the case on Chromosome 1, where the cherry region is a hotspot of recombination as shown in a RIL population of S. lycopersicum and S. pimpinellifolium under study (unpublished observations by the authors).
Cherry type tomatoes have more SNPs in common with S. pimpinellifolium accessions than the round/beef varieties, indicating that cherry tomatoes are closer to this wild relative than round and beef commercial lines. The varieties chosen for SNP selection might have been the reason that so many cherry specific markers were found. The SolCAP array also revealed different patterns of genetic variation, particularly for chromosomes 2, 4, 5, 6 and 11. For chromosomes 4 and 5 this is probably also due to the cherry/round differences we observed. In general, relatively little is known about genomic regions distinguishing cultivated tomato gene pools.
Some regions are known to contain genes/QTLs that are related to differences between cherry and round tomatoes. For instance, a QTL for fruit weight and soluble solids content is found on chromosome 2, and QTLs for yield, brix, fruit weight, fruit shape, colour and epidermal reticulation have been mapped on chromosome 4. Chromosome 5 is known to harbour QTLs for fruit colour, and QTLs for viscosity traits related to total red yield and pH are known on chromosome 12 [33, 34].
Our SNP based phenetic trees were comparable to the ones made by Bretó et al. in 1993 using isozymes, Palmer & Zamir in 1982 and Spooner et al. in 1993 with chloroplast DNA, McClean & Hanson in 1986 with mitochondrial DNA, Miller & Tanksley with genomic DNA, Marshall et al. in 2001 with internal transcribed spacer (ITS) region of nuclear ribosomal DNA sequences, and also Alvarez et al. in 2001 with microsatellite markers. Peralta et al. performed the most extensive taxonomic study of tomato and its wild relatives and our results confirm their findings.
In our analysis we found S. pimpinellifolium as the closest wild relative to S. lycopersicum, which is similar to observations made by Grandillo et al. and The Tomato Genome Consortium in 2012. The cherry tomato is considered either as a domesticated group or as an admixture of S. pimpinellifolium and S. lycopersicum. S. cheesmaniae and S. galapagense are also very closely related to the domesticated tomato. Introgressions in the cultivated germplasm can affect the similarity weight in the relationships between S. pimpinellifolium, S. galapagense and S. cheesmaniae on one hand and S. lycopersicum hybrids on the other hand. For phylogenetic studies it is important to define the initial germplasm and its characteristics. In the case of S. habrochaites and S. pennellii the increased number of NCs decreased the resolution.
For our custom made array, the SNP selection was based on commercial breeding lines. Sim et al. [15, 31] developed a large SNP genotyping array using commercial varieties. To evaluate if the same SNPs were present, the precise SNP positions from both arrays were compared (allowing a window of ± 3 base pairs). Only 98 SNPs, less than 2% of our SNPs, were found in the exact same position or within the allowed window. This means that there is still a large number of SNPs to be discovered in tomato. For further comparisons between the two arrays we made the SNPs, including the flanking sequences, available at: http://www.plantbreeding.wur.nl/Publications/SNP/4072SNP-Sequences.xlsx.
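The position comparison with a ± 3 bp window can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the script used for the comparison: SNP positions of each array are assumed to be given per chromosome on the same genome version, and the function name `count_shared` and the example positions are hypothetical.

```python
from bisect import bisect_left


def count_shared(ours, theirs, window=3):
    """Count SNPs in `ours` that have a SNP in `theirs` within +/- window bp.

    Both arguments map a chromosome name to a list of SNP positions (bp)
    on the same genome version.
    """
    shared = 0
    for chrom, positions in ours.items():
        ref = sorted(theirs.get(chrom, []))
        for p in positions:
            # First reference position at or beyond the lower window bound.
            i = bisect_left(ref, p - window)
            if i < len(ref) and ref[i] <= p + window:
                shared += 1
    return shared


# Hypothetical positions: two of our four SNPs fall within the window.
ours = {"ch01": [100, 500, 900], "ch02": [50]}
theirs = {"ch01": [102, 510, 897], "ch03": [50]}
print(count_shared(ours, theirs))
```

With the numbers reported above, 98 shared SNPs out of 5528 correspond to roughly 1.8%.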
Our results show that a SNP search in only a few breeding lines permitted the development of markers generally applicable in tomato and its wild relatives and, furthermore, that the Illumina bead array generated highly reproducible data. Our SNPs can be roughly divided into two categories: SNPs of which both forms are present in the wild relatives and in domesticated tomatoes, and SNPs unique for the domesticated tomato. The SNPs can be used for genotyping, identification of varieties, comparison of genetic and physical linkage maps and to confirm phylogenetic relations. There is hardly any overlap with the SolCAP array, and we suggest combining both SNP sets and selecting a core collection of robust SNPs completely covering the tomato genome for the development of future arrays.
Tomato germplasm was obtained from the collection of Wageningen UR Plant Breeding, The Netherlands; the Tomato Genetics Resource Center (TGRC) at University of California, Davis; the Centre for Genetic Resources (CGN), The Netherlands; and from the breeding companies Monsanto, RijkZwaan, Takii, Vilmorin & Cie (VCo), ENZA and Syngenta. The evaluated material included hybrid varieties of the project within the Centre of Biosystems Genomics (CBSG: http://www.cbsg.nl). Based on QTL model predictions, four breeding lines were chosen to obtain a large diversity in taste related characteristics. A half diallel was made with the four breeding lines, resulting in six segregating populations. The parents were C74 (cherry, orange), C85 (cherry, red), R75 (round, yellow), and R104 (round, red). Further material included landraces, hybrids, commercial varieties, accessions of tomato wild relatives and mapping populations (Additional file 2: Table S1). The genotyping results of the varieties with the used SNPs can be found in Additional file 3.
Genomic DNA from young leaflets was extracted following a CTAB based protocol [44, 45] adjusted for high throughput isolation. Two young leaflets were ground with a Retsch 300 mm shaker (Retsch BV, Ochten, The Netherlands) using 1 ml micronic tubes (Micronic BV, Lelystad, The Netherlands). The DNA pellets were washed in 76% EtOH with 10 mM NH4Ac before re-suspending the DNA in TE buffer.
Total RNA was isolated using TRIzol reagent according to the manufacturer’s instructions (Roche, Switzerland) and finally treated with DNaseI (Invitrogen).
Total RNA was isolated from the four chosen breeding lines (C74, C85, R75 and R104), and cDNA was made at Vertis Biotechnologie AG (Freising, Germany: http://www.vertis-biotech.com/). The 454 sequencing gave 1.3 × 10⁶ reads with a median length of 400 base pairs. The reads were aligned to the tomato genome (v2.10) and SNPs were called using QualitySNPng after it had been adapted for large numbers of reads. After Tomato v2.30 became available, the SNP positions were renamed based on this version.
A potential risk with the four breeding lines was that primarily interspecific SNPs would be found due to introgressed regions originating from tomato wild relatives (Additional file 2: Table S3). To include additional intraspecific (S. lycopersicum) variation, four introgression free varieties were also included in the Illumina/Solexa sequencing. To reduce the complexity, genomic DNA (gDNA) of the eight different samples (C74, C85, R75, R104 and the introgression free varieties Ailsa Craig/round, Rutgers/beef and Gardeners Delight/cherry plus the reference line Heinz/round) was digested with restriction enzyme MboI (four cutter) and the 400–600 bp fraction was cut out of a 1.5% agarose gel and purified. Theoretically, this should result in a coverage of at least 23x per fragment. After Illumina sequencing, 15 × 10⁶ fragments were blasted against the Heinz v2.10 contigs and compared. The Illumina reads of 72 base pairs were aligned with the software tool Bowtie (>95% similarity). After alignment, SNPs were called with VarScan (variant detection in massively parallel sequencing data). All SNPs with a minimal coverage of three in a genotype were listed in Excel. A SNP was called when it was present in at least six reads in one genotype and six reads in another genotype.
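The read-support rule for calling a SNP can be illustrated with a short sketch. It rests on one interpretation of the rule, labelled here as an assumption: a site counts as a SNP when one allele is supported by at least six reads in one genotype and a different allele by at least six reads in another genotype. The function name `is_snp` and the read counts are hypothetical, and this is not the VarScan implementation.

```python
def is_snp(read_counts, min_reads=6):
    """Decide whether a site qualifies as a SNP between genotypes.

    `read_counts` maps genotype name -> {allele: number of supporting reads}.
    Assumed rule: two different genotypes must each support a different
    allele with at least `min_reads` reads.
    """
    supported = [
        (genotype, allele)
        for genotype, alleles in read_counts.items()
        for allele, n in alleles.items()
        if n >= min_reads
    ]
    return any(
        g1 != g2 and a1 != a2
        for g1, a1 in supported
        for g2, a2 in supported
    )


# Hypothetical site: 'A' well supported in C74, 'G' well supported in R104.
site = {"C74": {"A": 9, "G": 0}, "R104": {"A": 1, "G": 7}, "R75": {"A": 4, "G": 0}}
# Hypothetical site with too few reads for the alternative allele.
weak = {"C74": {"A": 9}, "R104": {"G": 5}}
print(is_snp(site), is_snp(weak))
```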
Putative SNPs and their flanking regions were blasted against the then available contig sequences of tomato (Tomato WGS contigs v2.10) in order to choose SNPs as dispersed as possible over the genome, with at least one SNP per contig where possible. Later, the availability of the tomato genome sequence (Tomato WGS chromosomes v2.3) allowed us to assign the SNPs to their physical location. A total of 6000 SNPs with two times 50 bp flanking sequences of Heinz were used for designing the oligos for the Illumina beadarray. After the oligos were synthesized, ~8% of them did not comply with the quality standards set by Illumina and were discarded, leaving 5528 SNP markers per array.
Solanum sp. DNA samples with a concentration of 50 ng/μl were sent to ServiceXS, Leiden, The Netherlands, where 4 μl was processed according to the Infinium HD Ultra Assay protocol and used for hybridization onto the BeadChip.
All the SNPs were named after their position on the SL2.30 version of the tomato genome sequence published online by the International Tomato Genome Sequencing Project (http://solgenomics.net/). This version contains approximately 85% of the tomato genome sequence. The lacking sequences are mostly highly repetitive or heterochromatic regions.
The Genotyping Module 1.9.4 of Illumina’s GenomeStudio® V2011.1 software package was used to analyse the genotyping results under default settings. The software assigned allele calls (‘GeneCall’) according to the intensity signals obtained, resulting in an [AA], [BB] or [AB] call or a non-call (NC) for each SNP. Advanced assembling within each corresponding analysis was performed, and manual inspection and adjustment were carried out in order to optimize call rates in the case of questionable SNPs. In those particular cases, and based on knowledge of the segregation patterns within the material, clustering errors were identified and amended.
Before further analysis, markers that were more than 98% monomorphic were removed, as well as markers with more than 25% heterozygosity in accessions or breeding lines. Finally, markers with a large number of NCs were also removed. For this, two thresholds for the percentage of NCs were used: more than 20% NCs among the commercial hybrids and/or more than 50% among wild relatives.
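The filtering thresholds described above can be expressed as a short sketch. It makes several assumptions that are not specified in the text: "more than 98% monomorphic" is taken to mean that the most frequent genotype exceeds 98% of all called samples, heterozygosity is computed among called accessions or breeding lines only, and the sample groups are passed under the hypothetical keys `hybrids`, `lines` and `wild`. The function name `keep_marker` is likewise hypothetical.

```python
from collections import Counter


def keep_marker(calls_by_group, max_major=0.98, max_het=0.25,
                max_nc_hybrids=0.20, max_nc_wild=0.50):
    """Return True if a marker passes all filters.

    `calls_by_group` maps 'hybrids', 'lines' and 'wild' to lists of calls
    ('AA', 'AB', 'BB' or 'NC') for that marker.
    """
    def frac(calls, value):
        return sum(c == value for c in calls) / len(calls) if calls else 0.0

    # No-call thresholds per sample group.
    if frac(calls_by_group["hybrids"], "NC") > max_nc_hybrids:
        return False
    if frac(calls_by_group["wild"], "NC") > max_nc_wild:
        return False
    # Heterozygosity among called accessions or breeding lines.
    lines_called = [c for c in calls_by_group["lines"] if c != "NC"]
    if frac(lines_called, "AB") > max_het:
        return False
    # Near-monomorphic markers: most frequent genotype among all called samples.
    all_called = [c for g in calls_by_group.values() for c in g if c != "NC"]
    if not all_called:
        return False
    major = Counter(all_called).most_common(1)[0][1] / len(all_called)
    return major <= max_major


# Hypothetical markers: one informative, one monomorphic, one with many NCs.
good = {"hybrids": ["AA", "AB", "BB", "AA", "AB"],
        "lines": ["AA", "BB", "AA", "BB"],
        "wild": ["BB", "BB", "NC", "AA"]}
mono = {"hybrids": ["AA"] * 5, "lines": ["AA"] * 4, "wild": ["AA"] * 4}
noisy = {"hybrids": ["NC", "NC", "AA", "BB", "AA"],
         "lines": ["AA", "BB"], "wild": ["BB", "AA"]}
print(keep_marker(good), keep_marker(mono), keep_marker(noisy))
```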
When specific populations were evaluated, synchronization of parental lines together with the corresponding offspring was performed. This means that, for each analysis alleles were sorted according to the parent lines and replaced by a specific allele designation (A or B) for each parent.
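The synchronization step can be sketched as follows. This is a minimal illustration that assumes both parents are homozygous and carry different alleles at the marker; markers that do not meet this condition are treated as uninformative. The function name `recode_by_parents` and the output codes ('A', 'B', 'H' for heterozygous, '-' for missing) are hypothetical choices.

```python
def recode_by_parents(parent_a, parent_b, offspring):
    """Recode array calls of offspring into parental allele designations.

    `parent_a` and `parent_b` are the array calls ('AA', 'AB', 'BB' or 'NC')
    of the two parents for one marker; `offspring` is a list of calls.
    Returns 'A' (parent A genotype), 'B' (parent B genotype), 'H'
    (heterozygous) or '-' (missing or uninformative) per offspring.
    """
    homozygous = {"AA", "BB"}
    if (parent_a not in homozygous or parent_b not in homozygous
            or parent_a == parent_b):
        # Marker cannot be assigned to parents in this cross.
        return ["-"] * len(offspring)
    code = {parent_a: "A", parent_b: "B", "AB": "H"}
    return [code.get(call, "-") for call in offspring]


# Hypothetical marker: parent A carries [BB], parent B carries [AA].
print(recode_by_parents("BB", "AA", ["BB", "AB", "AA", "NC"]))
```

After this recoding, the same designation always refers to the same parent across markers, regardless of which allele the array happened to label A or B.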
For cluster analysis the genotype calls were converted into numerical values: [AA]=1, [AB]=2, [BB]=3. Cluster analyses were done using the Jukes-Cantor similarity measure with 1000 bootstraps. Neighbour joining analysis using the Manhattan similarity measure with an out-group rooting and 1000 bootstraps was performed using the statistical package PAST version 2.12. The BioNJ analysis was carried out using SplitsTree version 4.6 with 1000 bootstraps.
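The numerical coding and a Manhattan-type distance between two samples can be sketched as below. This is not the PAST implementation: skipping markers with a no-call in either sample and dividing by the number of compared markers are assumptions made for this illustration, and the function name `manhattan` is hypothetical.

```python
# Numerical coding of genotype calls as described in the text.
CODES = {"AA": 1, "AB": 2, "BB": 3}


def manhattan(sample_x, sample_y):
    """Mean absolute difference of coded calls between two samples.

    Markers with a no-call in either sample are skipped (an assumption).
    Returns None when no marker can be compared.
    """
    pairs = [
        (CODES[x], CODES[y])
        for x, y in zip(sample_x, sample_y)
        if x in CODES and y in CODES
    ]
    if not pairs:
        return None
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical samples: differences of 0, 1 and 2 over three shared markers.
x = ["AA", "AB", "BB", "NC"]
y = ["AA", "BB", "AA", "AA"]
print(manhattan(x, y))
```

With this coding, opposite homozygotes contribute twice as much to the distance as a homozygote versus a heterozygote, which matches the number of allele differences.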
Data visualization heat maps were made in GeneMaths XT 2.12 (Applied Maths). Linkage maps were constructed using JoinMap® version 4.1 (Kyazma©). The default calculation parameters were adjusted to cope with the large number of markers. In the similarity thresholds, the option ‘show individual pairs with a similarity larger than’ was decreased from 0.95 to 0.7. Recombination frequency was used as a grouping parameter and the linkage parameters were set to take all LOD values from 0 to 100. The ‘Show strong linkages with a rec. freq. larger/smaller than’ options were set to 0.5/0. The maximum number of linkages to show per locus was set to 0. As algorithm we used the ML (Maximum Likelihood) mapping option and, within the map building, the spatial sampling thresholds were set to 0.1 for the first and to 0 for the rest. The ‘Number of map optimization rounds per sample’ was fixed to 1. Thereafter, linkage groups were compared with chromosomal distribution in the physical maps using MapChart 2.2.
This project was carried out within the research programme of the Centre of BioSystems Genomics (CBSG) which is part of the Netherlands Genomics Initiative / Netherlands Organization for Scientific Research.
We would like to acknowledge Fien Meijer-Dekens for collecting and maintaining all the tomato samples at WUR-Plant Breeding. We also thank the department of Bioinformatics of Wageningen UR for their support in the development of the SNP markers and in allocating them to the different versions of the Tomato Genome Sequence.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.