Genomic variation in tomato, from wild ancestors to contemporary breeding accessions
- José Blanca†1,
- Javier Montero-Pau†1,
- Christopher Sauvage2,
- Guillaume Bauchet2, 3,
- Eudald Illa4,
- María José Díez1,
- David Francis4,
- Mathilde Causse2,
- Esther van der Knaap†4Email author and
- Joaquín Cañizares†1Email author
© Blanca et al.; licensee BioMed Central. 2015
Received: 8 October 2014
Accepted: 6 March 2015
Published: 1 April 2015
Domestication modifies the genomic variation of species. Quantifying this variation provides insights into the domestication process, facilitates the management of resources used by breeders and germplasm centers, and enables the design of experiments to associate traits with genes. We described and analyzed the genetic diversity of 1,008 tomato accessions including Solanum lycopersicum var. lycopersicum (SLL), S. lycopersicum var. cerasiforme (SLC), and S. pimpinellifolium (SP) that were genotyped using 7,720 SNPs. Additionally, we explored the allelic frequency of six loci affecting fruit weight and shape to infer patterns of selection.
Our results revealed a pattern of variation that strongly supported a two-step domestication process, occasional hybridization in the wild, and differentiation through human selection. These interpretations were consistent with the observed allele frequencies for the six loci affecting fruit weight and shape. Fruit weight was strongly selected in SLC in the Andean region of Ecuador and Northern Peru prior to the domestication of tomato in Mesoamerica. Alleles affecting fruit shape were differentially selected among SLL genetic subgroups. Our results also clarified the biological status of SLC. True SLC was phylogenetically positioned between SP and SLL and its fruit morphology was diverse. SLC and “cherry tomato” are not synonymous terms. The morphologically-based term “cherry tomato” included some SLC, contemporary varieties, as well as many admixtures between SP and SLL. Contemporary SLL showed a moderate increase in nucleotide diversity, when compared with vintage groups.
This study presents a broad and detailed representation of the genomic variation in tomato. Tomato domestication seems to have followed a two step-process; a first domestication in South America and a second step in Mesoamerica. The distribution of fruit weight and shape alleles supports that domestication of SLC occurred in the Andean region. Our results also clarify the biological status of SLC as true phylogenetic group within tomato. We detect Ecuadorian and Peruvian accessions that may represent a pool of unexplored variation that could be of interest for crop improvement.
The domestication process of crop plants led to dramatic phenotypic changes in many traits that result from changes in the genetic makeup of the wild species ancestors [1,2]. The analyses of genomic variation and the structure of genetic diversity of cultivated crops and their wild relatives provides insights into the history of domestication, adaptation to local environments, and breeding [3,4]. The resulting analyses offer valuable information for germplasm management and the exploitation of natural variation to improve crops.
Cultivated tomato (Solanum lycopersicum L.) (SL) is a member of the family Solanaceae, genus Solanum L., section Lycopersicon . Its wild relatives are native to western South America, including the Galapagos Islands. S. pimpinellifolium L. (SP) is thought to be the closest wild ancestor to cultivated tomato [5-7]. SP accessions are found in Coastal Peru and Ecuador and are divided in three main genetic groups corresponding to the environmental differences found in the coastal regions of Northern Ecuador, in the montane region of Southern Ecuador and Northern Peru, and the coastal region of Peru [8,9].
S. lycopersicum is divided into two botanical varieties: S. l. var. cerasiforme (Dunal) Spooner, G.J. Anderson & R.K. Jansen (SLC) and S. l. var. lycopersicum (SLL). SLC is native to the Andean region encompassing Ecuador and Peru, but it is also found in the subtropical areas all over the world . SLC grows either as a true wild species, in home gardens, along roads, sympatrically with tomato landraces, or as a cultivated crop . SLC thrives in the humid environments of Ecuador and Peru at the eastern edge of the Amazon basin whereas SP occupies the drier Peruvian coasts and valleys and the wetter Ecuadorian coast [9,11,12]. Although there is no reproductive barrier between SP and SLC , the Andes mountains impose strong physical and ecological barriers for cross reproduction among these species.
Many details of tomato domestication remain debated, especially regarding the role of SLC in this process. The South American SLC native to the Ecuadorian and Peruvian Andes has been proposed to be an evolutionary intermediate between SP and cultivated SLL [6,9,14] or, alternatively, an admixture resulting from the extensive hybridization between SP and SLL [15,16]. The location of tomato domestication also remains uncertain. Both Mesoamerica  or Ecuador and Northern Peru, near the center of origin of SP , have been proposed as the center of domestication. If the former were true, SLC would have had to migrate north to Mesoamerica as a wild or weedy species, where it would have been domesticated into SLL. Instead, a two-step domestication process has been proposed for tomato . The first step would have consisted of a selection from SP or primitive SLC by early farmers resulting in the Ecuadorian and Northern Peruvian SLC. The second step likely occurred in Mesoamerica, and consisted of further selection from these pre-domesticated SLC after their migration from Ecuador and Peru. This second step completed the domestication process of tomato. Genetic data confirmed that European SLL accessions originated from Mesoamerica and constitute the genetic base of the SLL vintage varieties . It has also been proposed that a genetic bottleneck was associated with the migration of SLL from Mesoamerica to Europe [18-20]. Blanca et al.  proposed that the main bottleneck happened during the migration from Peru and Ecuador.
Extensive breeding efforts have modified tomato over the last 100 years. Breeding goals were focused on improving SLL for disease resistance, adaptation to diverse production areas, yield and uniformity. These efforts resulted in the introduction of many introgressions from SP and more distant tomato relatives , leading to a broadening of the genetic diversity of SLL [21-23]. Another consequence of these breeding programs was the selection for specific traits that are characteristic of the fresh and processing markets which has led to further diversification and genetic differentiation among market classes.
The traits that most likely have been selected during the domestication of tomato were fruit weight and, to a lesser extent, shape. In recent years, several genes affecting these traits have been identified [24-29]. As the underlying polymorphism causing the change in allele function for all these genes is known, the presence of the derived and ancestral alleles is easily sampled. For example, in vintage SLL the majority of the shape diversity is explained by the derived alleles of the FAS, SUN, OVATE and LC genes . What is not well understood is when and where these alleles arose and how they spread through the germplasm. Quantifying the allele frequency of the loci among the SP and SLC populations will help to elucidate the process of selection that is at the foundation of tomato domestication.
The aim of this study was to better delineate the evolutionary history of tomato including its domestication. By using a dataset with over 7,000 SNPs and 1,008 accessions of SP, SLC and SLL we aim to compare and contrast the genome-wide molecular diversity of populations spanning the entire red-fruited clade. Additionally, the allele frequency of six fruit weight and shape genes have been measured in order to elucidate the domestication process.
Plant material and passport data
We analyzed 1,008 tomato accessions from the species representing the red-fruited clade of tomato (Additional file 1: Table S1). Of these, 912 corresponded to accessions genotyped in studies conducted at COMAV, Spain , through the Solanaceae Coordinated Agricultural Project (SolCAP) in the USA  and INRA, France . These data sets were combined with an additional set of 96 accessions originating from vintage and processing germplasm genotyped in Ohio (62), and from the COMAV collection (34). Altogether, these 1,008 accessions represent 952 uniquely named accessions. Several accessions were independently genotyped in different experiments. For example, Moneymaker was represented several times and these duplicates were used for quality control of the genotyping results between the laboratories. The number of uniquely named accessions per species, according to their passport data, were: Solanum lycopersicum var. lycopersicum (SLL; 530 accessions), S. l. var. cerasiforme (SLC; 316 accessions), S. pimpinellifolium (SP; 145 accessions), Solanum galapagense S.C.Darwin & Peralta (SG; 4 accessions), Solanum neorickii D.M.Spooner, G.J.Anderson & R.K.Jansen (SN; 1 accession), Solanum chmielewski (C.M.Rick, Kesicki, Fobes & M.Holle) D.M.Spooner, G.J.Anderson & R.K.Jansen (SChm; 1 accession), crosses between S. lycopersicum and S. pimpinellifolium (SL x SP; 10 accessions), and one hybrid between S. l. lycopersicum and S. pennellii. The hybrids were included to determine the ability of detecting heterozygous SNPs with the genotyping platform.
A unified passport classification, which includes species name, collection site and use, was compiled for all accessions based on the information retrieved from the different sources and donors (Additional file 1: Table S1). For SP and SLC, the passport classification mainly reflected the collection site. An additional category for SLC was introduced as “SLC commercial cherry” to group the SLC accessions with a commercial purpose. For SLL, the vintage, landrace and heirloom categories were grouped together and classified collectively as vintage consistent with the nomenclature of Williams and St. Clair . Additionally, a category was created in SLL to include the early breeding lines such as Moneymaker and Ailsa Craig. The SLL accessions derived from crop improvement programs currently active (i.e. contemporary to the time of writing) were categorized based on use (fresh market or processing) and location of breeding. Overall, sufficient information was available for 84% of the accessions to classify them beyond the species level. In cases where this was not possible, the passport classification only reflected the species (i.e., SP, SLC or SLL). For 48.3% of the accessions, geographic location information was available in the form of Global Positioning System (GPS) coordinates or from the location of its collection site (Additional file 1: Table S1).
Genotyping and data set merging
All samples were genotyped using the Tomato Infinium Array (Illumina Inc., San Diego, CA, USA) developed by the United States Department of Agriculture (USDA) funded SolCAP project (http://solcap.msu.edu/). The SolCAP SNP discovery work-flow was described , as were details of the array . The genotyping array contained probes for 8,784 biallelic SNPs. These SNPs represented a highly filtered and selected set, based on transcriptome sequence for SLL, SLC, and SP, optimized for polymorphism detection and distributed throughout the genome. Of these, 7,720 SNPs (88%) passed manufacturing quality control . All SNPs on the array have been incorporated into the Solanaceae Genome Network database (http://solgenomics.net/), the SNP annotation file is available (http://solcap.msu.edu/tomato_genotype_data.shtml), and sequences are available through the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (study summary SRP007969; accession numbers SRX111556, SRX111557, SRX111558, SRX111845, SRX111848, SRX111849, SRX111850, SRX111853, SRX111857, SRX111858, SRX111859, SRX111862, SRX111861).
Genomic DNA was isolated from fresh young leaf tissue. DNA concentrations were quantified using the PicoGreen assay (Life Technologies Corp., Grand Island, NY, USA) and diluted to 50 ng/μl in TE buffer (10 mM Tris–HCl pH 8.0, 1 mM EDTA). Genotyping was performed using 250 ng of DNA per accession following the manufacturer’s recommendations. The intensity data were analyzed in GenomeStudio version 1.7.4 (Illumina Inc., San Diego, CA, USA). The automated cluster algorithm generated from the SolCAP project was used to obtain initial SNP calls. Visual inspection was used to assess the default clustering of each SNP, and calls were modified when the default clustering of a SNP was not clearly defined.
There are three methods for SNP calling for the Illumina Infinium array: relative to the reference (also known as customer), the design (also known as Illumina) or the TOP strand (a designation based on the polymorphism itself and its flanking sequence). To merge data sets from three different laboratories that had used different SNP calling methods, we developed a Python script to facilitate detection, reorientation and merging of the data such that all SNPs are called relative to the design strand (the script is available upon request to J. Blanca).
Selection of SNPs for downstream analyses
The accessions were genotyped with 7,720 SNPs (Additional file 2: Table S2) that passed the manufacturing quality control and constituted the raw data set. Of those, we removed 240 markers (3.1%) that had more than 10% missing data and 1137 (14.7%), which had a major allele frequency above 0.95. For all analyses, except for the rarefaction and the linkage disequilibrium (LD), SNPs that mapped closer than 0.1 cM were removed as well, yielding a final dataset of 2,313 markers uniformly distributed across the genome. This filtering was done in order to avoid an overestimation of polymorphism and genetic distances among populations due to genomic introgressions from wild relatives. For this purpose a minimum genetic distance of 0.1 cM was chosen as a trade-off between the number of markers left for the analysis and the LD minimization. Genetic distances were based on the genetic maps of Sim et al. .
Genetic classification and sample filtering
Principal Component Analyses (PCA) were used to explore the patterns of genomic variation in the entire collection without considering the a priori classification based on passport data (i.e., species, location and use). A three level classification scheme, based on a series of hierarchical PCAs, was used to define genetic groups within species and genetic subgroups within genetic groups. PCAs were performed with the smartPCA application included in the Eigensoft 3.0 package [34,35]. This genetic classification was used in the subsequent analyses unless mentioned otherwise.
Pairwise genetic distances were computed among accessions within each group at each level of the hierarchical classification. Kosman and Leonard’s distance method  was used and a violin plot was produced for each hierarchy level using the R package ‘vioplot’ .
When an accession was genotyped more than once and both genotypes were inconsistent (e.g., both samples were classified in different subgroups in the PCA) all data for the accession was removed from the analysis (see Additional file 1: Table S1), unless it was clear based on the passport information, which genotype was correct (e.g., two entries from the same SLC accession collected in Peru, one grouping with other Peruvian accessions and another grouping with the mixture group). In total 8 genotypes out of the 1,008 were removed due to inconsistent data. We assume that these rare inconsistencies were related to uncontrolled cross pollinations or seed mixing during regeneration.
Genetic distances among samples of the same uniquely named accession were evaluated (see above) to check the reproducibility between genotyping datasets coming from different laboratories. For the genetic analyses, unless stated to the contrary, only one randomly chosen genotype representative of the uniquely named accessions was used.
Diversity and genetic differentiation
For polymorphic loci with a major allele frequency lower than 0.95 (P95), the expected (He) and observed (Ho) heterozygosity were calculated using custom scripts for each hierarchy of the genetic classification. Differentiation among genetic subgroups was explored by calculating differentiation index Dest  using custom scripts and Fst using Arlequin v. 188.8.131.52 . Only groups with at least 5 individuals were considered for genetic diversity estimates and mixture groups (SP mixture, SLC mixture and mixture) were not included in these analyses. Statistical significance of Dest and Fst was assessed after 1,000 permutations.
An unrooted network was built based on the genetic differentiation matrix using the Neighbor-net algorithm implemented in SplitsTree v.4.13.1 . Additionally, a neighbor-joining tree was created using the same distance matrix. Bootstrap values were obtained from 1,000 trees. The tree was built using functions included in PyCogent v. 1.5.3 library .
Allelic richness and private allelic richness (private alleles are defined as alleles found exclusively in a single population) were estimated using the rarefaction method implemented in the software ADZE . LD was calculated using TASSEL v.4.0 . Pairwise r 2 was obtained for all markers within each chromosome and data was fitted using a local polynomial regression fitting (LOESS)  implemented in R v. 3.0.1 . Rarefaction and LD analyses were performed using genetic groups defined by PCA and network analysis. These groups are defined as follows: SP, SLC Ecuador and Northern Peru, SLC non Andean, SLL vintage and SLL contemporary (split for some analyses into SLL processing and SLL fresh).
Isolation by distance
Correlations between genetic, geographic and climatic distances were analyzed to infer patterns of isolation by distance or the effect of ecological conditions on the genetic structure. Pairwise genetic distances between accessions were computed using Kosman and Leonard’s distance method . Pairwise geographic distances were calculated when GPS information was available using the haversine formula . Climatic data for accessions with GPS coordinates was obtained using the R package ‘raster’ . Current climatic data interpolated from 1950 to 2000 was obtained from worldclim (http://www.worldclim.org) at 30 arc-seconds resolution (approx. 1 km). A PCA was carried out with all the climatic information and the resulting scores were used to obtain the pairwise climatic distances based on a Euclidean metric. Significance of the correlations between distance matrices was assessed with a Mantel test based on 1,000 permutations implemented by the PyCogent Python library . A density plot for each distance comparison was created using the kde2d function in the R ‘MASS’ package .
A phylogenetic tree was built with SNAPP  to infer the evolutionary history of the tomato species in the Andean region encompassing Ecuador and Peru. SNAPP, which is part of the BEAST package , is a recently developed method that allows reconstructing the species tree from unlinked SNPs by using a finite-sites model likelihood algorithm within a Bayesian Markov chain Monte Carlo (MCMC). A MCMC chain was run for 2,000,000 steps with a sampling interval of 1,000 and a burn-in of 25%. Convergence of posterior and likelihood distributions, and number of estimated sample size for model parameters were assessed using Tracer v.1.5 . Due to the high computational demands of SNAPP, only one accession per genetic subgroup was used. For the same reason, not all genetic subgroups were considered; only SP and Peruvian, Ecuadorian and Mesoamerican SLC accessions were included. Three outgroup species were also included, namely S. galapagense, S. neorickii and S. chmielewski.
Fruit weight and shape genes genotyping
Fruit shape and size marker information
Primer sequence (5′ to 3′)
Wild-type allele size (bp)
Cultivated allele size (bp)
Chakrabarti et al. 
*Muños et al. 
Rodríguez et al. 
*Rodriguez et al. 
An additional 4.3-kb fragment
Xiao et al. 
All markers, except sun, were genotyped by amplification using standard PCR following previously published methods . PCR products were scored directly (fas) or after restriction enzyme-digestion (lc, ovate, fw2.2, fw3.2) by electrophoresis on 3% TBE (110 mM Tris, 90 mM boric acid, 2.5 mM EDTA) agarose gels. The sun duplication was scored as an RFLP using standard Southern blotting and hybridization protocols .
Genetic structure of the tomato accessions
Summary of genetic-based classification: observed and expected heterozygosity (H o , H e ), percentage of markers with a major frequency allele lower that 0.95 (P95) and number of individuals (N) for the species, groups and subgroups of the genetic-based classification (subgroups with less than 5 accessions are not listed)
SP Ecuador 1
SP Ecuador 2
SP Montane 1
SP Montane 2
SP Peru 1
SP Peru 2
SP Peru 3
SP Peru 4
SP Peru 7
SP Peru 8
SP Peru 9
SLC Ecuador 1
SLC Ecuador 2
SLC Ecuador 3
SLC Peru 1
SLC Peru 2
SLC Peru 3
SLC SP Peru
SLC SP Peru
SLC Costa Rica
SLL vintage 1
SLL early breed
SLL vintage 2
SLL fresh 1
SLL fresh 2
SLL processing 1
SLL processing 1 1
SLL processing 1 2
SLL processing 2
SLL processing 2
For SLC, the first two PCs explained 16.0% of the total variance and showed a clustering based on geography (Figure 2B; Additional file 3: Figure S1). The Ecuadorian and Peruvian SLC formed two non-overlapping clusters in the PCA representation and showed a higher genetic diversity compared to SP Ecuador and SP Montane (SLC Ecuador He = 0.19 and SLC Peru He = 0.18, Table 2). An SLC group which included accessions from all over the subtropical regions of the world was called SLC non-Andean, and was located between the two Andean clusters (Figure 2B). A distinct cluster named SLC-SP Peru was identified and composed of accessions from Southern Peru.
Similarly to SP, SLC accessions without a clear genetic clustering and without a common geographic origin were classified as SLC mixture. These mixture accessions were distributed between the Peruvian and Ecuadorian SLC clusters in the PCA (Figure 2B). In addition, closely related to the SLC non-Andean were seven accessions with no obvious relationship according to the passport data and were referred to as SLC 1.
The PCA for the SLL accessions showed that the first two PCs (13.6% of the total variance) separated five main genetic groups: vintage, fresh, processing 1, processing 2 and SLL 1 (Figure 2C). All SLL groups had low diversity (He = 0.06-0.10) compared with the Peruvian and Ecuadorian SLC (He = 0.188-0.177) (Table 2). The SLL vintage group was divided into subgroups that were differentiated using additional PCAs: Mesoamerica, vintage 1, vintage 2 and early breeding lines (Additional file 5: Figure S3A and B). The SLL fresh group was comprised of the subgroups fresh 1, fresh 2 and vintage/fresh (Figure 2C, Additional file 5: Figure S3C and D). The latter subgroup was named vintage/fresh because it included accessions classified as vintage as well as contemporary breeding fresh market accessions. The SLL fresh 1 was composed of Florida and North Carolina accessions while SLL fresh 2 consisted of accessions from New York (Additional file 1: Table S1). The SLL processing 1 group was subdivided into three groups, 1–1, 1–2 and 1–3. The latter group was comprised of a subset of accessions from the Ohio breeding germplasm whereas the remainder of the Ohio germplasm was found in the SLL processing 1–2 subgroup. The processing 1–1 included accessions from Oregon. The group SLL processing 2 was clearly separated from the other processing groups. This group was entirely composed of New York breeding materials which represent a predominately California genetic background with Phytophthora resistance introgressed from North Carolina fresh-market accessions. Finally, the SLL1 group was located between SLL processing 1 and SLL fresh in the PCA (Figure 2C) and was comprised by a mixture of accessions such as the plum tomatoes Rio Grande and NC EBR-6.
To determine the consistency of the structure obtained by PCA analyses, we compared the distribution of genetic distances within the following hierarchy levels: species, genetic group, genetic subgroup and samples of the same uniquely named accession (Additional file 7: Figure S5). As expected, the species showed the highest distances whereas the groups and subgroups showed progressively lower genetic distance values. All pairwise genetic differentiation among subgroups assessed by Fst and Dst were significant (p-value < 0.05) (data not shown). The distance among repeated samples of the uniquely named accessions was very low indicating a high consistency among genotyping experiments.
Comparison of the genetic and passport classifications
One of the other notable exceptions to the correspondence between genetic and passport classification was the subgroup comprised of accessions that were listed as SLL vintage, but instead were genetically classified as a SLC group closely related to SLC Ecuador. This cluster was classified as SLC vintage and consisted of genetically diverse germplasm that included accessions collected mostly at South American markets.
Isolation by distance and climate
Isolation by distance and climatic distance: correlation between climatic, geographic and genetic distances in SP, SLC Ecuador and Northern Peru and SLC non-Andean: number of accession ( n ), correlation coefficient ( r ) and p-value for Mantel test is shown
Climatic vs. genetic
Geographic vs. genetic
SLC Ecuador/N Peru
Diversity and heterozygosity
Expected heterozygosity (He) and observed heterozygosity (Ho) decreased in the succession from SP to SLC and SLL (Table 2 and Additional file 10: Figure S8). For SP, the SP Peru group retained the highest diversity followed by SP Montane and SP Ecuador. The Ecuadorian and Peruvian SLC (SLC Ecuador and SLC Peru) showed higher level of diversity (He = 0.19 and 0.18) compared to SP Ecuador and SP Montane. In contrast with the high diversity of the Ecuadorian and Northern Peruvian SLC, the other SLC subgroups exhibited low diversity, similar to that found in vintage SLL. For SLL a similarly low level of observed heterozygosity was typical for most subgroups. However, when combining the contemporary SLL subgroups (processing and fresh), slightly higher levels of diversity were found when compared to SLL vintage (He = 0.12 vs. 0.09), a situation that is likely due to the effect of introgression during breeding and differentiation into distinct market classes (Additional file 10: Figure S8).
LD was estimated between markers at different genetic distances from one another (Additional file 12: Figure S10). From highest to lowest degree of disequilibrium the groups were: fresh, processing, vintage, Andean SLC, non-Andean SLC and SP. These results suggest that LD affects estimates of allelic richness, especially when dealing with groups with different degrees of LD.
Origin and migration of the derived tomato fruit shape and weight alleles
Key questions regarding the evolutionary history of cultivated tomato include where and when the crop was domesticated and the position of SLC in this evolutionary process. In this study, we interrogated a selection of 2,313 SNPs from the SolCAP array in nearly 1,000 unique accessions comprised of SLL, SLC and the red-fruited wild relative SP. By combining accessions with robust passport data we were able to test hypotheses about the origin of cultivated tomato. Our results support the two-step domestication hypothesis proposed by Blanca et al. , and are in line with recently published work about the origin of tomato . As expected, genetic diversity was high in SP (Table 2), and genetic clusters were explained by geographic distances and climatic zones (Table 3 and Additional file 9: Figure S7). The higher number of SP accessions analyzed when compared with previous studies has allowed a more detailed definition of the SP populations, especially in Southern Peru, where sequential colonization could be proposed based on the PCA (Additional file 3: Figure S1) and the network analysis (Figure 5). SLC accessions from Ecuador and Peru also showed genetic structure that correlated with geography (Table 3 and Additional file 9: Figure S7). Genetic diversity was high in Ecuadorian and Peruvian SLC, but was reduced in SLC from Mesoamerica and elsewhere. Our results suggest that the major genetic bottleneck did not occur due to transport of SLL from Mesoamerica to Europe, but occurred earlier coinciding with the migration of SLC from Ecuador and Northern Peru to Mesoamerica (Table 2 and Additional file 9: Figure S7). In wild populations there is a strong correlation between geography, climate, and genetic distances (Table 3 and Additional file 9: Figure S7). These correlations do not occur in the non-Andean SLC and the SLL genetic subgroups, a situation that is common for plants associated with human activities, either cultivated or weedy, due to the movements of the seeds by the humans and to the artificial modifications or their environments [55,56].
The phylogenetic tree (Figure 6), neighbor network analyses (Figure 5) and the high number of private alleles (Figure 7) support the status of SP as the basal group of the red-fruited species of the Lycopersicon section. Our data also supports the view that the Northern Ecuadorian SP is the closest ancestor of SLC. Northern Ecuadorian SLC was likely to have originated from Ecuadorian SP, yet its high genetic diversity and its reticulate structure in the phylogenetic network suggests a complex history. The position of SG within SP contrasts with a recent study  in which SG was found to be closer to SLC than SP. However, firm conclusions about the position of Galapagos accessions will require further study, as both studies, Koenig’s et al. and the present one, are based on a few number of accessions and Koenig et al. lacked SP accessions from Ecuador.
The data suggest two possible scenarios for the origin of SLC. Ecuadorian SLC features twice the level of the genetic diversity as Ecuadorian SP (Table 1), thus it is not likely that SLC was simply derived from this SP subgroup, despite being very close phylogenetically. One hypothesis is that the subgroup named Peruvian SLC-SP represents the origin of SLC. This genetic subgroup is also close genetically to Ecuadorian SP (Figure 6). However the large geographic distance separating these subgroups challenges this scenario. It is possible that the Peruvian SLC-SP is the result of a secondary contact between SLC and SP. The second hypothesis is that ancestral populations of Northern Ecuadorian Coastal SP gave rise to SLC in Northern Ecuador across the Andes. Secondary gene flow between other SP populations, suggested by the reticulation of the phylogenetic network and the complex PCA structure, e.g. Montane SP from Southern Ecuador and Northern Peru may have enhanced diversity of the SLC. Alternatively, the sampling of Northern Ecuador SP may have been incomplete or an ancestral highly diverse population might have originated both Northern Ecuadorian SP and SLC. Ecologically, it is more plausible that SLC originated from Northern Ecuadorian SP. These Northern Ecuadorian SP accessions thrive in wet and forested areas of Coastal Ecuador, a climate closer to the wet environment on the Eastern side of the Andes where Ecuadorian SLC is found. In contrast, Peruvian SP is adapted to an arid climate. Climatic similarity of some SP and SLC populations may have facilitated gene flow due to animal or human movement, despite geographic differences. Previously, possible mechanisms for gene flow between SP and SLC have been proposed for this region [9,58].
The Mesoamerican SLL vintage subgroup appeared to be the most ancestral SLL according to the phylogenetic trees and the network. This SLL genetic subgroup was closely related to SLC Peru 2 in the phylogenetic network and tree. Thus, our data clearly support that SLC evolved into Mesoamerican SLL. According to the analyses with this dataset, all other SLL are monophyletic and all SLL groups originated from the SLL Mesoamerican accessions.
Proposed origin and domestication based on derived alleles for fruit weight and shape
The most ancestral SLC is found in Ecuador and Northern Peru and it is characterized by a high genetic diversity and morphological variability . It spans a wide range of domestication (from accessions collected in markets, and presumably cultivated at production scale, to weeds) and use (from human consumption to animal feed), which suggests a certain degree of selection for SLC. This finding is supported by the fact that the derived alleles of lc, fw3.2 and fw2.2 are already prevalent in the ancestral SLC accessions from Northern Peru and across Ecuador (Figure 8). The derived allele of lc and fw2.2 may have originated in the Ecuadorian SP and could represent the earliest known mutations to arise. However, this interpretation needs to be viewed with caution as only two SP accessions carrying a single derived allele of each locus were identified (Additional file 1: Table S1). Interestingly, the SLC vintage group that clusters closely with Ecuadorian SLC included accessions that were collected from markets and feature fruits that are large, ribbed and multi-loculed (Figure 3). The strongest selection may have taken place in this subgroup as all accessions carried the derived alleles for lc, fw2.2 and fw3.2, and half of them carried the derived allele of fas (Figure 3 and Additional file 1: Table S1). None of the other SLC subgroups were fixed for as many fruit weight and shape alleles as the SLC vintage category. Thus it appears that SLC was being cultivated and that selections for larger fruit were taking place (Figure 3 and Additional file 1: Table S1). SLC Mesoamerica carried derived and ancestral alleles for most of the fruit shape and weight loci, while SLC Asia and SLC Other were completely fixed for the derived allele of fw2.2 suggestive of selection for the SLC germplasm grown outside the Americas.
SLL arose in Mesoamerica as there is no evidence of the existence of ancestral SLL in South America. All SLL accessions sampled from South America were found to carry introgressions from wild relatives suggesting that they were derived from breeding efforts taking place in the last 100 years. Therefore, to complete the domestication of SLL, SLC would have had to migrate to Mesoamerica possibly as a semi-domesticated type. According to the network analysis, PCA results, and previous knowledge of species history two SLC migrations could be suggested. SLC could have migrated from Southern Ecuador to Colombia and Costa Rica arriving in Mesoamerica in a stepwise process (Figure 2D). However, a second possibility is also suggested by our results, SLC could have reached Mesoamerica from Northern Peru in one step. Fruit weight and shape allele distribution did not support one route of migration over the other. In any case, results from the gene diversity analysis suggest that the migration from the Ecuador or Northern Peruvian region to Mesoamerica led to a strong bottleneck which eventually resulted in reduced variation in Mesoamerican SLL, as described by Blanca et al. . The second phase of tomato domestication in Mesoamerica is suggested by the increase in derived allele frequency for fw3.2. Allele frequencies for fruit weight loci suggest that selection for fw2.2 and lc were important for the origin of SLC while fw3.2 was important for the origin of SLL. Our results agree with a recent study  based on 360 tomato genomes. They also find evidence for a two-step domestication, and identify new QTLs implicated in both steps of domestication and breeding.
The American origin of the first European tomato is confirmed by the genetic relationship between the Mesoamerican and vintage SLL subgroups (Figure 6). It is remarkable that the vintage SLL appears to have been derived exclusively from Mesoamerican germplasm. Although large fruited vintage SLC were found in South America, they did not appear to contribute to the germplasm that migrated to Europe and the rest of the world. It is not possible with the current data to know why the Ecuadorian and Peruvian SLC did not contribute to the Spanish vintage gene pool brought to Europe, despite being those regions also under the control of the spaniards, but we could propose that climatic similarity between Mexico and Spain could have played a role.
Contemporary tomato diversity
Since the introduction of the modern breeding in the 20th century, the pace of genetic change in SLL has accelerated. New germplasm has been created that, according to the PCA, network and population tree, differ substantially from the vintage accessions. These results are consistent with previous findings [19-22,31]. The contemporary tomatoes can be differentiated into four broad groups: fresh, processing 1, processing 2 and SLL1. This broad differentiation among the contemporary groups reflects independent breeding efforts and selection histories between the fresh and processing accessions. The further subdivision of the contemporary groups can be explained by geographic origin or founder effects in regional breeding programs. Similar results were previously reported by Sim et al. [21,31]. These subgroups differentiate accessions coming from the main public-sector breeding programs in North America. For processing they were historically carried out in California, the Midwest of the United States, the East Coast of the United States and Ontario, Canada. These programs commonly interchanged breeding materials, thus it is to be expected that the genetic groups mix those origins, albeit in different proportions . The neighbor network reticulation found in these subgroups is compatible with this history (Figure 5).
Contemporary tomatoes are the result of introgressing genes from wild species into SLL starting before 1920 . The PCA and rarefaction analyses (Figure 7) provided insight into the effect of these breeding practices on diversification and structure of the cultivated species, showing the existence the large introgressions as has also been described recently . When we consider all contemporary accessions as a single gene pool (e. g. both fresh and processing markets), they represent a slightly more diverse population than the vintage tomatoes (He 0.12 vs 0.09, respectively) (Additional file 11: Figure S9). However, the differential overestimation of the gene diversity estimates within SP, SLC and SLL depending on the number of markers used, should be taken into account in future studies, especially now that thousands of markers are routinely being used in genotyping by sequencing (GBS) and whole genome sequencing experiments. Contemporary breeding seems to have moderately increased the variation and diversity in cultivated tomato, although it remains low when compared with the most ancestral SLC subgroups. These Ecuadorian and Peruvian accessions may represent a pool of unexplored variation for future improvement.
The distribution of the fruit weight and shape alleles is skewed in processing and in fresh market classes of SLL, suggesting high selection pressures of shape loci for each market class. In contrast, the distribution of these alleles is more varied in the vintage group which is why this class is characteristically variable in shape and weight  while contemporary material is quite uniform . The observation of more phenotypic diversity for shape and weight, and more allelic diversity for these six loci in the vintage class appears to contrast with the observation of more genetic diversity in the contemporary germplasm. Contemporary market classes are bred for uniformity of shape and weight within a class, as reflected by the allele distribution for lc, fas, fw2.2 and fw3.2. At the same time there are numerous resistances that have been bred into germplasm, with any given accession having multiple introgressed alleles that are missing from the vintage class, hence increased genetic diversity instead of phenotypic diversity in the contemporary class.
This work represents an effort to show a comprehensive view of genomic variation in tomato and closely related species. We have analyzed and classified 1,008 tomato accessions, including the complete set of its closest wild relative, S. pimpinellifolium. The data are an excellent resource for evolutionary biologists and plant breeders. Our analysis support a two-step domestication as proposed by Blanca et al. ; a first domestication in South America and a second step in Mesoamerica,. The distribution of fruit weight and shape alleles also supports these two steps and shows that domestication of SLC occurred in the Andean region of Ecuador and Northern Peru. The definition and clarification of the biological status of SLC is also an important result of this work.
Availability of supporting data
All supporting data are included as supplementary files.
We are grateful to the gene banks for their collections that made this study possible. We thank Syngenta Seeds for providing genotyping data for 42 accessions. We would like to thank the Supercomputing and Bioinnovation Center (Universidad de Málaga, Spain) for providing computational resources to process the SNAPP phylogenetic tree. This research was supported in part by the USDA/NIFA funded SolCAP project under contract number to DF and USDA AFRI 2013-67013-21229 to EvdK and DF.
- Tanksley SD, McCouch SR. Seed banks and molecular maps: unlocking genetic potential from the wild. Science (80-). 1997;277:1063–6.View ArticleGoogle Scholar
- Doebley JF, Gaut BS, Smith BD. The molecular genetics of crop domestication. Cell. 2006;127:1309–21.View ArticlePubMedGoogle Scholar
- Gepts P. A comparison between crop domestication, classical plant breeding, and genetic engineering. Crop Sci. 2002;42:1780.View ArticleGoogle Scholar
- Weigel D, Nordborg M. Natural variation in Arabidopsis. How do we find the causal genes? Plant Physiol. 2005;138:567–8.View ArticlePubMed CentralPubMedGoogle Scholar
- Peralta IE, Spooner DM, Knapp S, Anderson C. Taxonomy of wild tomatoes and their relatives (Solanum sect. Lycopersicoides, sect. Juglandifolia, sect. Lycopersicon; Solanaceae). Syst Bot Monogr. 2008;84:1–186.Google Scholar
- Rick CM, Fobes JF. Allozyme variation in the cultivated tomato and closely related species. Bull Torrey Bot Club. 1975;102:376–84.View ArticleGoogle Scholar
- Zuriaga E, Blanca J, Nuez F. Classification and phylogenetic relationships in Solanum section Lycopersicon based on AFLP and two nuclear gene sequences. Genet Resour Crop Evol. 2008;56:663–78.View ArticleGoogle Scholar
- Zuriaga E, Blanca J, Cordero L, Sifres A, Blas-Cerdán WG, Morales R, et al. Genetic and bioclimatic variation in Solanum pimpinellifolium. Genet Resour Crop Evol. 2008;56:39–51.View ArticleGoogle Scholar
- Blanca J, Cañizares J, Cordero L, Pascual L, Diez MJ, Nuez F. Variation revealed by SNP genotyping and morphology provides insight into the origin of the tomato. PLoS One. 2012;7:e48198.View ArticlePubMed CentralPubMedGoogle Scholar
- Rick CM. Natural variability in wild species of Lycopersicon and its bearing on tomato breeding. Genet Agrar. 1976;30:249–59.Google Scholar
- Rick CM, Holle M. Andean Lycopersicon esculentum var. cerasiforme: genetic variation and its evolutionary significance. Econ Bot. 1990;44:69–78.View ArticleGoogle Scholar
- Nakazato T, Franklin RA, Kirk BC, Housworth EA. Population structure, demographic history, and evolutionary patterns of a green-fruited tomato, Solanum peruvianum (Solanaceae), revealed by spatial genetics analyses. Am J Bot. 2012;99:1207–16.View ArticlePubMedGoogle Scholar
- Rick CM, Butler L. Cytogenetics of the Tomato. Adv Genet. 1956;8:267–382. Advances in Genetics.View ArticleGoogle Scholar
- Jenkins JA. The origin of the cultivated tomato. Econ Bot. 1948;2:379–92.View ArticleGoogle Scholar
- Nesbitt TC, Tanksley SD. Comparative sequencing in the genus lycopersicon: implications for the evolution of fruit size in the domestication of cultivated tomatoes. Genetics. 2002;162:365–79.PubMed CentralPubMedGoogle Scholar
- Ranc N, Muños S, Santoni S, Causse M. A clarified position for Solanum lycopersicum var cerasiforme in the evolutionary history of tomatoes (solanaceae). BMC Plant Biol. 2008;8:130.View ArticlePubMed CentralPubMedGoogle Scholar
- De Candolle A. Origin of cultivated plants. 2nd ed. London: Trench, Paul; 1886.Google Scholar
- Miller JC, Tanksley SD. RFLP analysis of phylogenetic relationships and genetic variation in the genus Lycopersicon. Theor Appl Genet. 1990;80:437–48.PubMedGoogle Scholar
- Williams CE, Clair DAS. Phenetic relationships and levels of variability detected by restriction fragment length polymorphism and random amplified polymorphic DNA analysis of cultivated and wild accessions of Lycopersicon esculentum. Genome. 1993;36:619–30.View ArticlePubMedGoogle Scholar
- Park YH, West MAL, St Clair DA. Evaluation of AFLPs for germplasm fingerprinting and assessment of genetic diversity in cultivars of tomato (Lycopersicon esculentum L). Genome. 2004;47:510–8.View ArticlePubMedGoogle Scholar
- Sim S-C, Robbins MD, Van Deynze A, Michel AP, Francis DM. Population structure and genetic differentiation associated with breeding history and selection in tomato (Solanum lycopersicum L.). Heredity (Edinb). 2011;106:927–35.View ArticleGoogle Scholar
- Sim S-C, Robbins MD, Chilcott C, Zhu T, Francis DM. Oligonucleotide array discovery of polymorphisms in cultivated tomato (Solanum lycopersicum L) reveals patterns of SNP variation associated with breeding. BMC Genomics. 2009;10:466.View ArticlePubMed CentralPubMedGoogle Scholar
- Sim S-C, Durstewitz G, Plieske J, Wieseke R, Ganal MW, Van Deynze A, et al. Development of a large SNP genotyping array and generation of high-density genetic maps in tomato. PLoS One. 2012;7:e40563.View ArticlePubMed CentralPubMedGoogle Scholar
- Frary A, Nesbitt TC, Grandillo S, Knaap E, Cong B, Liu J, et al. fw2.2: a quantitative trait locus key to the evolution of tomato fruit size. Science. 2000;289:85–8.View ArticlePubMedGoogle Scholar
- Liu J, Van Eck J, Cong B, Tanksley SD. A new class of regulatory genes underlying the cause of pear-shaped tomato fruit. Proc Natl Acad Sci U S A. 2002;99:13302–6.View ArticlePubMed CentralPubMedGoogle Scholar
- Xiao H, Jiang N, Schaffner E, Stockinger EJ, van der Knaap E. A retrotransposon-mediated gene duplication underlies morphological variation of tomato fruit. Science. 2008;319:1527–30.View ArticlePubMedGoogle Scholar
- Cong B, Barrero LS, Tanksley SD. Regulatory change in YABBY-like transcription factor led to evolution of extreme fruit size during tomato domestication. Nat Genet. 2008;40:800–4.View ArticlePubMedGoogle Scholar
- Muños S, Ranc N, Botton E, Bérard A, Rolland S, Duffé P, et al. Increase in tomato locule number is controlled by two single-nucleotide polymorphisms located near WUSCHEL. Plant Physiol. 2011;156:2244–54.View ArticlePubMed CentralPubMedGoogle Scholar
- Chakrabarti M, Zhang N, Sauvage C, Muños S, Blanca J, Cañizares J, et al. A cytochrome P450 regulates a domestication trait in cultivated tomato. Proc Natl Acad Sci U S A. 2013;110:17125–30.View ArticlePubMed CentralPubMedGoogle Scholar
- Rodríguez GR, Muños S, Anderson C, Sim S-C, Michel A, Causse M, et al. Distribution of SUN, OVATE, LC, and FAS in the tomato germplasm and the relationship to fruit shape diversity. Plant Physiol. 2011;156:275–85.View ArticlePubMed CentralPubMedGoogle Scholar
- Sim S-C, Van Deynze A, Stoffel K, Douches DS, Zarka D, Ganal MW, et al. High-density SNP genotyping of tomato (Solanum lycopersicum L) reveals patterns of genetic variation due to breeding. PLoS One. 2012;7:e45520.View ArticlePubMed CentralPubMedGoogle Scholar
- Sauvage C, Segura V, Bauchet G, Stevens R, Thi Do P, Nikoloski Z, et al. Genome Wide Association in tomato reveals 44 candidate loci for fruit metabolic traits. Plant Physiol. 2014;165:1120–32.View ArticlePubMedGoogle Scholar
- Hamilton JP, Sim S-C, Stoffel K, Van Deynze A, Buell CR, Francis DM. Single nucleotide polymorphism discovery in cultivated tomato via sequencing by synthesis. Plant Genome J. 2012;5:17.View ArticleGoogle Scholar
- Patterson NJ, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190.View ArticlePubMed CentralPubMedGoogle Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–9.View ArticlePubMedGoogle Scholar
- Kosman E, Leonard KJ. Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Mol Ecol. 2005;14:415–24.View ArticlePubMedGoogle Scholar
- Adler D. vioplot: Violin plot. 2005.Google Scholar
- Jost L. Gst and its relatives do not measure differentiation. Mol Ecol. 2008;17:4015–26.View ArticlePubMedGoogle Scholar
- Excoffier L, Lischer H. Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour. 2010;10:564–7.View ArticlePubMedGoogle Scholar
- Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Ecol Evol. 2006;23:254–67.Google Scholar
- Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, Easton BC, et al. PyCogent: a toolkit for making sense from sequence. Genome Biol. 2007;8:R171.View ArticlePubMed CentralPubMedGoogle Scholar
- Szpiech ZA, Jakobsson M, Rosenberg NA. ADZE: a rarefaction approach for counting alleles private to combinations of populations. Bioinformatics. 2008;24:2498–504.View ArticlePubMed CentralPubMedGoogle Scholar
- Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics. 2007;23:2633–5.View ArticlePubMedGoogle Scholar
- Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc. 1979;74:829.View ArticleGoogle Scholar
- R Core Team. R: A Language and Environment for Statistical Computing. 2013.Google Scholar
- Sinnot RS. Virtues of the haversine. Sky Telesc. 1984;68:159.Google Scholar
- Hijmans RJ, Etten JV. raster: Geographic data analysis and Modeling. 2013.Google Scholar
- Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 2012;29:1917–32.View ArticlePubMed CentralPubMedGoogle Scholar
- Drummond AJ, Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007;7:214.View ArticlePubMed CentralPubMedGoogle Scholar
- Rambaut A. Tracer v.1.5. 2009.Google Scholar
- Huang Z, van der Knaap E. Tomato fruit weight 11.3 maps close to fasciated on the bottom of chromosome 11. Theor Appl Genet. 2011;123:465–74.View ArticlePubMedGoogle Scholar
- Guo M, Rupe MA, Dieter JA, Zou J, Spielbauer D, Duncan KE, et al. Cell Number Regulator1 affects plant and organ size in maize: implications for crop yield enhancement and heterosis. Plant Cell. 2010;22:1057–73.View ArticlePubMed CentralPubMedGoogle Scholar
- Sambrook J, Fritsch EF, Maniatis T. Molecular cloning. New York: Cold Spring Harbor Laboratory Press; 1989.Google Scholar
- Lin T, Zhu G, Zhang J, Xu X, Yu Q, Zheng Z, et al. Genomic analyses provide insights into the history of tomato breeding. Nat Genet. 2014;46:1220–6.View ArticlePubMedGoogle Scholar
- Platt A, Horton M, Huang YS, Li Y, Anastasio AE, Mulyati NW, et al. The scale of population structure in Arabidopsis thaliana. PLoS Genet. 2010;6:e1000843.View ArticlePubMed CentralPubMedGoogle Scholar
- Pressoir G, Berthaud J. Patterns of population structure in maize landraces from the Central Valleys of Oaxaca in Mexico. Heredity (Edinb). 2004;92:88–94.View ArticleGoogle Scholar
- Koenig D, Jiménez-Gómez JM, Kimura S, Fulop D, Chitwood DH, Headland LR, et al. Comparative transcriptomics reveals patterns of selection in domesticated and wild tomato. Proc Natl Acad Sci U S A. 2013;110:e2655–62.View ArticlePubMed CentralPubMedGoogle Scholar
- Nakazato T, Housworth EA. Spatial genetics of wild tomato species reveals roles of the Andean geography on demographic history. Am J Bot. 2011;98:88–98.View ArticlePubMedGoogle Scholar
- United States. Office of Experimental Stations. Experimental Station Recod, Volumen 39. Volume 39. Washington, DC, USA: United States. Office of Experimental Stations; 1918.Google Scholar
- Merk HL, Yames SC, Van Deynze A, Tong N, Menda N, Mueller LA, et al. Trait diversity and potential for selection indeces based on variation among regionally adapted processing tomato germplasm. J Am Soc Hortic Sci. 2012;137:427–37.Google Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.