Skip to main content

A high-continuity and annotated tomato reference genome



Genetic and functional genomics studies require a high-quality genome assembly. Tomato (Solanum lycopersicum), an important horticultural crop, is an ideal model species for the study of fruit development.


Here, we assembled an updated reference genome of S. lycopersicum cv. Heinz 1706 that was 799.09 Mb in length, containing 34,384 predicted protein-coding genes and 65.66% repetitive sequences. By comparing the genomes of S. lycopersicum and S. pimpinellifolium LA2093, we found a large number of genomic fragments probably associated with human selection, which may have had crucial roles in the domestication of tomato. We also used a recombinant inbred line (RIL) population to generate a high-density genetic map with high resolution and accuracy. Using these resources, we identified a number of candidate genes that were likely to be related to important agronomic traits in tomato.


Our results offer opportunities for understanding the evolution of the tomato genome and will facilitate the study of genetic mechanisms in tomato biology.

Peer Review reports


Tomato (Solanum lycopersicum) is an important model plant for scientific researches on fruit development and quality [1]. The tomato cultivation area has increased by ~1 million hectares over the past decade, and the yield has increased from 155 million tons to 181 million tons ( As a nutritious vegetable that contributes to the human diet, tomato is reported to contain more health-promoting compounds such as lycopene than some other popular fruits. These compounds lower risk of cancer and maintain human health [2]. Tomato was originally found mainly in the Andean mountains of South America. Its fruit weight and quality differ markedly among different horticultural groups, and wild tomatoes have smaller seeds and lower yields than cultivars.

A draft genome of the tomato cultivar Heinz 1706 produced using shotgun sequencing technology was released in 2012 [3] and widely used as a reference genome for scientific researches. However, the fragmented nature of this genome and the resulting incomplete gene models could hindered the discovery and functional analysis of important genes. The completeness, accuracy, and contiguity of genome assemblies depend mainly on sequencing technology and assembly strategy. In the current genomic era, single-molecule real-time (SMRT) sequencing technology and new assembly pipelines have remarkably improved the quality of genome assemblies such as those of rice [4], cucumber [5], and tomato [6]. Although these genome assemblies have accelerated some scientific researches, such as QTL mapping and transcriptome analysis, it is far from sufficient for understanding more comprehensive and personal precision breeding. After several updates of tomato genome, the SL4.0 tomato genome has reached chromosomal assembly, but its continuity and integrity remains to be further improved to facilitate the identification of large structural variations and gene mining.

Across over the past ~10,000 years, humans have selected consciously and unconsciously for beneficial traits that have made wild plants more suitable for human use. The impacts of human selection on crops have been recorded in their genomes and have a central role in crop improvement. The resulting “domestication syndrome” comprises a common suite of traits useful for human needs and plant survival, such as larger fruits or grains, better taste, bigger seeds, and more robust plants overall [7, 8]. In crop species, bigger seeds can accumulate adequate nutrition for germination and produce more vigorous seedlings, and effectively increase yield and utilization in grain crops [9]. Seed size, as a domestication trait, has been elucidated in several crops, and genes such as GS3 [10, 11], DEP1 [12], GW2 [13], GW8 [14], and qSW5 [15] in rice; HvYrg in barley [16], and TaGW2 in wheat [17] have been characterized. But small seeds could affect the fruit size and quality in orange [18] and watermelon [19]. Therefore, it is crucial to explore the balance between seed size and other traits in crops. In tomato, seeds are smaller in wild tomatoes than in cultivars. Sw4.1 is one of the major QTLs which encode an ABC transporter responsible for phenotypic variation in tomato seed size [20, 21].

In this study, we generated a highly continuous and complete genome sequence of Heinz 1706 (version SLT1.0) that contains many fewer gaps and unplaced contigs and demonstrates better assembly of repetitive regions. By comparing the genomes of S. lycopersicum and S. pimpinellifolium LA2093, we found a large number of genomic fragments that appear to be likely involved in domestication. We also used a RIL population, which was derived from a cross between the cultivated line OH88119 and the wild line PI128216, to identify candidate genes and loci that control several important agronomic traits. Our work offers new opportunities for understanding the evolutionary history of the tomato genome and the genetic mechanisms that underlie complex traits in tomato breeding.


High-quality genome assembly

We assembled a highly continuous and complete genome sequence of Heinz 1706 using an integrated genome sequencing approach that combined 131.78 Gb (168.52×) of SMRT data, 226.97 Gb (290.24×) of BioNano data, 140.52 Gb (179.70×) of Hi-C data, and 50.93 Gb (61.53×) of Illumina short-read data (Supplementary Table S1). The PacBio long reads with an N50 read length of 32.82 kb were assembled with CANU software [22], generating a 875.21-Mb genome with a contig N50 of 17.83 Mb (Table 1). To reduce fragmentation and order the contigs, BioNano data and Hi-C data were used to assist with scaffold construction using RefAligner and Assembler [23], HERA [24], and Juicer [25] software. A Hi-C-based physical heatmap comprising 12 groups was generated (Supplementary Fig. S1) and used to create 12 pseudo-chromosomes that anchor ~790.59 Mb of the genome and harbor 97.61% (33,562) of the predicted protein-coding genes. The genome assembly was polished with Illumina short reads for error homozygous SNPs or indels using Pilon software [26]. As a result, we generated a 799.09-Mb genome assembly, namely SLT1.0.

Table 1 Genome assembly and annotation of SLT1.0

The conserved genes from the Benchmarking Universal Single-Copy Orthologs (BUSCO) gene set [27] were used to gauge the accuracy and completeness of the SLT1.0 assembly. The results showed that the SLT1.0 assembly contained 97.70% complete genes and 0.30% fragmented genes. The value of the LTR Assembly Index (LAI) was 12.41, which was consistent with that of the previously released SL4.0 tomato reference genome (LAI 12.54). Whereas, the SLT1.0 genome had only 210 gaps, which is less than 286 and 22,700 in the SL4.0 and SL3.0 genomes, respectively (Table 1). More than 99.88% of the genome assembly had greater than one-fold coverage with Illumina short reads. All these evidences demonstrated the high continuity and completeness of the SLT1.0 genome assembly.

High-quality genome annotation

Except for ab initio prediction and protein-homology-based prediction, we also used transcriptome data, including the bulked RNA-seq data with a mapping rate of 99.73%, and previously-released RNA-seq data from various tissues [3] with a mapping rate of 97.97%, to facilitate gene annotation of the assembled genome. In total, we predicted 34,384 protein-coding genes with an average length of 3,766.53 bp and 6.55 exons per gene in the SLT1.0 genome (Table 1 and Supplementary Table S2). Gene completeness was estimated to be 98.20% based on the BUSCO gene set, higher than that (92.9%) in the SL4.0 genome [27], and the protein-coding genes were unevenly distributed along the chromosomes (Fig. 1). Comparative analysis showed that each gene in a gene set (in total 234 genes) in the SLT1.0 genome corresponded to more than two genes in the SL4.0 genome (Supplementary Table S3). Gene collinearity analysis identified 33 collinear gene blocks between the SLT1.0 and SL4.0 genomes, harboring 28,892 (84.03%) and 28,389 (83.30%) homologous genes, respectively (Fig. 2A, Supplementary Fig. S2). Some unplaced contigs in the SL4.0 genome were successfully assigned to chromosomes 6, 8, 9 and 11 in the SLT1.0 genome, which contained 100 genes in the SLT1.0 genome. These results highlight the high accuracy and completeness of the SLT1.0 genome assembly and gene models.

Fig. 1
figure 1

Genomic landscape and structural variants of S. lycopersicum cv. Heinz 1706. (i) Ideogram of the 12 chromosomes with scale in Mb, with centromere positions marked between gray and white. (ii) Gene density along each chromosome (number of genes per Mb). (iii) Repeat content along each chromosome (% nucleotides per Mb). (iv) Gypsy retrotransposons content (% nucleotides per Mb). (v) Copia retrotransposons content (% nucleotides per Mb). (vi) Bin marker content (% nucleotides per Mb)

Fig. 2
figure 2

Alignment between the Heinz 1706 SLT1.0 and SL4.0 genomes. A Genome collinearity analysis showed that four scaffolds from SL4.0 are placed on chromosomes of the SLT1.0 genome and that there is an inversion on chromosome 2. B The color intensity of the Hi-C heatmap represents the number of links between two 25-kb windows. The presence of an inversion is supported by high-density contacts indicated by two asterisks in the Hi-C heatmap generated from SL4.0 Hi-C reads (lower left), whereas no corresponding contact is found in the SLT1.0 Hi-C heatmap (upper right). C The inversion shown in red on chromosome 2. F1, R1, F2, and R2 are primers around the break points. The blots are cropped. D Seven Heinz 1706 individuals were identified, two of which (I, III) had inversions

A comprehensive analysis of the genome sequences identified 965 collinear chromosomal blocks between the SLT1.0 and SL4.0 genomes. These blocks contained 32,922 and 32,554 genes, accounting for 95.75% and 95.54% of the SLT1.0 and SL4.0 genomes, respectively. Genome alignment revealed that 97.52% of the SLT1.0 genome accounted for 99.58% of the SL4.0 genome, including 27,417 SNPs and 119,246 Indels. However, we detected a 2.76 Mb inversion from 39.17 to 41.93 Mb on chromosome 2 of the SLT1.0 genome (Fig. 2B). The continuous interaction signals on the Hi-C heatmap, as well as PCR and Sanger sequencing, showed that this region was not misassembled (Fig. 2B-C, Supplementary Fig. S3, Supplementary Table S4). This result indicated that heterozygous variation may exist in the previously reported Heinz 1706 accession.

Transposable element analysis

A total of 524.84 Mb of repetitive sequences were identified, accounting for 65.66% of the SLT1.0 genome assembly, which was similar to that reported in the SL4.0 genome (508.89 Mb, 65.03%) (Supplementary Table S5). Among these repetitive sequences, long terminal repeats (LTRs) were the predominant TE family, covering 50.25% (401.60 Mb) of the genome. Gypsy-type LTRs (344.52 Mb) were the most common subfamily and six times more abundant than Copia-type LTRs (57.09 Mb), slightly higher than the Gypsy-type LTRs (342.44 Mb) and Copia-type (56.60 Mb) in the SL4.0 genome. We used a combination of methods, including LTR-FINDER [28], LTR-Harvest [29], and LTR-Retriever [30], to identify intact LTRs. A total of 3,220 LTRs were detected in the SLT1.0 genome assembly, including 1,553 Gypsy-type LTRs and 1,346 Copia-type LTRs. The estimated insertion time of the LTR retrotransposons showed that Gypsy and Copia-type LTRs had a recent and similar burst 0.60-1.00 million years ago (Mya) (Fig. 3A), and were enriched far from coding genes (Fig. 3B). These results indicated that the burst of Gypsy-type LTRs may be the major driving force for the expansion of the tomato genome.

Fig. 3
figure 3

Repetitive sequence analysis. A The estimated insertion time of LTR retrotransposons, showing the statistics of Gypsy-type LTRs in red and Copia-type LTRs in blue. B Frequencies of transposable elements (TE) in the vicinity of genes. Gypsy and Copia had the highest frequencies in the intergenic region and the lowest frequencies near the gene regions. LINE, TcMar-Stowaway, PIF-Harbinger, and MULE-MuDR exhibited the opposite distribution pattern. C The top 12 TE subfamilies with the longest length, including 11 Gypsy and one Unknown-type subfamily. D The frequency distribution of the Unknown-type rnd-1_family-4 subfamily, showing that it was enriched towards the centromere of chromosome 1

To identify the centromere regions, we detected the top 12 TE subfamilies, including 11 Gypsy and one unknown-type subfamilies, which together comprised over 15.47% of the genome (Fig. 3C). The density of these TE subfamilies along all the chromosomes showed that only the Unknown-type rnd-1_family-4 subfamily (1.65% of the genome) was enriched near centromeres but absent from the rest of the genome (Fig. 3D, Supplementary Fig. S4). In addition, we found that 65.21% of the unanchored Contig/Scaffold sequence length comprised highly repetitive regions. Overall, we predicted 12 potential centromeric regions ranging from 1.90 to 6.90 Mb on the 12 chromosomes.

Comparison of the SLT1.0 and S. pimpinellifolium LA2093 genomes

Structural variations (SVs) between wild and cultivated species can cause many phenotypic differences in domestication traits such as fruit weight and quality [31]. Based on protein homologies between the SLT1.0 and LA2093 genomes, we found that 23,544 genes (68.47%) in the SLT1.0 genome had one-to-one collinear relationships with 23,474 genes (65.64%) in the LA2093 genome (Fig. 4A). In addition, genome collinearity analysis showed that syntenic genomic blocks occupied 95.63% of the SLT1.0 genome and 96.67% of the LA2093 genome, respectively. We also identified 6,647 SVs (more than 1 kb in length) between the SLT1.0 and LA2093 genomes, including 3,054 (45.95%) SVs in 2,862 genes (Fig. 4B). GO analysis showed that these genes were significantly enriched in the function of oxidation-reduction process, photosynthetic electron transport chain and proton-transporting ATP synthase complex (Supplementary Fig. S5). We also identified 4,493,889 SNPs and 2,459,597 indels between the two genomes (Fig. 4B), including 418,844 SNPs and 245,310 indels located in 29,862 genes. We noted that 45,229 nonsynonymous SNPs resided in 18,178 genes and 9,148 frameshift indels in 1,559 genes, including 7,788 located in domestication regions [32]. They were significantly enriched in macromolecular complex, pigment metabolic process, nutrient reservoir activity, and intracellular organelle parts (Fig. 4C), suggesting these genes may have contributed to disease resistance and fruit traits during tomato domestication.

Fig. 4
figure 4

Alignment between the SLT1.0 and S. pimpinellifolium LA2093 genomes. A Gene colinearity of the SLT1.0 and LA2093 genomes. The red bar represents the SLT1.0 chromosome, the blue bar represents the LA2093 chromosome, and the gray lines represent the collinear regions. B Numbers of SNPs, indels, and structural variation in CDS and intergenic regions. The numbers of SNPs with nonsynonymous mutations (large-effect), SNPs with synonymous mutations (small-effect), and SNPs in intergenic regions, as well as the number of non-triple (large-effect) indels, triple (small-effect) indels, and indels in intergenic regions were further shown. C GO terms enriched in genes affected by SNPs and indels selected during domestication

High-density genetic map construction

A high-quality tomato reference genome can provide new insights into the genetic basis of important agronomic traits. Here, we constructed a RIL7 population of 247 progenies derived from a cross between the cultivated line OH88119 (64.27×) and the wild line PI128216 (42.35×). Resequencing of these progenies generated 1.7 Tb of data with an average depth of 6.91× and 97.98% coverage of the SLT1.0 genome. After aligning the reads to the genome, we identified 4,739,716 SNPs between the parental lines, 2,818,901 (59.35%) of which were genotyped in the RIL7 population. To construct a high-density tomato genetic map, we employed a Hidden Markov Model (HMM) to infer all recombination events in the RIL population. A total of 17,726 recombination events were detected across the whole genome, with one recombination event at an average interval of 3.78 kb (Fig. 5A). We found that 1,477 crossovers per chromosome ranged from 1.01 kb to 7.92 Mb in size and that the recombination rate varied across the genome with an average of 11.22 cM/Mb (Fig. 5B, Supplementary Fig. S6). Furthermore, recombination rate increased with distance from the centromeric regions on chromosomes 5 and 9, but this trend was not evident on the other chromosomes (Fig. 1).

Fig. 5
figure 5

Genetic map construction and QTL mapping of the RIL population. A Graphical representation of the re-sequencing-based mapping results of 247 RILs sorted according to the SLT1.0 genomic location. B High-density genetic map constructed by RIL population. C Mapping of QTLs across the entire RIL population for seed length, D stem trichomes density E and leaf type traits. F Seed-size-related candidate genes and non-synonymous mutations between the cultivated line OH88119 and the wild line PI128216.

We defined 35,836 raw bins (an average physical length of 22.06 kb) across the entire RIL population and constructed a high-quality set of 17,741 bin markers with a physical length of greater than 1 kb and a minor allele frequency of greater than 0.05. Based on recombination rate, we created a genetic linkage map of 13,973 unique bin markers anchoring twelve linkage groups that corresponded to the twelve chromosomes (Fig. 5B). This genetic map spanned 786.81 Mb, representing more than 98.46% of the assembled SLT1.0 reference genome. Taken together, these analyses produced a high-quality genetic map for genetic research on important agronomic traits in tomato.

QTL mapping of important agronomic traits in tomato

The RIL population exhibited diversity for several important agronomic traits. The cultivated line OH88119 has a larger fruit and seed size, both of which are key traits associated with human selection in tomato. To gauge the accuracy of the genetic bin map, we performed an association study on three agronomic traits (seed size, leaf architecture, and trichomes) and identified five significantly associated QTLs in the SLT1.0 genome (Fig. 5C-E). Plant glandular trichomes can produce secondary metabolites that defend against herbivores and pathogens [33, 34]. Previous studies have shown that a 618-bp H gene encoding a C2H2 zinc finger protein regulates the formation of multicellular trichomes in tomato leaves [35]. Here, the leaf trichome locus was localized to a smaller interval of 2.82 kb that included two bins on chromosome 10 (Fig. 5D), suggesting the high quality of the genetic bin map.

The seed is an important plant reproductive organ, but larger seeds affect fruit taste, reducing the economic value of the berry crop. We used the genetic bin map to identify two seed size QTLs above the LOD threshold associated with seed length (Fig. 5C). These major-effect QTLs were resolved into a 1.32-Mb region with 40 bins ranging from 5.75 to 7.07 Mb on chromosome 4 (sl4.1) and a 0.74-Mb region with 36 bins ranging from 4.08 to 4.82 on chromosome 11 (sl11.1). Intriguingly, we found that sl11.1 on chromosome 11 was located in domestication sweep DS169 [32], indicating that it may have played an essential role during human selection. We identified 11,709 parental SNPs in this region, including 303 nonsynonymous variations in 87 genes. Among these SNPs, 132 nonsynonymous variations in 65 genes were also present in the RIL population. After exploring their functions, we identified six candidate genes for seed length, including a TIFY domain protein, a lipid droplet-associated protein (LDAP)-interacting protein (LDIP), and three small auxin-up RNA (SAUR) proteins (Fig. 5F). In particular, one gene encoding an LDIP, SlT11G005670, appeared to be a strong candidate for seed length. Based on homology comparison, we found that the LDIP gene influences lipid droplet size and neutral lipid homeostasis in Arabidopsis seeds [36].


A highly contiguous and complete genome assembly is a powerful tool for molecular genetic studies of agronomic traits in tomato. Since the tomato cultivar Heinz 1706 is considered as a reference genome for scientific researches [3], its precise and complete genome has be updated using the advanced assembly and annotation technology. However, the continuity of genome assembly is still a challenge to the tomato genome. In this study, we combined PacBio, BioNano, and Hi-C data to produce the high-quality and contiguous SLT1.0 tomato genome. The 799.09-Mb assembly had an N50 of 17.83 Mb, and more than 98.94% of its sequences were anchored to 12 chromosomes. The SLT1.0 genome had more repeats sorted and anchored to chromosomes than the previously released SL4.0 genome. In the SLT1.0 genome, the number and length of LTRs, mainly Gypsy-type LTRs and Copia-type LTRs, were longer than the SL4.0 genome, indicating a more accurate assembly of the SLT1.0 genome. Analysis of repeat subfamilies showed that a specific subfamily, rnd-1_family-4, was found in centromeric regions of the SLT1.0 genome, whereas a similar reliable repeat family was not detected in the SL4.0 genome [6]. Therefore, we could determine the centromere regions in the SLT1.0 genome. Comparative genome analysis revealed that a 2.76-Mb inversion was present on chromosome 2 in SLT1.0 relative to SL4.0 (Fig. 2). The inversion was validated by Sanger sequencing and contained no functional genes in adjacent breakpoints, suggesting it is a continuous fragment that has no effect on the SLT1.0 genome. However, we must be cautious and further verify these different fragments between the SLT1.0 and SL4.0 genomes.

Because of the small size and poor taste of wild species, fruit mass and quality are important domestication traits in tomato. Human domestication and improvement have increased the size of modern tomato fruits about 100 times relative to their ancestors [32]. At the same time, the flavor and resistance of tomatoes have been greatly reduced [37]. Comparative genomic analysis of the SLT1.0 and S. pimpinellifolium LA2093 genomes found that genes located in the domesticated regions had large effects including non-synonymous and frameshift mutations. These genes are mainly involved in the macromolecular complex and pigment metabolic process, which could be possibly selected by humans for a long time. Compared with SNPs and small indels, structural variations had high levels of heritability in plants [38]. We identified structural variations in some important genes that affect tomato fruit weight, such as SlT11G021690 (oxidoreductase activity). This result suggests that structural variations in breeds were more likely to have been fixed in cultivars during the process of human selection. Furthermore, these structural variations may be used as potential targets for future breeding programs to improve fruit mass and quality.

In the process of crop domestication, human beings have paid particular attention to yield, shelf-life, and resistance to biotic stresses [7, 39], especially seed and fruit development, but human have different perference for different species. It seems that people are inclined to select larger seeds to improve the emergence rate, yield, quality, and other important traits, especially in edible seed crops such as rapeseed [40] and bean [41]. However, size and number of seed affect the taste and reduce the quality of edible fruits. Meanwhile, small seeds will affect the fruit flavor and germination rate, the balance between seed size and fruit flavor is particularly important in some species such as orange [18] and watermelon [19]. Here, QTL analysis of 247 tomato RIL populations indicated that the SlT11G005670 gene may have influenced tomato seed length during domestication. By extracting the SNP sites of the parents, we found that the gene contained a nonsynonymous mutation in the cultivated tomato line. We also found that LDIP genes from other species have high homology with SlT11G005670 affecting lipid droplet size and neutral lipid homeostasis in Arabidopsis thaliana [36]. These findings have potential contributions to future tomato breeding and fruit quality improvement. However, more evidence are required to understand the potential molecular mechanisms by which this gene controls seed length.


Overall, we produced a high-quality tomato genome that will facilitate the molecular dissection of important agronomic traits in tomato. We generated a high-density genetic map and detected five QTLs related to seed and leaf traits. We also identified six candidate genes in two genomic regions that appear to control differences in seed length. This high-quality genome and high-density genetic map will be powerful tools for tomato breeding and can deepen our understanding of tomato biology.


Plant materials and sequencing

Plants were grown in the greenhouse in China Agricultural University in Beijing, with a 16 h light/ 8 h dark cycle. The cultivated line OH88119 was crossed with the wild line PI128216 to create the F1 progeny, and RIL population was developed through single-seed-descent. Fresh leaf tissues were collected from each line of RIL population and resequenced on the Illumina NovaSeq 6000 platform. A PacBio SMRT library of Solanum lycopersicum accession Heinz 1706 was constructed and sequenced on the PacBio Sequel II platform. The Hi-C libraries were prepared following the Proximo Hi-C plant protocol with HindIII as the restriction enzyme for chromatin digestion. The Hi-C libraries were sequenced on the Illumina NovaSeq platform with a read length of 150 bp. For optical mapping, high-molecular-weight DNA was isolated and labeled using a Bionano Saphyr System. The RNA of mixed tissues including root, stem, leaf, flower, and fruit, were extracted using Promega RNG extraction kit (Promega Biotech, Beijing, China), and sequenced.

De novo genome assembly

The raw SLT1.0 SMRT reads were corrected and assembled into sequence contigs using CANU with default parameters. The contigs were used for HERA assembly with the corrected SMRT reads. To identify sequence overlaps, all contigs and corrected reads were aligned all-against-all using Minimap2 [42] and BWA [43] with default parameters. The HERA-assembled super-contigs were combined with BioNano genome maps to generate hybrid maps using IrysView software (BioNano Genomics) with a minimum length of 150 kb. The resulting contigs were further clustered basing on the Hi-C data using 3D-DNA software [44] with the default parameters. Pilon [26] was used for further error correction with three rounds of polishing.

Repeat analysis and gene annotation

The integrity of the final genome assembly was assessed in conjunction with BUSCO (v4.1.4) [27] using Benchmarking Universal Single-Copy Orthologs. A combination of de novo and homology-based methods was used to identify interspersed transposable elements (TEs). A de novo repeat library was built using RepeatModeler (v2.0.1) [45] and LTR_retriever (v2.9.0), [30]. Both the de novo library and RepBaseRepeatMaskerEdition-20181026, which is the most commonly used repetitive DNA element database, were used to identify TEs with RepeatMasker (v4.1.0) [46].

The RNA-Seq reads from this study were used to predict protein-coding genes in the repeat-masked SLT1.0 genome [3]. The cleaned high-quality RNA-Seq reads were aligned to the assembled genome using HISAT2 [47] with default parameters, and the read alignments were assembled into transcripts using StringTie [48]. The complete coding sequences (CDS) were predicted from the assembled transcripts by the PASA pipeline [49]. The BRAKER [50], GeneMark-ET [51], and SNAP [52] softwares were performed on ab initio gene predictions. Finally, high-confidence gene models were predicted by integrating ab initio predictions, transcript mapping, and protein homology evidence with the MAKER pipeline [53].

Genome comparisons and SV identification

Genome comparisons between SLT1.0 and SL4.0 and between SLT1.0 and LA2093 were performed via whole-genome alignment using the MUMmer package (v3.23) [54]. The one-to-one alignment blocks were identified using delta-filter program. Then the show-snp tools were used to identify SNPs and indels using uniquely aligned fragments, and the show-diff tool statistics were used to screen for structural variations over 1 kb in length. The SnpEff [55] software was used to analyze the various SNPs and indel types on the chromosomes.

Genetic map construction and QTL analysis

We identified SNPs across the 247 F7 RILs and the two parents using BWA (v0.7.10) [43] and samtools (v0.1.19) [56] softwares. The high-quality SNPs were called by bcftools (v1.10.2) [57], and SNPs were further filtered to retain only those with different homozygous genotypes in both parents, of which the quality ≥ 30, MQ ≥ 30, 2 ≤ AF1 ≤200. We generated a genotype matrix from the 247 RILs , and genetic distances were calculated using MSTMap [58]. The resulting bin marker data were imported into MG2C (v2.0) ( to construct the genetic map.

QTL analysis was performed using trichomes, leaf type, and seed size phenotype data from 247 RIL population samples. For the trichomes phenotype, we divided it into five grades based on its density and length, and tomato leaf types are classified into four types according to the shapes. These two phenotypes in RIL population were independently estimated by three persons and the final phenotypes were taken from the average of them. To investigate seed size traits, we measured seed length with three biological replicates. The maximum likelihood estimation method was used to calculate the recombination rate and LOD values between bin markers. The bin markers with LOD value greater than 3.0 were selected as QTLs. The candidate genes were identified based on non-synonymous SNP mutations in both parents and their homologs in other species. We determined the function of these gene based on the homologs genes of the NCBI database (

Availability of data and materials

All the raw sequencing data for genome assembly and annotation have been deposited into the Genome Sequence Archive (GSA) database (http://bigd.big. in BIG Data Center under Accession Number PRJCA004585. Information for the assembled genome SLT1.0 was deposited both into the Genome Warehouse (GWH) database ( in the BIG Data Center under Accession Number GWHBAUD00000000, and the NCBI database under Accession Number PRJNA757583.


  1. Meissner R, Jacobson Y, Melamed S, Levyatuv S, Shalev G, Ashri A, et al. A new model system for tomato genetics. Plant J. 1997;12(6):1465–72.

    Article  CAS  Google Scholar 

  2. Giovannucci E. Tomatoes, tomato-based products, lycopene, and cancer: Review of the epidemiologic literature. JNCI-J Natl Cancer Inst. 1999;91(4):317–31.

    Article  CAS  Google Scholar 

  3. The Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012;485(7400):635–41.

    Article  Google Scholar 

  4. Du H, Ying Y, Ma Y, Qiang G, Liang C. Sequencing and de novo assembly of a near complete indica rice genome. Nat Commun. 2017;8(1):15324.

  5. Li Q, Li H, Huang W, Xu Y, Zhou Q, Wang S, et al. A chromosome-scale genome assembly of cucumber (Cucumis sativus L.). GigaScience. 2019;8(6):6.

  6. Hosmani PS, Flores Gonzalez M, van de Geest H, Maumus F, Bakker LV, Schijlen E, et al. An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps. bioRxiv. 2019:767764.

  7. Doebley JF, Gaut BS, Smith BD. The molecular genetics of crop domestication. Cell. 2006;127(7):1309–21.

    Article  CAS  PubMed  Google Scholar 

  8. Chen Y, Song W, Xie X, Wang Z, Guan P, Peng H, et al. A collinearity-incorporating homology inference strategy for connecting emerging assemblies in the triticeae tribe as a pilot practice in the plant pangenomic era. Mol Plant. 2020;13(12):1694–708.

    Article  CAS  PubMed  Google Scholar 

  9. Galindez G, Ortega Baes P, Seal CE, Daws MI, Scopel AL, Pritchard HW. Physical seed dormancy in Collaea argentina (Fabaceae) and Abutilon pauciflorum (Malvaceae) after 4 years storage. Seed Sci Technol. 2010;38(3):777–82.

    Article  Google Scholar 

  10. Fan CH, Xing YZ, Mao HL, Lu TT, Han B, Xu CG, et al. GS3, a major QTL for grain length and weight and minor QTL for grain width and thickness in rice, encodes a putative transmembrane protein. Theor Appl Genet. 2006;112(6):1164–71.

    Article  CAS  PubMed  Google Scholar 

  11. Mao H, Sun S, Yao J, Wang C, Yu S, Xu C, et al. Linking differential domain functions of the GS3 protein to natural variation of grain size in rice. Proc Natl Acad Sci U S A. 2010;107(45):19579–84.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Huang X, Qian Q, Liu Z, Sun H, He S, Luo D, et al. Natural variation at the DEP1 locus enhances grain yield in rice. Nat Genet. 2009;41(4):494–7.

    Article  CAS  PubMed  Google Scholar 

  13. Song XJ, Huang W, Shi M, Zhu MZ, Lin HX. A QTL for rice grain width and weight encodes a previously unknown RING-type E3 ubiquitin ligase. Nat Genet. 2007;39(5):623–30.

    Article  CAS  PubMed  Google Scholar 

  14. Wang S, Wu K, Yuan Q, Liu X, Liu Z, Lin X, et al. Control of grain size, shape and quality by OsSPL16 in rice. Nat Genet. 2012;44(8):950–4.

    Article  CAS  PubMed  Google Scholar 

  15. Shomura A, Izawa T, Ebana K, Ebitani T, Kanegae H, Konishi S, et al. Deletion in a gene associated with grain size increased yields during rice domestication. Nat Genet. 2008;40(8):1023–8.

    Article  CAS  PubMed  Google Scholar 

  16. Zombori Z, Nagy B, Mihaly R, Pauk J, Cseri A, Sass L, et al. RING-Type E3 Ubiqitin ligase barley genes (HvYrg1-2) control characteristics of both vegetative organs and deeds as yield components. Plants Basel. 2020;9(12):1693.

    Article  CAS  PubMed Central  Google Scholar 

  17. Su Z, Hao C, Wang L, Dong Y, Zhang X. Identification and development of a functional marker of TaGW2 associated with grain weight in bread wheat (Triticum aestivum L.). Theor Appl Genet. 2011;122(1):211–23.

    Article  CAS  PubMed  Google Scholar 

  18. Koltunow AM, Brennan P, Bond JE, Barker SJ. Evaluation of genes to reduce seed size in shape Arabidopsis and tobacco and their application to shape Citrus. Mol Breed. 1998;4(3):235–51.

    Article  CAS  Google Scholar 

  19. Yongjae K, Yang TJ, Younghoon P, Yongjik L, Kang SC, Yongkwon K, Jeounglai C. Development of near isogenic lines with various seed sizes and study on seed size-related characteristics in watermelon. Korean J Breed Sci. 2009;(41):403–11.

  20. Doganlar S, Frary A, Tanksley SD. The genetic basis of seed-weight variation: tomato as a model system. Theor Appl Genet. 2000;100(8):1267–73.

    Article  CAS  Google Scholar 

  21. Orsi CH, Tanksley SD. Natural variation in an ABC transporter gene associated with seed size evolution in tomato species. PLoS Genet. 2009;5(1):e1000347.

  22. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27(5):722–36.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Shelton JM, Coleman MC, Hemdon N, Lu N, Lam ET, Anantharaman T, et al. Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool. BMC Genomics. 2015;16:734.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Du H, Liang C. Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads. Nat Commun. 2019;10(1):5360.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Durand NC, Shamim MS, Machol I, Rao SSP, Huntley MH, Lander ES, et al. Juicer orovides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3(1):95–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9(11):e112963.

  27. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.

    Article  CAS  PubMed  Google Scholar 

  28. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35:265–8.

    Article  Google Scholar 

  29. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9(1):18.

    Article  PubMed  PubMed Central  Google Scholar 

  30. Ou S, Jiang N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 2018;176(2):1410–22.

    Article  CAS  PubMed  Google Scholar 

  31. Jin L, Zhao L, Wang Y, Zhou R, Song L, Xu L, et al. Genetic diversity of 324 cultivated tomato germplasm resources using agronomic traits and InDel markers. Euphytica. 2019;215(4):69.

    Article  Google Scholar 

  32. Lin T, Zhu G, Zhang J, Xu X, Yu Q, Zheng Z, et al. Genomic analyses provide insights into the history of tomato breeding. Nat Genet. 2014;46(11):1220–6.

    Article  CAS  PubMed  Google Scholar 

  33. Aharoni A, Jongsma MA, Kim TY, Ri MB, Giri AP, Verstappen FWA, et al. Metabolic engineering of terpenoid biosynthesis in plants. Phytochem Rev. 2006;5(1):49–58.

    Article  CAS  Google Scholar 

  34. Kang JH, McRoberts J, Shi F, Moreno JE, Jones AD, Howe GA. The flavonoid biosynthetic enzyme chalcone isomerase modulates terpenoid production in glandular trichomes of tomato. Plant Physiol. 2014;164(3):1161–74.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Chang J, Yu T, Yang Q, Li C, Xiong C, Gao S, et al. Hair, encoding a single C2H2 zinc-finger protein, regulates multicellular trichome formation in tomato. Plant J. 2018;96(1):90–102.

    Article  CAS  PubMed  Google Scholar 

  36. Pyc M, Cai Y, Gidda SK, Yurchenko O, Park S, Kretschmar FK, et al. Arabidopsis LDAP-interacting protein (LDIP) influences lipid droplet size and neutral lipid homeostasis in both leaves and seeds. Plant J. 2017;92(6):1182–201.

    Article  CAS  PubMed  Google Scholar 

  37. Wang X, Gao L, Jiao C, Stravoravdis S, Hosmani PS, Saha S, et al. Genome of Solanum pimpinellifolium provides insights into structural variants during tomato breeding. Nat Commun. 2020;11(1):5817.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Morgante M. Structural variation and the plant pan genomes. Embnet J. 2013;19(A):11.

  39. Chen Q, Li W, Tan L, Tian F. Harnessing knowledge from maize and rice domestication for new crop breeding. Mol Plant. 2021;14(1):9–26.

    Article  CAS  PubMed  Google Scholar 

  40. Ahmed SU, Zuberi MI. Effects of seed size on yield and some of its components in rapeseed, Brassica Campestris L. Var Toria Crop Sci. 1973;13(1):119–20.

    Article  Google Scholar 

  41. Nakamura RR. seed abortion and seed size variation within fruits of phaseolus vulgaris: pollen donor and resource limitation effects. Am J Bot. 1988;75(7):1003–10.

    Article  Google Scholar 

  42. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;18:18.

    Google Scholar 

  43. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Dudchenko O, Batra SS, Omer AD, Nyquist SK, Hoeger M, Durand NC, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017;356(6333):92–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Bao Z. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002;12(8):1269–76.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Graovac MT, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;25(1):4.10.11-14.10.14.

  47. Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907–15.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31(19):5654–66.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Hoff KJ, Lomsadze A, Borodovsky M, Stanke M. Whole-Genome Annotation with BRAKER. In: Kollmar M, editor. Gene Prediction: Methods and Protocols., vol. 1962; 2019. p. 65–95.

    Google Scholar 

  51. Alexandre L, Burns PD, Mark B. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res. 2014;15:119.

    Google Scholar 

  52. Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5:9.

    Article  Google Scholar 

  53. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18(1):188–96.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12.

    Article  PubMed  PubMed Central  Google Scholar 

  55. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly. 2012;6(2):80–92.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  56. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

    Article  PubMed  PubMed Central  Google Scholar 

  57. Vagheesh N, Petr D, Aylwyn S, Xue Y, Chris TS, Richard D. BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics. 2016;11:1749–51.

    Google Scholar 

  58. Wu Y, Bhat PR, Close TJ, Lonardi S, Kruglyak L. Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS Genet. 2008;4(10):1000212.

    Article  Google Scholar 

Download references


Not applicable.


This work was supported by the National Key Research and Development Program of China (2019YFD1000300), the National Natural Science Foundation of China (32072571), the 111 Project (B17043), and the Construction of Beijing Science and Technology Innovation and Service Capacity in Top Subjects (grant CEFF-PXM2019_014207_000032).

Author information

Authors and Affiliations



T. L., Y. Z. and W. Y. conceived and designed the project. G. M. collected the plant materials. X. S., B. W., Y. D. and Q. Y. performed all the data analysis under the supervision of T. L.. X. G and B. L. designed and conducted the experiments. S. H. supervised the project and Q. G. modified the article. All authors contributed and approved the final paper.

Corresponding authors

Correspondence to Wencai Yang, Yingfang Zhu or Tao Lin.

Ethics declarations

Ethics approval and consent to participate

With the permission to collect, all materials of Solanum lycopersicum accessions were identified and collected from China, and now deposited at China Agricultural University in Beijing, China. The study comply with relevant institutional, national, and international guidelines and legislation. This research did not involve any human subjects, human material, or human data. Solanum lycopersicum in current research did not belong to the endangered or protected species.

Consent for publication

Not applicable.

Competing interests

We certify that there is no conflict of interest with any financial organization regarding the material discussed in the manuscript.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Su, X., Wang, B., Geng, X. et al. A high-continuity and annotated tomato reference genome. BMC Genomics 22, 898 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: