De novo assembly and characterization of the carrot transcriptome reveals novel genes, new markers, and genetic diversity
© Iorizzo et al; licensee BioMed Central Ltd. 2011
Received: 14 April 2011
Accepted: 2 August 2011
Published: 2 August 2011
Among next generation sequence technologies, platforms such as Illumina and SOLiD produce short reads but with higher coverage and lower cost per sequenced nucleotide than 454 or Sanger. A challenge now is to develop efficient strategies to use short-read length platforms for de novo assembly and marker development. The scope of this study was to develop a de novo assembly of carrot ESTs from multiple genotypes using the Illumina platform, and to identify polymorphisms.
A de novo assembly of transcriptome sequence from four genetic backgrounds produced 58,751 contigs and singletons. Over 50% of these assembled sequences were annotated allowing detection of transposable elements and new carrot anthocyanin genes. Presence of multiple genetic backgrounds in our assembly allowed the identification of 114 computationally polymorphic SSRs, and 20,058 SNPs at a depth of coverage of 20× or more. Polymorphisms were predominantly between inbred lines except for the cultivated x wild RIL pool which had high intra-sample polymorphism. About 90% and 88% of tested SSR and SNP primers amplified a product, of which 70% and 46%, respectively, were of the expected size. Out of verified SSR and SNP markers 84% and 82% were polymorphic. About 25% of SNPs genotyped were polymorphic in two diverse mapping populations.
This study confirmed the potential of short read platforms for de novo EST assembly and identification of genetic polymorphisms in carrot. In addition we produced the first large-scale transcriptome of carrot, a species lacking genomic resources.
The most widely cultivated member of the Apiaceae, carrot (Daucus carota var. sativus L.) is a diploid (2n = 2x = 18) with a relatively small genome of 480 Mb. It has a history of cultivation as a root crop that dates back about 1100 years to Afghanistan . Today, carrot is the largest single source of provitamin A carotenoids in the U.S. diet , and it is among the top ten vegetable crops globally in term of area of production and market value [http://faostat.fao.org/faostat/, [1, 3]]. World carrot production is increasing and this has, in part, been attributed to an increased awareness of health benefits associated to carrot consumption by consumers .
Phenotypic and molecular diversity of carrot is expansive  and this diversity has been important in improving nutritional value and consumer quality; disease and pest resistance; and yield characteristics important for growers. Carrot genetic linkage maps have primarily been developed with anonymous AFLPs and RAPDs that require no prior genomic information, but these techniques yield primarily dominant markers . Sequence-tagged codominant markers have been developed to facilitate selection for several traits, including major genes for carotenoid accumulation , sugar type , root-knot nematode resistance , and for 22 genes in the carotenoid biosynthetic pathway targeted for candidate gene analysis . Although the carrot plastome has been sequenced , a 17-fold bacterial artificial chromosome library (BAC) has been developed and characterized , and microsatellite markers are being developed (unpublished data, Cavagnaro et al.), only relatively modest genomic resources have been developed for carrot .
ESTs have been a valuable resource to develop SNP (Single Nucleotide Polymorphism) and SSR (Simple Sequence Repeat) markers for many plants and animals, but as of Feb 22, 2011, only 2,914 non-organellar DNA sequences of Daucus carota are available in GenBank. An expansion of genomic sequence resources for carrot will be useful for a broad range of carrot genomic applications including the development of co-dominant markers such as SNPs and SSRs for genome mapping and diversity assessment.
The development of ESTs has historically relied upon Sanger sequence with some recent efforts also using longer-read next-generation sequencing technology such as Roche 454. Next generation sequence (NGS) technologies are revolutionizing molecular biology by lowering the cost per sequenced nucleotide and increasing the throughput . Short reads platforms such as Illumina and SOLiD produce higher coverage and lower cost per sequenced nucleotide, but due to their shorter reads the use of those platforms has usually been restricted to resequencing applications which rely on a reference sequence for assisting the assembly. In absence of a reference sequence, a computational de novo assembly approach is required. With increased read length from technologies such as Illumina, and the development of new computational tools, short reads can be assembled and used for transcriptome analysis . Recently three de novo assemblies using sequence reads from Illumina technology have been developed and described for plants [15–17]. Currently there are few genomic sequence or ESTs available for carrot.
The objectives of this study were to sequence carrot ESTs using the Illumina platform and to characterize the carrot transcriptome to develop a molecular resource for marker development and gene identification. This represents an application of short read sequence technology for transcriptome assembly of a plant lacking extensive genomic molecular resources. This EST collection will provide a valuable resource for genetic, diversity, structural and functional genomic studies in carrot.
Sequencing and assembly
To develop an overview of the carrot transcriptome and obtain an initial comparison of cultivated and wild carrot transcripts, normalized cDNA libraries were constructed from four sources: two orange unrelated inbred lines of European origin, B493 and B6274 with Imperator and Nantes root shapes, respectively, ; a purple/yellow inbred line B7262 derived from an intercross between purple Turkish and orange Danvers (European) carrots ; and a pool of F4 RILs derived from a cross between B493 and QAL, a wild carrot from North America. These pooled F4's are referred to as B493 × QAL and were derived from a single B493 × QAL F1 plant. Therefore at most two haplotypes are represented among these transcripts. A total of 18 044 high quality Sanger sequences were obtained from the B493 library to generate a total of 8,221,411 nt with an average length of 456 nt. The three other libraries, B6274, B7262 and B493 × QAL were sequenced with Illumina GAII platform with 61 cycles to yield from 34 M to 39 M usable reads of 41 nt or longer for each genotype.
To test the quality of the assembly, we compared 20 full-length carrot mRNA sequences available from NCBI as references (Additional file 1, Figure S3). Corresponding de novo contigs were located using a BLASTN search, and a single best match for each de novo contig was found for each of the 20 reference genes. Raw Illumina and Sanger reads from each genotype were mapped onto each reference sequence and its corresponding de novo contig. All reference sequences were well covered by raw reads except at the ends, with 3' and 5' regions having relatively low coverage. Five of these 20 sequences were partially covered (353 to 675 nt) by B493 Sanger reads (data not shown). The average coverage among three Illumina sequenced-genotypes ranged from 32 (DCMY B) to 660 (DcPRP) reads. Fifteen of the 20 Sanger-derived reference cDNAs were 100% covered by Illumina reads. Of those, four were somewhat longer (40-100 base pairs) than reference sequences, perhaps due to differences in trimming of reads and coverage. The remaining five were partially covered by raw Illumina reads.
Results of BLASTN of anthocyanin reference sequences against EST consensus sequences in this study
Reference sequence source
Contigs with hits to reference
Phenylalanine ammoniam-lyase (PAL)
cinammate 4-hydroxylase (CA4H)
4-coumaroyl-coenzyme A ligase (4CL)
Chalcone synthase (CHS)
Chalcone--flavonone isomerase (CHI)
flavanone 3-hydroxylase (F3H)
Flavonoid 3'-monooxygenase (F3'H)
Petunia x hybrida
flavonol synthase (FLS)
Flavonoid 3',5'-hydroxylase (F3'5'H)
Dihydroflavonol 4-reductase (DFR)
leucoanthocyanidin dioxygenase 2 (LDOX)
Anthocyanidin 3-O-glucosyltransferase (UFGT)
Previously unreported carrot anthocyanin genes
Reference sequence source
Carrot EST contig
Contig length (nt)
Transposable elements (TEs) play an important role in shaping plant phenotypes, including an invertase mutant in carrot that conditions the type of sugar in storage roots [8, 24]. In order to identify ESTs potentially originating from TEs, we selected 360 contigs annotated by BLAST2GO as TE-related. They were queried against RepBase ver. 15.12 (Dec 18, 2010) using Censor . The resulting file (Additional file 3, Table S3) was manually searched for contigs with over 100 nt similarity to a particular superfamily and these contigs were subsequently assigned to that superfamily based on the classification system proposed by . This resulted in the classification of 204 TE-related contigs. Transcripts related to Class I elements were four times more abundant than those of Class II (163 and 41 contigs, respectively). ESTs related to LTR retrotransposons, gypsy-derived transcripts were the most abundant. Transcripts originating from non-LTR retroelements and DNA transposons of the TIR order were also identified, while no Helitron-related transcripts were observed. ESTs attributed to hAT, PIF/Harbinger, and CACTA superfamilies were three times more abundant than those from Tc1/Mariner and Mutator superfamilies.
We also searched EST sequences for regions derived from known carrot Class II transposable elements, namely DcMaster-like, including Krak MITEs , a family of Stowaway-like MITEs - DcSto (unpublished data, Grzebelus et al.), non-autonomous hAT elements - Dc-hAT1 (unpublished data, Grzebelus et al.), and a family of CACTA elements - Tdc (Additional file 1, Table S4). Only for the latter group of TEs could transposase-specific transcripts be detected. Three ESTs represented Tdc A1 transposase, while five others were likely chimeric transcripts containing portions of the transposase fused with other coding sequences (Additional file 3, Table S5). Only one of the Tdc1 transposase-specific ESTs (Contig3495) was identified as CACTA-derived in the general screen described above. In contrast, DcMaster transposase-specific transcripts were not detected, even though in the general screen 13 ESTs were attributed to the PIF/Harbinger superfamily. All hits found for DcMaster/Krak, DcSto, and Dc-hAT1 elements originated from non-coding regions of non-autonomous elements. DcSto elements were the most abundantly represented in the ESTs, including three transcripts containing complete copies of the MITEs. In a number of ESTs, regions derived from the non-autonomous TEs were present near the 5' or 3' end of the functional protein-coding transcript (Additional file 3, Table S6), suggesting that they were inserted in the 5' or 3'UTRs.
In total, 92% of the contigs contained sequences from at least two genotypes, which make the assembly a suitable resource for detecting candidate polymorphic markers such as SNPs, SSRs, and Indels (insertions or deletions).
We identified 8823 putative SSRs in 6,995 contigs (11.2% of the total 58,751 contigs) with 1,394 contigs containing two or more SSRs. A total of 323 SSRs were categorized as compound repeats. Repeat numbers ranged from 3 to 28 and length ranged from 12 to 84 nt. Trimers were the most abundant repeat motif accounting for 4,196 (47.6%) of the SSRs. Hexamers were the least common motifs with 527 (6%) of the SSRs (Additional file 1, Table S7).
Based on sequence alignments, we identified in silico polymorphic SSRs among the four genotypes. We detected 114 candidate polymorphic markers with SSR allele difference ranging from 3 nt to 14 nt. Trimers were confirmed to be the most abundant motif with 85 (75%) of the polymorphic SSRs (data not shown). Pentamers were the least common motif with 1 polymorphic SSR.
Evaluation of SSR primer pairs in four carrot genetic stocks used to develop the EST library
Number of primers
Percentage of all primer tested
Percentage of amplified primer
Single expected product
Single larger product
Evaluation of SNP primer pairs in four carrot genetic stocks used to develop the EST library
Number of primers
Percentage of all primer tested
Percentage of amplified primer
Percentage of amplified and sequenced primers
Single expected product
Single larger product
Primer pairs that amplify a larger than expected products are presumed to be the result of an intron between the two primer sites. To evaluate how we could reduce the number of markers in this category which complicate SNP primer design, we predicted introns for the 20,148 SNPs using the Solanaceae Genomics Network Intron Finder, which uses an Arabidopsis database (Additional file 1, Table S8). The number of SNPs with no predicted nearby intron was 11,183 (55.5%), in 4,814 contigs. We compared our SNP PCR validation rate with this intron prediction. Of the 354 designed and tested primers, 46% (162) gave a single expected product (SEP) (see SNP validation above and Table 4). Computational intron prediction found 120 SEP primer pairs, which would be candidate for primer design (Additional file 1, Table S8). Of this 91 (76%) were validated by PCR. That prediction would also have rejected 71 [30% (162 minus 91)] useful SEP category. Nevertheless, in our case, the use of this intron-prediction tool would have increased success rate of single expected products from 46% to 76%. Using this approach, primer design and contig detection can be improved to optimize SNP discovery.
SNPs polymorphism within mapping populations
The 212 polymorphic SNP markers were screened in two mapping populations; B493 × QAL , which had alleles identified from this EST project, and an unrelated mapping population 70349 (unpublished data, Cavagnaro et al.). Ten genotypes from each mapping population were screened by PCR and all amplicons sequenced, thus allowing identification of polymorphic markers. In total, 48 polymorphic markers (23%) were identified in the B493 × QAL population and 50 (24%) were polymorphic in the 70349 mapping population (Additional file 1, Table S9), with 11 polymorphic markers common to both populations. Thus, a total of 87 polymorphic markers were identified. Overall, 41% (87/212) of the SNP markers that were polymorphic among the genotypes used for developing the EST datasets, and 28% (87/311) of the primer pairs that produced amplicons were polymorphic in the two mapping populations. These markers have potential for mapping and as anchor points for carrot map merging.
Next generation sequencing technologies are revolutionizing molecular biology by lowering the cost per sequenced nucleotide, increasing the throughput and eliminating laborious bacterial cloning. Despite of the shorter sequence reads, compared to Sanger sequencing, the lower cost and high coverage of NGS are the two main factors that drive researchers to use these new technologies. Due to the longer reads produced by the Roche GS FLX technology, this NGS platform is most commonly used for de novo transcriptome sequencing. This platform has been used for transcriptome sequencing of pine , oats , Aegilops and buckwheat . In contrast, short read-length platforms such as Illumina and SOLiD, which produce higher coverage and lower cost per sequenced nucleotide, have been relegated to resequencing applications which usually depend, for their assembly, on a reference sequence. With the improvement of read length for technologies such as Illumina, and the development of new computational tools, we have demonstrated that short reads can be assembled and used for transcriptome analysis. Indeed, other recently de novo transcriptome assemblies using Illumina sequences have been successfully developed and described in Ipomoea, whitefly , Eucalyptus chickpea  and orchids . Consistent with previous work, our results demonstrate that short reads can be assembled and used for transcriptome analysis, gene identification and marker development in carrot. We assembled 58,751 contigs and singletons from 114 M Illumina reads and 18,044 Sanger sequences from four different genotypes.
Quality of the de novo assembly was confirmed by sequence comparison, annotation and marker validation. Comparison of assembled contigs with full-length cloned carrot gene sequences confirmed the high quality of the assembly. Seventy-five percent of the contigs aligned nearly completely with mRNA reference sequences. These results were similar to those previously obtained by Mizrachi and colleagues  in Eucalyptus. Distributions of genotype-specific contigs in the different EST datasets of carrot were similar, with B493 × QAL and B7262 showing the highest number of reads in common. In addition, only 2.3% of all the EST contigs were unique to the wild x cultivated (B493 × QAL) dataset. Altogether, these results suggest that the wild carrot transcriptome is not significantly different from the cultivated carrot transcriptome, which is consistent with cross-ability among wild and cultivated carrot in D. carota.
About 67% and 55% of the contigs exhibited homology using BLASTX and BLAST2GO, respectively, indicating that contiguity of the sequence is consistent for most of the assembled transcriptome. BLAST2GO annotation indicated that a wide range of transcriptome diversity was included in the ESTs we evaluated. Contigs without significant matches to the existing databases could reflect either novel, carrot-specific genes or could reflect a poor representation of Apiales sequences in GenBank.
Manual annotation confirmed expression of 11 known carrot anthocyanin genes and allowed identification of five new ones. In addition to the three known carrot phenylalanine ammonia-lyase (PAL) genes, identification of two new PAL genes suggests that multiple -and diverse- members comprise this gene family in carrot. The five previously-unreported anthocyanin biosynthesis genes discovered in this study confirms the usefulness of this new molecular resource for discovering genes of carrot.
Transcripts related to transposable elements constituted 0.34% of the assembled contigs. As TE-related transcripts were initially identified on the basis of the BLAST2GO annotation, that estimate should be viewed as conservative. For example, no Helitron-related contigs were found in the carrot transcriptome, which might be due to the fact that few transcripts originating from Helitrons have been annotated in the GenBank database. Our observation of a range of functional TE transcripts suggests that members of many TE families could potentially be active in carrot. Earlier reports indicated that Tdc elements were activated in the course of long term in vitro cultures  and this along with our observations that DcMaster and related MITEs were highly polymorphic, likely suggests their very recent activity [24–27].
Several contigs contained MITE-derived regions, usually located close to the 5' or 3' ends. This was observed for both Tourist-like (Krak, KrakL1), Stowaway-like (DcSto), and hAT-like (Dc-hAT1) elements, indicating that MITEs in the carrot genome were associated with non-coding regions of genes, similar to their reported occurrence in grasses [37–39]. In contrast, CACTA elements did not show this pattern of insertion preference.
Verification of assembled contigs through PCR amplification represents a good test of quality of the assembly. The goals of this study were to use a multi-genotype based assembly to develop co-dominant molecular markers, such as Simple Sequence Repeats (SSRs) and Single Nucleotide Polymorphisms (SNPs). SSR trimers were the most abundant repeat motif, confirming previous observations by Cavagnaro and colleagues (unpublished data).
Use of multiple genetic backgrounds in our EST analysis allowed us to identify 114 computational polymorphic SSRs and 20,058 SNPs at a depth of coverage of at least 20. Most of the polymorphic SNPs found in carrot inbred lines were polymorphic when compared against other lines but were monomorphic within lines. This observation indicates that transcriptome comparison is an efficient strategy to identify SNP markers for molecular genetic mapping and diversity analysis. Within-sample SNP polymorphism in the cultivated × wild carrot (B493 × QAL) derivatives was highest (Figure 4). As QAL is expected to be heterozygous this sample was designed as a pool of RILs to represent alleles segregating in the B493 x QAL mapping population. SNP polymorphism (35%) between B493 × QAL and the two cultivated genotypes and among cultivated genotypes used in this study was also high. Although the gene content in wild and cultivated carrot was highly similar, there is a high degree of allelic variation among them.
Primers were designed and tested for 114 computational polymorphic SSRs and 354 SNP loci. Of these, we were able to amplify 102 SSRs (~90%) and 311 SNPs (~88%), with 27 and 110 markers, respectively, amplifying a product larger than expected, suggesting the presence of intron(s) within the amplicon. Validation rate (expected size rate and polymorphic rate) showed that our results were similar or higher than what was previously obtained in Cajanus, iris , Epimedium, Pinus, chickpea , Cryptomeria, apple , bean  and oat  where Sanger, 454, and Illumina platforms were used for sequencing.
To evaluate how intron prediction could affect SNP validation rate we predicted introns using the Sol Genomics Network Intron Finder Arabidopsis database. Based on our SNP validation data, intron prediction would increase the yield of single expected size PCR products from 46% to 76%. In contrast, due to the genetic distance between carrot and Arabidopsis, carrot specific regions would be excluded and decrease the total number of useful SNPs by about 20%. Our data suggests that for species unrelated to Arabidopsis it would be better to use both introns predicted and empirical data for assay design to maximize validation rate and evaluate genetic diversity.
In our evaluation of two mapping populations, the B493 × QAL population had alleles identified directly from the ESTs, whereas the second mapping population, 70349 was unrelated to our EST sequence data. Interestingly, about a 25% of the 212 SNPs evaluated were polymorphic in both mapping populations. About 13% of the SNPs were polymorphic in both mapping populations, the remainder being polymorphic in one population but not the other. This small-scale assay provides important information useful in predicting the number of markers to screen in designing high throughput molecular assays.
In this study we confirmed the potential of using a short read sequencing platform for de novo assembly producing the first large-scale transcriptome sequence set of carrot a species lacking genomic resources. EST characterization provided evidence of the usefulness of this resource for gene detection and mapping of carrot. In addition we demonstrated that transcriptome comparisons provide an efficient strategy for marker development allowing detection and validation of computational polymorphic SSRs and a large set of SNPs.
Carrot material for inbred lines, B6274, B7262, and B493, as well as the pool of F4 B493xQAL RILs were grown in pots under greenhouse conditions. Root and leaf tissues were harvested after ten weeks post planting, with the leaf tissue separated from the root immediately upon harvest. Both the leaf and the storage root tissues were flash frozen in liquid nitrogen and stored at -80°C.
A CTAB based RNA extraction protocol modified from Chang and colleagues  was used to extract RNAs for both the Sanger and Illumna sequencing projects. For the pooled extraction of the F4 B493xQAL RILs, RNAs were extracted from the leaf tissue and root tissue of individual plants representing a variety of pigmented carrot root tissue and then pooled after extraction using equal amounts of mRNA from each tissue type and genotype. The extracted RNAs were analyzed for potential degradation by gel electrophoresis. RNA concentration was quantified using a NanoDrop spectrophotometer (NanoDrop Technologies). cDNA was synthesized and prepared for paired-end sequencing as described by Illumina . Samples were sheared, 300-350 bp fragments selected, and were normalized using double-stranded nuclease that digests high copy double stranded DNA during re-association after denaturation. For Sanger sequencing, normalized cDNA libraries were constructed from root and leaf as described above for B493 with insert sizes ranging from 1.0-2.0 kb and above 2 kb and cloned into pDNR-LIB using the SMART(TM) cDNA library kit (Clontech, Palo Alto, CA).
Sequencing and assembly
Illumina sequencing was performed with the GAII platform at the UC Davis Genome Center (University of California, Davis, USA) according to the manufacturer's instructions (Illumina, San Diego, CA). Twenty thousand reads were attempted with T7 primer on Sanger 3730 (Symbio Inc, Menlo Park, CA).
Details of assembly procedure are reported in additional file 4. Briefly, Sanger read basecalls and quality scores were made with phred version 0.020425.c . Vector sequence and low quality bases were trimmed with Lucy version 1.19p . Resulting sequences and quality files were assembled with CAP3 version 12/21/07  with default parameters.
Three versions of the original Illumina reads from each genotype were created to be used for assembly: unmodified original reads (A); 10 base pairs trimmed (in order to remove low quality reads) sequences from the left (5') (B) and right (5') (C) Optimal kmer length was determined by assembling one genotype at all possible kmer values from 20 to 60 using Velvet version 0.7.55 . The kmer length of 41 was selected for further assemblies. Each of the three Illumina read sets for each of the three genotypes was assembled separately. The resulting three assemblies within each genotype were merged by assembling with CAP3. Unmodified original reads (set A) were also assembled using ABySS version 1.0.15 . Optimal kmer length was determined by performing assemblies on one genotype at a time and each of the three Illumina read sets was assembled separately. Resulting contigs and singlets from the CAP3 Sanger sequence assembly, Velvet+CAP3 Illumina sequence assemblies and ABySS Illumina sequence assemblies were assembled into a consensus assembly with CAP3, generating the reference de novo multi-genotype assembly. Illumina sequences generated in this study have been deposited in Data Bank of Japan (DDBJ) Sequence Read Archive (DRA, http://trace.ddbj.nig.ac.jp/dra/) with accession number SRA 035376. Sanger sequences have been deposited in NCBI Expressed Sequence Tags database (dbEST, http://www.ncbi.nlm.nih.gov/dbEST/) with accession number JG753039-JG771082.
To verify quality of the assembly, 20 full-length carrot genes available in GenBank were used to map raw Illumina reads and align the correspondent de novo contigs. Alignment of reads against full-length reference sequences and correspondent contig was carried out using BLAST ver. 2.2.24+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) with the following parameters: e-value: 10; dust filter; off; minimum blast hit length: 51 nt; minimum blast hit percent match: 80. A global pairwise alignment of the full-length reference sequence and corresponding contig was performed using the Needle program ver. 6.3.1 from the EMBOSS package .
Homology search and functional annotation
Assembled sequences were used for blast searches and annotation against the NCBI nr database using a cutoff e-value of e-05 and minimum coverage length ≥33aa. Daucus protein sequences greater than 33 amino acids available in GenBank were used for blast (TBLASTX) analysis against our EST collection, using a cutoff value of e-05 and minimum coverage length ≥100 nt. Functional annotation and gene ontology term assignment was carried out using BLAST2GO (http://www.geneontology.org). In order to find ESTs potentially originating from anthocyanin genes, 21 complete sequences from GenBank were selected and blasted against our local database with a cutoff e-value of e-05. We also searched for ESTs potentially originating from transposable elements (TEs). We filtered contigs annotated as TE-related from the BLAST2GO output. They were queried against RepBase ver. 15.12 (Dec 18, 2010) using Censor .
In order to identify transcripts containing fragments of previously described carrot Class II transposons DcMaster, Krak, and Tdc1, together with unpublished MITEs DcSto and Dc-hAT1, their consensus sequences were used as blast queries against the EST database with e-value cutoff equal e-02.
Identification of EST-SSRs and SNPs
SSR motifs were identified using MISA 1.0 (MIcroSAtellite; http://pgrc.ipk-gatersleben.de/misa/) , which identifies both perfect and compound repeats. We searched for di-, tri-, tetra-, penta- and hexa-nucleotide repeats with a minimum of six repeat units for dinucleotides, four for trinucletides and three repeat units for tetra, penta and hexanucleotides. Adjacent microsatellites ≤10 nt apart were considered compound repeats.
Polymorphic SSRs were detected computationally by a custom Perl program that analyzed the output of the final CAP3 assembly stage. Indels from 3 nt to 50 nt in size, and with at least 25 nt of flanking sequence were considered.
SNP detection was performed using Mosaik 1.0.1388  with the following parameters: maximum hash positions per seed: 100; alignment candidate threshold: 20. This resulted in the detection of 346,456 SNPs. For marker validation and data analysis we reduced this to 20,148 SNPs using the following parameter: most abundant allele minimum read coverage: 10; second most abundant allele minimum read coverage: 10; minimum average coverage: 10; minimum flanking length: 100 nt; minimum quality score: 0.99; minimum absolute isolation: 30 nt.
Primer design and amplification
Primer pairs flanking the SSR motifs were designed using MISA with the following parameters: end stability: 250; minimum size: 100 nt; maximum size: 300 nt. In total we synthesized 114 primer pairs (Additional file 3, Table S10).
Primer pairs flanking the SNPs were designed using Primer3  with the following parameters: end stability: 250; optimum Tm: 60°C; minimum size: 120 nt; maximum size: 201 nt. In total we synthesized 354 primer pairs (Additional file 3, Table S11).
Primer validation was carried out on genomic DNA of the B493 × QAL wild carrot derivatives and three cultivated genotypes, B493, B6272 and B7262. Microsatellite flanking primers were tested in a PCR of 20 μl volume containing 13 μl water, 2 μl 10X DNA polymerase buffer, 0.8 μl dNTPs (2.5 mM each), 1 μl 5 μM of each primer, 0.2 μl Taq polymerase (MBI, Fermentas, USA) and 2 μl of genomic DNA (~50 ng). PCR conditions were: initial denaturation at 94°C for 2 min., followed by 33 cycles of 94°C for 45 sec., Tm (°C) for 1.0 min., 72°C for 1.0 min. and 20 sec., and a final step at 72°C for 10.0 min.. Electrophoresis was carried out for 2-3 hours at 100 V on 2% agarose TAE gels supplemented with 0.2 μg/ml of ethidium bromide.
To verify the predicted polymorphism of SSRs, 31 primer pairs were tested using a fluorescent method . PCR was performed in a 20 μl final volume including 12.5 μl water, 2 μl 10X DNA polymerase buffer, 0.8 μl dNTPs (2.5 mM each), 1 μl 5 μM of reverse primer, 0.5 μl 5 μM of M13-tailed forward primer, 1 μl 5 μM of M13 labelled either with 6-FAM, HEX or NED fluorochromes, 0.2 μl Taq polymerase (MBI, Fermentas, USA) and 2 μl of genomic DNA (~50 ng). The amplification conditions were 94°C for 2 min; 10 cycles of 94°C for 40 sec, 60°C for 1 min with a reduction of 0.5°C each cycle, 72°C for 1 min; 40 cycle of 94°C for 45 sec., 60°C for 1.0 min., 72°C for 1.0 min. and 20 sec, and a final step at 72°C for 10.0 min. Amplicon lengths were estimated using an ABI 3730xl capillary sequencer available at the University of Wisconsin Biotechnology Center and analyzed with GeneMarker software version 1.5 (SoftGenetics, State College, Pennsylvania).
Single Nucleotide Polymorphism validation was carried out on a PCR of 20 μl volume containing 12.2 μl water, 2 μl 10X DNA polymerase buffer, 1.6 μl dNTPs (2.5 mM each), 1 μl 5 μM of each primer, 0.2 μl Taq polymerase (MBI, Fermentas, USA) and 2 μl of genomic DNA (~50 ng). PCR conditions were: initial denaturation at 94°C for 2 min., followed by 25 cycles of 94°C for 30 sec., appropriate annealing temperature (Table S4) for 30 sec., and 72°C for 45 sec., and a final step at 72°C for 10 min. Presence and length of the amplicon was detected on 2% agarose TAE gels supplemented with 0.2 μg/ml of ethidium bromide, and separated for 2-3 hours at 100 V. To verify presence of the expected SNP, single products of expected size and single products larger than expected, with an overall size shorter than 500 bp, were sequenced. Sequencing reactions were performed in a 5 μl final volume including, 1.75 μl of water, 1 μl of 5 μM primer, 0.75 μl 5 × BigDye®3.1 sequencing buffer, 0.5 μl of BigDye®3.1 ready reaction mix and 1 μl of PCR product, previously diluted 1:10 with water. Amplification conditions were: 25 cycles of 96°C for 10 sec., and 58°C for 2 minutes, and a final step at 72°C for 5.0 min. The sequences were generated by the University of Wisconsin Biotechnology Center and analyzed using Sequencher software version 4.8 (GeneCodes Corporation, Ann Arbor, MI).
Intron prediction was carried out using Intron Finder (http://solgenomics.net/tools/intron_detection/find_introns.pl) with a cutoff e-value of e-50. Intron prediction results for the 354 assembled contigs screened for SNPs detection, were compared with our validation data results.
SNP polymorphisms within mapping populations
The in silico predicted polymorphic SNP markers were screened in two mapping populations including B493 × QAL  and 70349. Ten genotypes from each mapping population were screened on a PCR of 15 μl volume containing 12.2 μl water, 2 μl 10X DNA polymerase buffer, 1.6 μl dNTPs (2.5 mM each), 1 μl 5 μM of each primer, 0.2 μl Taq polymerase (MBI, Fermentas, USA) and 2 μl of genomic DNA (~50 ng). PCR conditions were: initial denaturation at 94°C for 2 min., followed by 25 cycles of 94°C for 30 sec., appropriate annealing temperature (Table S5) for 30 sec., and 72°C for 45 sec., and a final step at 72°C for 10 min. Quality of the amplicon was detected on 2% agarose TAE gels supplemented with 0.2 μg/ml of ethidium bromide, and separated for 2-3 hours at 100 V. To detect SNP polymorphism, PCR products were analyzed by sequencing as previously described.
amplified fragment length polymorphism
expressed sequences tag
polymerase chain reaction
single nucleotide polymorphism
quantitative trait loci
bacterial artificial chromosome
random amplified polymorphic DNA
simple sequence repeat
next generation sequence
We acknowledge financial support from the carrot industry and the following vegetable seed companies: Bejo, Nunhems, Rijk Zwaan, Takii, and Vilmorin. We would like to thank Alex Kozik for providing scripts and guidance for sequence assembly.
- Simon PW, Rubatzky VE, Bassett MJ, Strandberg JO, White JM: B7262, Purple carrot inbred. HortScience. 1997, 32: 146-147.Google Scholar
- Simon PW, Pollak LM, Clevidence BA, Holden JM, Haytowitz DB: Plant breeding for human nutritional quality. Plant Breed Rev. 2009, 31: 325-392.Google Scholar
- Vilela NJ: Cenoura: un alimento nobre ne mesa popular. Horticultura Brasileir. 2004, 22: 1-2.Google Scholar
- Arscott SA, Tanumihardjo SA: Carrots of many colors provide basic nutrition and bioavailable phytochemicals acting as a functional food. Comprehensive reviews in Food Science and Food Safety. 2010, 9: 223-239. 10.1111/j.1541-4337.2009.00103.x.View ArticleGoogle Scholar
- Bradeen JM, Bach IC, Briard M, le Clerc V, Grzebelus D, Senalik DA, Simon PW: Molecular diversity analysis of cultivated carrot (Daucus carota L.) and wild Daucus populations reveals a genetically nonstructured composition. J Amer Soc Hort Sci. 2002, 127: 383-391.Google Scholar
- Bradeen JM, Simon PW: Carrot. Genome Mapping and Molecular Breeding in Plants. Edited by: Kole C. 2007, Berlin: Springer-Verlag, 161-184.Google Scholar
- Bradeen JM, Simon PW: Conversion of an AFLP fragment linked to the carrot Y2 locus to a simple, codominant, PCR-based marker form. Theor Appl Genet. 1998, 97: 960-967. 10.1007/s001220050977.View ArticleGoogle Scholar
- Yau YY, Simon PW: A 2.5-kb insert eliminates acid soluble invertase isozyme II transcript in carrot (Daucus carota L.) roots, causing high sucrose accumulation. Plant Mol Biol. 2003, 53: 151-162.PubMedView ArticleGoogle Scholar
- Boiteux LS, Hyman JR, Bach IC, Fonseca MEN, Matthews WC, Roberts PA, Simon PW: Employment of flanking codominant STS markers to estimate allelic substitution effect in a nematode resistance locus in carrot. Euphytica. 2004, 136: 37-44.View ArticleGoogle Scholar
- Just BJ, Santos CAF, Fonseca MEN, Boiteux LS, Oloiza BB, Simon PW: Carotenoid biosynthesis structural genes in carrot (Daucus carota): isolation, sequence-characterization, single nucleotide polymorphism (SNP) markers and genome mapping. Theor Appl Genet. 2007, 114: 693-704. 10.1007/s00122-006-0469-x.PubMedView ArticleGoogle Scholar
- Ruhlman T, Lee SB, Jansen RK, Hostetler JB, Tallon LJ, Town CD, Daniell H: Complete plastid genome of Daucus carota: implications for biotechnology and phylogeny of angiosperms. BMC Genomics. 2006, 7: 222-10.1186/1471-2164-7-222.PubMed CentralPubMedView ArticleGoogle Scholar
- Cavagnaro PF, Chung SM, Szklarczyk M, Grzebelus D, Senalik D, Atkins AE, Simon PW: Characterization of a deep-coverage carrot (Daucus carota L.) BAC library and initial analysis of BAC-end sequences. Mol Genet Genom. 2009, 281: 273-288. 10.1007/s00438-008-0411-9.View ArticleGoogle Scholar
- Bhasi A, Senalik D, Simon PW, Kumar B, Manikandan V, Philip P, Senapathy P: RoBuST: An integrated resource of genomics information for plants in the root and bulb crop families Apiaceae and Alliaceae. BMC Plant Biol. 2010, 10: 161-10.1186/1471-2229-10-161.PubMed CentralPubMedView ArticleGoogle Scholar
- Paszkiewicz K, Studholme DJ: De novo assembly of short sequence reads. Briefings in bioinformatics. 2010, 5: 457-472.View ArticleGoogle Scholar
- Wang Z, Fang B, Chen J, Zhang X, Luo Z, Huang L, Chen X, Li Y: De novo assembly and characterization of root transcriptome using Illumina paired-end sequencing and development of cSSR markers in sweetpotato (Ipomoea batatas). BMC Genomics. 2010b, 11: 726-10.1186/1471-2164-11-726.View ArticleGoogle Scholar
- Mizrachi E, Hefer CA, Ranik M, Joubert F, Myburg AA: De novo assembled expressed gene catalog of a fast-growing Eucalyptus tree produced by Illumina mRNA-Seq. BMC Genomics. 2010, 11: 681-10.1186/1471-2164-11-681.PubMed CentralPubMedView ArticleGoogle Scholar
- Garg R, Patel RK, Tyagi AK, Jain M: De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification. DNA Research. 2011, 1: 11-Google Scholar
- Simon PW, Peterson CE, Gabelman WH: B493 and B9304, carrot inbreds for use in breeding, genetics and tissue culture. HortScience. 1990, 25: 815-Google Scholar
- Simon PW: Domestication, historical development, and modern breeding of carrot. Plant Breed Rev. 2000, 19: 157-190.Google Scholar
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9: 868-877. 10.1101/gr.9.9.868.PubMed CentralPubMedView ArticleGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.PubMed CentralPubMedView ArticleGoogle Scholar
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol İ: ABySS: A parallel assembler for short read sequence data. Genome Res. 2009, 19: 1117-1123. 10.1101/gr.089532.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Kurilich AC, Clevidence BA, Britz SJ, Simon PW, Novotny JA: Plasma and urine responses are lower for acylated vs. non acylated anthocyanins from raw and cooked purple carrots. J Agric Food Chem. 2005, 53: 6537-6542. 10.1021/jf050570o.PubMedView ArticleGoogle Scholar
- Grzebelus D, Yau YY, Simon PW: Master: a novel family of PIF/Harbinger-like transposable elements identified in carrot (Daucus carota L.). Mol Genet Genomics. 2006, 275: 450-459. 10.1007/s00438-006-0102-3.PubMedView ArticleGoogle Scholar
- Kohany O, Gentles AJ, Hankus L, Jurka J: Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics. 2006, 7: 474-10.1186/1471-2105-7-474.PubMed CentralPubMedView ArticleGoogle Scholar
- Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavel A, Leroy P, Morgante M, Panaud O, Paux E, SanMiguel P, Schulman AH: A unified classification system for eukaryotic transposable elements. Nature Rev Genet. 2007, 8: 973-982. 10.1038/nrg2165.PubMedView ArticleGoogle Scholar
- Grzebelus D, Simon PW: Diversity of DcMaster-like elements of the PIF/Harbinger superfamily in the carrot genome. Genetica. 2009, 135: 347-53. 10.1007/s10709-008-9282-6.PubMedView ArticleGoogle Scholar
- Itoh Y, Hasebe M, Davies E, Takeda J, Ozeki Y: Survival of Tdc transposable elements of the En/Spm superfamily in the carrot genome. Mol Genet Genomics. 2003, 269: 49-59.PubMedGoogle Scholar
- Santos CAF, Simon PW: QTL analyses reveal clustered loci for accumulation of major provitamin A carotenes and lycopene in carrot roots. Mol Genet Genom. 2002, 268: 122-129. 10.1007/s00438-002-0735-9.View ArticleGoogle Scholar
- Parchman TL, Geist KS, Grahnen JA, Benkman CW, Buerkle CA: Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics. 2010, 11: 180-10.1186/1471-2164-11-180.PubMed CentralPubMedView ArticleGoogle Scholar
- Oliver RE, Lazo GR, Lutz JD, Rubenfield MJ, Tinker NA, Anderson JM, Wisniewski Morehead NH, Adhikary D, Jellen EN, Maughan PJ, Brown Guedira GL, Chao S, Beattie AD, Carson ML, Rines HW, Obert DE, Bonman JM, Jackson EW: Model SNP development for complex genomes based on hexaploid oat using high-throughput 454 sequencing technology. BMC Genomics. 2011, 12: 77-10.1186/1471-2164-12-77.PubMed CentralPubMedView ArticleGoogle Scholar
- You FM, Huo N, Del KR, Gu YQ, Luo MC, McGuire PE, Dvorak J, Anderson OA: Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics. 2011, 12: 59-10.1186/1471-2164-12-59.PubMed CentralPubMedView ArticleGoogle Scholar
- Logacheva MD, Kasianov AS, Vinogradov DV, Samigullin TH, Gelfand MS, Makeev VJ, Penin AA: De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum). BMC Genomics. 2011, 12: 30-10.1186/1471-2164-12-30.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang XW, Luan JB, Li JM, Bao YY, Zhang CX, Liu SS: De novo characterization of a whitefly transcriptome and analysis of its gene expression during development. BMC Genomics. 2010, 11: 400-10.1186/1471-2164-11-400.PubMed CentralPubMedView ArticleGoogle Scholar
- Fu CH, Chen YW, Hsiao YY, Pan ZJ, Liu ZJ, Huang YM, Tsai WC, Chen HH: OrchidBase: A collection of sequences of transcriptome derived from orchids. Plant and Cell Physiol. 2011, 55: 238-243.View ArticleGoogle Scholar
- Ozeki Y, Davies E, Takeda J: Plant cell culture variation during long-term subculturing caused by insertion of a transposable element in a phenylalanine ammonia-lyase (PAL) gene. Mol Gen Genet. 1997, 254: 407-416. 10.1007/s004380050433.PubMedView ArticleGoogle Scholar
- Dutta S, Kumawat G, Singh BP, Gupta DK, Singh S, Dogra V, Gaikwad K, Sharma TR, Raje RS, Bandhopadhya TK, Datta S, Singh MN, Bachasab F, Kulwal P, Wanjari KB, Varchney RK, Cook DR, Singh NK: Development of genic-SSR markers by deep transcriptome sequencing in pigeonpea (Cajanus cajan (L.) Millspaugh). BMC Plant Biol. 2011, 11: 17-10.1186/1471-2229-11-17.PubMed CentralPubMedView ArticleGoogle Scholar
- Bureau TE, Wessler SR: Tourist: a large family of inverted-repeat elements frequently associated with maize genes. Plant Cell. 1992, 4: 1283-1294.PubMed CentralPubMedView ArticleGoogle Scholar
- Bureau TE, Wessler SR: Stowaway: a new family of inverted-repeat elements associated with genes of both monocotyledonous and dicotyledonous plants. Plant Cell. 1994, 6: 907-916.PubMed CentralPubMedView ArticleGoogle Scholar
- Bureau TE, Ronald PC, Wessler SR: A computer-based systematic survey reveals the predominance of small inverted-repeat elements in wild-type rice genes. Proc Natl Acad Sci USA. 1996, 93: 8524-8529. 10.1073/pnas.93.16.8524.PubMed CentralPubMedView ArticleGoogle Scholar
- Tang S, Okashah RA, Cordonnier-Pratt MM, Pratt LH, Johnson VE, Taylor CA, Arnold ML, Knapp S: EST and EST-SSR marker resources for Iris. BMC Plant Biology. 2009, 9: 72-10.1186/1471-2229-9-72.PubMed CentralPubMedView ArticleGoogle Scholar
- Zeng S, Xiao G, Guo J, Fei Z, Xu Y, Roe BA, Wang Y: Development of a EST dataset and characterization of EST-SSRs in a traditional Chenese medicinal plant, Epimedium sagittatum (Sieb. Et Zucc.) Maxim. BMC Genomics. 2010, 11: 94-10.1186/1471-2164-11-94.PubMed CentralPubMedView ArticleGoogle Scholar
- Ujino-Ihara T, Taguchi Y, Moriguchi Y, Tsumura Y: An efficient method for developing SNP markers based on EST data combined with high resolution melting (HRM) analysis. BMC Research Notes. 2010, 3: 51-10.1186/1756-0500-3-51.PubMed CentralPubMedView ArticleGoogle Scholar
- Chagné D, Gasic K, Crowhurst RN, Han Y, Bassett HC, Bowatte DR, Lawrence TJ, Rikkerink EHA, Gardiner SE, Korban SS: Development of a set of SNP markers present in expressed genes of the apple. Genomics. 2008, 92: 353-358. 10.1016/j.ygeno.2008.07.008.PubMedView ArticleGoogle Scholar
- Hyten DL, Song Q, Fickus EW, Quigley CV, Lim JS, Choi IY, Hwang EY, Pastros-Corrales M, Cregan PB: High-throughput SNP discovery and assay development in common bean. BMC Genomics. 2010, 11: 475-10.1186/1471-2164-11-475.PubMed CentralPubMedView ArticleGoogle Scholar
- Chang S, Pryear J, Cairney J: A simple and efficient method for isolating RNA from pine trees. Plant Mol Biol Rep. 1993, 11: 113-116. 10.1007/BF02670468.View ArticleGoogle Scholar
- Illumina: mRNA sequencing sample preparation guide. 2009, Illumina, 24-
- Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.PubMedView ArticleGoogle Scholar
- Chou HH, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics. 2001, 17: 1093-1104. 10.1093/bioinformatics/17.12.1093.PubMedView ArticleGoogle Scholar
- Rice P, Longden I, Bleasby A: EMBOSS: the European molecular biology open software suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.PubMedView ArticleGoogle Scholar
- Thiel T, Michalek W, Varshney RK, Graner A: Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet. 2003, 106: 411-422.PubMedGoogle Scholar
- Hillier LW, Marth GT, Quinlan AR, Dooling D, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylle T, Schedl T, Wilson RK, Mardis ER: Whole genome sequencing and variant discovery in C. elegans. Nat Methods. 2008, 5: 183-188. 10.1038/nmeth.1179.PubMedView ArticleGoogle Scholar
- Rozen S, Skaletsky H: Primer3 on the www for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-86.PubMedGoogle Scholar
- Schuelke M: An economic method for fluorescent labeling of PCR fragments. Nature Biotechnology. 2000, 18: 233-234. 10.1038/72708.PubMedView ArticleGoogle Scholar
- Boss PK, Davies C, Robinson SP: Analysis of the expression of anthocyanin pathway genes in developing Vitis vinifera L. cv Shiraz grape barriers and the implications for pathway regulation. Plant Physiol. 1996, 111 (4): 1059-1066.PubMed CentralPubMedGoogle Scholar
- Hummer W, Schreier P: Analysis of proanthocyanidins. Mol Nutr Food Res. 2008, 52 (12): 1382-1398.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.