Next-generation sequencing (NGS) technologies are revolutionizing molecular biology by lowering the cost per sequenced nucleotide, increasing throughput and eliminating laborious bacterial cloning. Despite the shorter sequence reads compared to Sanger sequencing, the lower cost and high coverage of NGS are the two main factors that drive researchers to use these new technologies. Because the Roche GS FLX technology produces longer reads, this NGS platform is the one most commonly used for de novo transcriptome sequencing. It has been used for transcriptome sequencing of pine, oats, Aegilops and buckwheat. In contrast, short read-length platforms such as Illumina and SOLiD, which offer higher coverage and a lower cost per sequenced nucleotide, have been relegated to resequencing applications, which usually depend on a reference sequence for assembly. With the improvement of read length for technologies such as Illumina, and the development of new computational tools, we have demonstrated that short reads can be assembled and used for transcriptome analysis. Indeed, other recent de novo transcriptome assemblies using Illumina sequences have been successfully developed and described in Ipomoea, whitefly, Eucalyptus, chickpea and orchids. Consistent with this previous work, our results demonstrate that short reads can be assembled and used for transcriptome analysis, gene identification and marker development in carrot. We assembled 58,751 contigs and singletons from 114 M Illumina reads and 18,044 Sanger sequences from four different genotypes.
Quality of the de novo assembly was confirmed by sequence comparison, annotation and marker validation. Comparison of assembled contigs with full-length cloned carrot gene sequences confirmed the high quality of the assembly. Seventy-five percent of the contigs aligned nearly completely with mRNA reference sequences. These results were similar to those previously obtained by Mizrachi and colleagues in Eucalyptus. Distributions of genotype-specific contigs in the different EST datasets of carrot were similar, with B493 × QAL and B7262 showing the highest number of reads in common. In addition, only 2.3% of all the EST contigs were unique to the wild × cultivated (B493 × QAL) dataset. Taken together, these results suggest that the wild carrot transcriptome is not significantly different from the cultivated carrot transcriptome, which is consistent with the cross-ability of wild and cultivated carrot in D. carota.
About 67% and 55% of the contigs exhibited homology using BLASTX and BLAST2GO, respectively, indicating that contiguity of the sequence is consistent for most of the assembled transcriptome. BLAST2GO annotation indicated that a wide range of transcriptome diversity was included in the ESTs we evaluated. Contigs without significant matches to the existing databases could reflect either novel, carrot-specific genes or a poor representation of Apiales sequences in GenBank.
Manual annotation confirmed expression of 11 known carrot anthocyanin genes and allowed identification of five new ones. In addition to the three known carrot phenylalanine ammonia-lyase (PAL) genes, identification of two new PAL genes suggests that multiple, diverse members comprise this gene family in carrot. The five previously unreported anthocyanin biosynthesis genes discovered in this study confirm the usefulness of this new molecular resource for gene discovery in carrot.
Transcripts related to transposable elements (TEs) constituted 0.34% of the assembled contigs. As TE-related transcripts were initially identified on the basis of the BLAST2GO annotation, that estimate should be viewed as conservative. For example, no Helitron-related contigs were found in the carrot transcriptome, which might be because few transcripts originating from Helitrons have been annotated in the GenBank database. Our observation of a range of functional TE transcripts suggests that members of many TE families could potentially be active in carrot. Earlier reports indicated that Tdc elements were activated in the course of long-term in vitro cultures, and this, along with our observations that DcMaster and related MITEs were highly polymorphic, suggests their very recent activity [24–27].
Several contigs contained MITE-derived regions, usually located close to the 5' or 3' ends. This was observed for Tourist-like (Krak, KrakL1), Stowaway-like (DcSto) and hAT-like (Dc-hAT1) elements, indicating that MITEs in the carrot genome were associated with non-coding regions of genes, similar to their reported occurrence in grasses [37–39]. In contrast, CACTA elements did not show this pattern of insertion preference.
Verification of assembled contigs through PCR amplification represents a good test of the quality of the assembly. A goal of this study was to use a multi-genotype assembly to develop co-dominant molecular markers, such as Simple Sequence Repeats (SSRs) and Single Nucleotide Polymorphisms (SNPs). SSR trimers were the most abundant repeat motif, confirming previous observations by Cavagnaro and colleagues (unpublished data).
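The tally behind a statement such as "trimers were the most abundant repeat motif" can be illustrated with a minimal sketch. The regular-expression search, the function names and the repeat thresholds in `MIN_REPEATS` are illustrative assumptions; they are not the tool or settings used in this study.

```python
import re
from collections import Counter

# Hypothetical minimum repeat counts per motif length (not taken from this study).
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. ATAT, AAA)."""
    k = len(motif)
    return not any(
        motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0
    )


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-mer followed by at least (min_rep - 1) exact tandem copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


def motif_length_counts(contigs):
    """Tally SSRs by motif length across a list of contig sequences."""
    return Counter(
        len(motif) for seq in contigs for _, motif, _ in find_ssrs(seq)
    )


# Toy contig: one (AAG)5 trimer repeat and one (AT)6 dimer repeat.
example = "CCTC" + "AAG" * 5 + "TCTC" + "AT" * 6 + "GGCC"
ssrs = find_ssrs(example)
counts = motif_length_counts([example])
```

Applied to a full contig set, `motif_length_counts` would give the distribution of di-, tri- and tetranucleotide repeats under the chosen thresholds. The ranking of motif classes can shift with those thresholds, so they should be reported alongside any such comparison.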
Use of multiple genetic backgrounds in our EST analysis allowed us to identify 114 computational polymorphic SSRs and 20,058 SNPs at a depth of coverage of at least 20. Most of the SNPs found in carrot inbred lines were polymorphic when compared against other lines but monomorphic within lines. This observation indicates that transcriptome comparison is an efficient strategy to identify SNP markers for molecular genetic mapping and diversity analysis. Within-sample SNP polymorphism was highest in the cultivated × wild carrot (B493 × QAL) derivatives (Figure 4). As QAL is expected to be heterozygous, this sample was designed as a pool of RILs to represent alleles segregating in the B493 × QAL mapping population. SNP polymorphism (35%) between B493 × QAL and the two cultivated genotypes, and among the cultivated genotypes used in this study, was also high. Although the gene content of wild and cultivated carrot was highly similar, there is a high degree of allelic variation between them.
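The distinction drawn above between SNPs that are polymorphic within a sample and SNPs that differ only between samples can be sketched as follows. This is an illustration under stated assumptions, not the SNP-calling procedure used in this study: the depth threshold of 20 comes from the text, but applying it per sample, the allele-fraction cutoff `MIN_ALLELE_FRACTION`, the sample label `inbred_C` and the read counts are all hypothetical.

```python
MIN_DEPTH = 20             # depth-of-coverage threshold stated in the text
MIN_ALLELE_FRACTION = 0.1  # hypothetical cutoff for ignoring likely sequencing errors


def called_alleles(base_counts):
    """Return the set of alleles supported in one sample, or None if depth is too low."""
    depth = sum(base_counts.values())
    if depth < MIN_DEPTH:
        return None
    return frozenset(
        base for base, n in base_counts.items()
        if n / depth >= MIN_ALLELE_FRACTION
    )


def classify_site(site):
    """Return (samples polymorphic within themselves, True if samples differ)."""
    calls = {}
    for sample, base_counts in site.items():
        alleles = called_alleles(base_counts)
        if alleles is not None:
            calls[sample] = alleles
    within = sorted(s for s, a in calls.items() if len(a) > 1)
    between = len(set(calls.values())) > 1
    return within, between


# Hypothetical read counts at one site.
site = {
    "B493xQAL": {"A": 15, "G": 14},  # pooled sample carrying both alleles
    "B7262": {"A": 30, "G": 1},      # the single G read falls below the cutoff
    "inbred_C": {"G": 25},           # hypothetical inbred fixed for the other allele
    "low_cov": {"A": 5, "G": 5},     # below MIN_DEPTH, ignored
}
within_samples, differs_between = classify_site(site)
```

In this toy site only the pooled B493 × QAL sample is polymorphic within itself, while the two inbred samples are each monomorphic but fixed for different alleles. That is the pattern described above for most SNPs in the inbred lines.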
Primers were designed and tested for 114 computational polymorphic SSRs and 354 SNP loci. Of these, we were able to amplify 102 SSRs (~90%) and 311 SNPs (~88%), with 27 and 110 markers, respectively, amplifying a product larger than expected, suggesting the presence of intron(s) within the amplicon. Validation rates (expected-size rate and polymorphic rate) were similar to or higher than those previously obtained in Cajanus, iris, Epimedium, Pinus, chickpea, Cryptomeria, apple, bean and oat, where Sanger, 454 and Illumina platforms were used for sequencing.
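The amplification percentages quoted above follow directly from the marker counts, as the short calculation below shows. The counts are taken from the text; the function and variable names are only illustrative.

```python
def percent(part, total):
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


# Marker counts reported in the text.
ssr_tested, ssr_amplified, ssr_larger = 114, 102, 27
snp_tested, snp_amplified, snp_larger = 354, 311, 110

ssr_amplification = percent(ssr_amplified, ssr_tested)  # about 89.5%
snp_amplification = percent(snp_amplified, snp_tested)  # about 87.9%

# Share of amplified markers whose product was larger than expected.
ssr_larger_share = percent(ssr_larger, ssr_amplified)   # about 26.5%
snp_larger_share = percent(snp_larger, snp_amplified)   # about 35.4%
```

On these counts, roughly a quarter of the amplified SSRs and a third of the amplified SNP amplicons gave a larger-than-expected product.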
To evaluate how intron prediction could affect the SNP validation rate, we predicted introns using the Sol Genomics Network Intron Finder Arabidopsis database. Based on our SNP validation data, intron prediction would increase the yield of single expected-size PCR products from 46% to 76%. However, because of the genetic distance between carrot and Arabidopsis, carrot-specific regions would be excluded, decreasing the total number of useful SNPs by about 20%. Our data suggest that for species unrelated to Arabidopsis it would be better to use both predicted introns and empirical data for assay design to maximize the validation rate and evaluate genetic diversity.
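The trade-off described above can be made concrete with a small calculation. It rests on one possible reading of the figures, which is an assumption rather than something stated in the text: that intron filtering discards about 20% of candidate SNPs and that the 76% success rate applies to the retained candidates. The candidate count of 1,000 is hypothetical.

```python
def expected_single_product_assays(n_candidates, success_rate, fraction_excluded=0.0):
    """Expected number of assays giving a single expected-size PCR product."""
    retained = n_candidates * (1.0 - fraction_excluded)
    return retained * success_rate


n_candidates = 1000  # hypothetical number of candidate SNPs

# No intron filtering: every candidate is assayed, 46% give the expected product.
unfiltered = expected_single_product_assays(n_candidates, 0.46)

# Intron filtering (assumed reading): 20% of candidates dropped, 76% of the rest succeed.
filtered = expected_single_product_assays(n_candidates, 0.76, fraction_excluded=0.20)
```

Under this reading, filtering yields about 608 usable assays per 1,000 candidates compared with about 460 without it, while assaying 200 fewer loci. The excluded loci are the carrot-specific regions noted above, which is why combining predicted introns with empirical data is suggested.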
In our evaluation of two mapping populations, the B493 × QAL population had alleles identified directly from the ESTs, whereas the second mapping population, 70349, was unrelated to our EST sequence data. Interestingly, about 25% of the 212 SNPs evaluated were polymorphic in both mapping populations. About 13% of the SNPs were polymorphic in both mapping populations, the remainder being polymorphic in one population but not the other. This small-scale assay provides information useful for predicting the number of markers to screen when designing high-throughput molecular assays.