We report on two transcriptome assemblies of pepper. The first is based on Sanger-EST sequences and was used in the pepper GeneChip® project. The second is based on a collection of transcriptomes of three pepper lines that were sequenced by IGA technology. The majority of the pepper EST sequences used in the current project were first assembled by Kim et al. (2008), who assembled 22,011 unigenes with an average consensus sequence length of 1,688 bp. However, in order to construct the pepper GeneChip microarray prior to the Kim publication, we added all pepper sequences and resources that were available at the time of the assembly (2006). In addition to C. annuum EST sequences from the Korean F1 hybrid Bukang, we added >700 sequences from other C. annuum cultivars and from other pepper species such as C. baccatum, C. frutescens and C. chinense. We also added pepper genomic and mRNA sequences from GenBank and COS marker sequences from the Solanaceae Genome Network (SGN) and UC Davis. We used a combination of several in-house scripts and CAP3 to make our assembly, whereas Kim et al. took a different approach. Regardless of the methods used to assemble the EST sequences, the database that Kim et al. have created is useful in itself for querying sequence information and finding the annotation of each contig. We have enhanced the information for Sanger pepper ESTs by mining and validating a subset of SNPs from this assembly. We have also leveraged the information to develop an Affymetrix tiling array, which was used to construct two ultra-saturated genetic maps of pepper and to evaluate genetic diversity in pepper breeding germplasm. Overall, we were able to map >17,500 unigenes representing over 3,000 genetic bins of pepper.
In the second pepper assembly we attempted to capture as many transcribed genes as possible by collecting tissues at different developmental stages from three genotypes other than Bukang. Recently, a transcriptome assembly of two pepper parental lines (CM334 and Taean) and their hybrid line (TF68) was carried out by Lu et al. [22, 23]. Lu et al. used the GS-454 FLX Titanium (Roche, Mannheim, Germany) to sequence mRNA collected from fruits of greenhouse-grown peppers. The pepper landrace in the Lu et al. study, CM334, was the same landrace that we used, but they sequenced it with the Roche 454 system and sampled fewer tissues. Furthermore, we normalized our libraries prior to sequencing. Using the GS de novo assembly software (Newbler), they were able to assemble 25,597 (N50 = 911), 29,335 (N50 = 898) and 33,530 (N50 = 884) contigs in CM334, Taean and TF68, respectively. Functional annotation of these contigs was performed with FunCat, which showed that the majority of contigs were associated with proteins with binding function, regulation and metabolism. These results are similar to our functional annotation. The Capsicum transcriptome database, the most recent study of pepper transcriptomes, was introduced by Góngora-Castillo et al. Using Sanger and GS-pyrosequencing technologies, they sequenced thirty-three cDNA libraries of C. annuum var. Sonora Anaheim and C. annuum var. Serrano Tampiqueño. Finally, a hybrid assembly was made with the 454 Newbler program from over 1.9 M 454 reads and the Sanger-EST sequences. This assembly consists of 32,314 contigs with an N50 of 631 and contig lengths ranging from 100 to 3,033 nt. The number of contigs in their assembly was close to that of our Sanger-EST assembly and of the three pepper assemblies reported by Lu et al. [22, 23]. However, the number of contigs might be slightly overestimated because they counted contigs with a minimum length of 100 nt, whereas in our Sanger-EST assembly the smallest contig was 200 nt.
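The comparison above rests on two quantities, the contig count and the N50, and both depend on the minimum contig length that an assembly reports. The following is a minimal sketch of how the two can be computed and how a 100 nt versus a 200 nt cutoff changes the count. The function names and the toy list of contig lengths are hypothetical and were chosen only for illustration; they are not data from any of the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


def summarize(lengths, min_len):
    """Summarize an assembly after dropping contigs shorter than min_len."""
    kept = [n for n in lengths if n >= min_len]
    return {"contigs": len(kept), "bases": sum(kept), "n50": n50(kept)}


# Hypothetical contig lengths (nt), for illustration only.
toy_lengths = [3033, 1500, 900, 631, 400, 250, 180, 150, 120, 100]

if __name__ == "__main__":
    for cutoff in (100, 200):
        print(cutoff, summarize(toy_lengths, cutoff))
```

In this toy case the lower cutoff raises the contig count while adding few bases, which is the effect suggested above for assemblies that report contigs down to 100 nt.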
While the 454 system generates long reads, it suffers from low sequencing depth, whereas high depth is the distinctive advantage of the IGA system. Roche 454 also performs poorly over homopolymer regions. IGA performs better on those regions but has the disadvantage of generating short reads. A hybrid assembly of long and short reads would therefore compensate for the shortfalls of both sequencing systems and improve the quality of the assembly. Although we used IGA technology alone, by sequencing three lines of pepper and taking advantage of the increased number and length of reads per IGA lane (currently up to 120 nt), we were able to assemble >135 M nucleotides, which is 26 times more than any previously reported assembly. In addition to the number of bases assembled, the N50 of the transcriptome assembly in this study is twice that of assemblies made with pyrosequencing alone.
In the present study we also annotated the two pepper transcriptome assemblies. By contig count, 65% of the Sanger-EST assembly contigs and 35% of the IGA transcriptome assembly contigs were annotated. There are several reasons for the lower percentage of annotation in the IGA transcriptome assembly. One is that it contained more novel sequences than the Sanger-EST assembly. These new sequences did not have any hit in GenBank, and as a result the number of sequences that were not annotated increased. Contig length also contributes to lower annotation: since there were relatively more short contigs in the IGA transcriptome assembly than in the Sanger-EST assembly, the percentage of annotated sequences was lower. In addition, Sanger sequencing involves a cloning step in library construction, which favors higher copy number transcripts. This results in redundancy among annotated sequences, fewer unannotated sequences and poor sampling of single-copy sequences. Based on the number of annotated contigs, our results for the IGA analysis are similar to those of Lu et al. When annotation is measured by the number of assembled nucleotides rather than by the number of contigs, the two assemblies were quite comparable: 70% in the IGA transcriptome assembly vs. 82% in the Sanger-EST assembly. In the Sanger-EST assembly, 23% of the contigs, or 17.5% of the nucleotides, did not align to any homologous sequence in GenBank. These sequences are therefore either potential novel pepper transcripts or genes that were not previously characterized, or were simply too short for conclusive annotation. Not surprisingly, the annotations of the two assemblies presented here are very similar in terms of the species distribution of top hits. This is probably due to the bias in databases toward species that have more data and have been annotated better than others. At the time of analysis the tomato genome annotation was not available in the GenBank databases, which could be the reason why S. lycopersicum is not at the top of the species hit list.
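The difference between annotation rates counted per contig and per nucleotide can be illustrated with a short calculation. The sketch below assumes each contig is represented as a hypothetical (length, annotated) pair; the values are invented for illustration and are not taken from either assembly.

```python
def annotation_rates(contigs):
    """Return (percent of contigs annotated, percent of nucleotides annotated).

    `contigs` is a list of (length_nt, is_annotated) pairs.
    """
    total_contigs = len(contigs)
    total_bases = sum(length for length, _ in contigs)
    annotated_contigs = sum(1 for _, annotated in contigs if annotated)
    annotated_bases = sum(length for length, annotated in contigs if annotated)
    return (100 * annotated_contigs / total_contigs,
            100 * annotated_bases / total_bases)


# Hypothetical assembly: two long annotated contigs, four short unannotated ones.
toy_contigs = [
    (2000, True),
    (1500, True),
    (300, False),
    (250, False),
    (200, False),
    (150, False),
]

if __name__ == "__main__":
    by_contig, by_base = annotation_rates(toy_contigs)
    print(f"annotated contigs: {by_contig:.1f}%")
    print(f"annotated nucleotides: {by_base:.1f}%")
```

When the unannotated contigs are mostly short, the per-contig rate is much lower than the per-nucleotide rate, which is consistent with the pattern described above for the IGA transcriptome assembly.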
Another aspect of our study was to assign transcripts to different metabolic pathways. Generating KEGG maps and assigning enzymes to metabolic pathways is an effective way to identify candidate genes. In an ultra-saturated genetic map of pepper, contigs that span a QTL can be examined further for their role in one or more metabolic pathways. Annotated contigs in such a region can then be linked to the KEGG maps for the enzymes and metabolites involved, so that their function in controlling the trait can be investigated.
One of our goals in this project was to develop markers that can readily be used in breeding programs. We present here two sets of markers, SSRs and SNPs, for genetic and breeding analyses in pepper. The putative SNPs discovered in the Sanger-EST assembly were internally validated by KASPar assays in a genotyping panel of 43 pepper lines and accessions. These SNPs are deemed to be very robust and reliable, despite a lower sequence depth than that of the SNPs discovered in the IGA transcriptome assembly. We also observed a comparable SNP frequency in both assemblies (1/2,798 bp vs. 1/2,523 bp), indicating that SNP frequency in pepper transcriptomes is plausibly consistent across the methods and accessions used in different experiments. Coincidentally, the polymorphism among the three diverse lines, CM334, Early Jalapeño and Maor, was similar to that within the F1 hybrid Bukang.
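As a rough check on how close the two reported SNP frequencies are, they can be converted to a common unit. The sketch below uses only the two figures quoted above (one SNP per 2,798 bp and one per 2,523 bp); the order of the two figures is kept as reported, and the variable and function names are hypothetical.

```python
def snps_per_kb(bp_per_snp):
    """Convert 'one SNP per N bp' into SNPs per 1,000 bp."""
    return 1000 / bp_per_snp


# Frequencies as reported in the text, in the order given there.
first_bp_per_snp = 2798
second_bp_per_snp = 2523

# Ratio of the two spacings; a value near 1 means similar SNP density.
spacing_ratio = first_bp_per_snp / second_bp_per_snp

if __name__ == "__main__":
    print(f"{snps_per_kb(first_bp_per_snp):.3f} SNPs/kb")
    print(f"{snps_per_kb(second_bp_per_snp):.3f} SNPs/kb")
    print(f"spacing ratio: {spacing_ratio:.3f}")
```

The two spacings differ by roughly 11%, which is the sense in which the frequencies are described as comparable.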